Vector Models based on Normalized Frequencies
Improving Term Weights with Normalized Frequencies
Dr. E. Garcia
Mi Islita.com
Last Update: 10/27/06
Article 4 of the series Term Vector Theory and Keyword Weights
Topics
Term Weights and Keyword Spamming
Normalized Document Frequencies
Normalized Query Frequencies
Normalized Weights
The Glasgow Model
References
Term Weights and Keyword Spamming
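In the Term Count and Classic Term Vector models
Eq 1: Term Count, w_i = tf_i
Eq 2: Classic Term Vector, w_i = tf_i * log(D/df_i)
the weight w_i of a term i is considered proportional to its frequency, tf_i. In Eq 2, D is the number of documents in the collection and df_i is the number of documents containing term i.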
Since terms that occur many times are assigned more weight than terms repeated only a few times, both models are vulnerable to keyword spamming, an adversarial technique in which terms are intentionally repeated to improve a document's position in search engine rankings. In the process, ranking and retrieval are compromised. These term vector models can be made less susceptible to keyword spamming by normalizing document and query frequencies.
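As a rough illustration of the vulnerability, the sketch below scores two documents against a query using raw term counts (Eq 1) and a plain dot product. The documents, the query, and the function name term_count_score are made up for this example; it is not meant to represent how any particular search engine scores documents.

```python
from collections import Counter

def term_count_score(query: str, document: str) -> int:
    """Dot product of raw term counts (Eq 1: w_i = tf_i)."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    # Counter returns 0 for terms that are absent from the document.
    return sum(q[t] * d[t] for t in q)

# Hypothetical documents: one written normally, one stuffed with a keyword.
honest = "major league baseball playoffs start tonight"
spammed = "baseball baseball baseball baseball baseball tickets"

print(term_count_score("baseball playoffs", honest))   # 2
print(term_count_score("baseball playoffs", spammed))  # 5
```

The stuffed document outscores the honest one even though it matches only one of the two query terms.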
Normalized Document Frequencies
The normalized frequency of a term i in document j is given by
Eq 3: f_{i,j} = tf_{i,j} / max tf_{i,j}
where
f_{i,j} = normalized frequency of term i in document j
tf_{i,j} = frequency of term i in document j
max tf_{i,j} = frequency of the most frequent term in document j (the maximum is taken over all terms in j)
For example, consider a document with the following term counts
major, 1
league, 2
baseball, 4
playoffs, 5
Since playoffs occurs the most, the normalized frequencies are
major, 1/5 = 0.20
league, 2/5 = 0.40
baseball, 4/5 = 0.80
playoffs, 5/5 = 1.00
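The calculation above can be sketched in a few lines of Python. The function name normalize_document_frequencies and the dictionary representation of term counts are choices made for this illustration only.

```python
def normalize_document_frequencies(term_counts: dict[str, int]) -> dict[str, float]:
    """Eq 3: divide each term count by the largest count in the document."""
    if not term_counts:
        return {}
    max_tf = max(term_counts.values())
    return {term: tf / max_tf for term, tf in term_counts.items()}

counts = {"major": 1, "league": 2, "baseball": 4, "playoffs": 5}
print(normalize_document_frequencies(counts))
# {'major': 0.2, 'league': 0.4, 'baseball': 0.8, 'playoffs': 1.0}
```

Note that normalization caps a term's frequency at 1 rather than removing the effect of repetition altogether: padding a document with one term also lowers the normalized frequencies of every other term in it.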
Normalized Query Frequencies
The normalized frequency of a term i in a query Q is given by
Eq 4: f_{Q,i} = 0.5 + 0.5 * tf_{Q,i} / max tf_{Q,i}
where
f_{Q,i} = normalized frequency of term i in query Q
tf_{Q,i} = frequency of term i in query Q
max tf_{Q,i} = frequency of the most frequent term in query Q
For example, for the query Q = "major major league", the frequencies are
major, 2
league, 1
Since major occurs the most in the query, the normalized frequencies are
major, (0.5 + 0.5*2/2) = 1
league, (0.5 + 0.5*1/2) = 0.75
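The same query example can be reproduced with a short sketch. Splitting the query on whitespace and the function name normalize_query_frequencies are simplifying assumptions made here, not part of the model.

```python
from collections import Counter

def normalize_query_frequencies(query: str) -> dict[str, float]:
    """Eq 4: 0.5 + 0.5 * tf / max tf, for each term present in the query."""
    counts = Counter(query.lower().split())
    if not counts:
        return {}
    max_tf = max(counts.values())
    return {term: 0.5 + 0.5 * tf / max_tf for term, tf in counts.items()}

print(normalize_query_frequencies("major major league"))
# {'major': 1.0, 'league': 0.75}
```

Because of the 0.5 offset, every term present in the query ends up with a normalized frequency between 0.5 and 1, so repeating a query term changes its weight far less than it would under raw counts.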
Normalized Weights
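The weight of term i in document j can be written as
Eq 5: w_{i,j} = (tf_{i,j} / max tf_{i,j}) * log(D/df_i)
and the weight of term i in query Q can be written as
Eq 6: w_{Q,i} = (0.5 + 0.5 * tf_{Q,i} / max tf_{Q,i}) * log(D/df_i)
where log(D/df_i) is the inverse document frequency (IDF) of term i, the global weight used throughout this series.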
These weights are then used to compute document and query vectors.
[Note: Not everyone agrees with Eq 5 and 6. In particular, the normalized query frequency of Eq 4 receives a free 0.5 even when tf_{Q,i} = 0.]
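To make the weighting concrete, here is a minimal sketch that multiplies the normalized frequencies of Eq 3 and Eq 4 by an IDF factor. It assumes the global weight is log(D/df_i), with D the number of documents in the collection and df_i the number of documents containing term i, and it assumes base-10 logarithms. The collection statistics and function names are invented for illustration.

```python
import math

def idf(num_docs: int, doc_freq: int) -> float:
    """Assumed global weight: log10(D / df_i)."""
    return math.log10(num_docs / doc_freq)

def document_weight(tf: int, max_tf: int, num_docs: int, doc_freq: int) -> float:
    """Normalized document frequency (Eq 3) times IDF."""
    return (tf / max_tf) * idf(num_docs, doc_freq)

def query_weight(tf: int, max_tf: int, num_docs: int, doc_freq: int) -> float:
    """Normalized query frequency (Eq 4) times IDF."""
    return (0.5 + 0.5 * tf / max_tf) * idf(num_docs, doc_freq)

# Hypothetical collection: 1000 documents, 'baseball' appears in 10 of them.
D, df_baseball = 1000, 10
print(round(document_weight(4, 5, D, df_baseball), 2))  # 1.6
print(round(query_weight(0, 2, D, df_baseball), 2))     # 1.0
```

The second call shows the objection raised in the note: a term that does not occur in the query at all (tf = 0) still receives a nonzero weight, because of the 0.5 offset.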
The Glasgow Model
Over the years, several weighting schemes have been proposed. One interesting scheme is the one proposed by Mark Sanderson and Ian Ruthven in the Report on the Glasgow IR group (glair4) submission. They suggest an expression of the form
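Eq 7: w_{i,j} = [log(tf_{i,j} + 1) / log(length_j)] * log(D/df_i)
where length_j is the length of document j.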
This scheme could be applied to documents and queries by
- using normalized frequencies as defined in Eq 3.
- defining the length of documents and queries as number of terms, excluding stopwords.
The main advantage of this scheme is that overly long documents and queries can be penalized. Note that, apart from slight differences in notation, the models we have discussed use the same global weighting scheme, i.e., IDF. In future articles, I plan to discuss other modifications made to the Vector Space Model.
References
- Baeza-Yates, R., Ribeiro-Neto, B.; Modern Information Retrieval; Addison Wesley, 1999.
- Salton, G., Buckley, C.; Term-weighting approaches in automatic text retrieval; Information Processing & Management 24(5):513-523, 1988.
- Sanderson, M., Ruthven, I.; Report on the Glasgow IR group (glair4) submission.

