Vector Models based on Normalized Frequencies

Improving Term Weights with Normalized Frequencies

Dr. E. Garcia
Mi Islita.com
Last Update: 10/27/06

Article 4 of the series Term Vector Theory and Keyword Weights

Topics

Term Weights and Keyword Spamming

Normalized Document Frequencies

Normalized Query Frequencies

Normalized Weights

The Glasgow Model

References

Term Weights and Keyword Spamming

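In the Term Count and Classic Term Vector models

Eq 1: Term Count, w_i = tf_i
Eq 2: Classic Term Vector, w_i = tf_i * log(D / df_i)

the weight w_i of a term i is considered proportional to its frequency, tf_i. In Eq 2, D is the total number of documents in the collection and df_i is the number of documents that contain term i.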

Since terms with many occurrences are assigned more weight than terms repeated only a few times, the two models are vulnerable to keyword spamming, an adversarial technique in which terms are intentionally repeated to improve the position of a document in search engine ranking results. In the process, ranking and retrieval are compromised. These term vector models can be made less susceptible to keyword spamming by normalizing document and query frequencies.
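
To make the vulnerability concrete, here is a minimal Python sketch of how both weights grow as a term is repeated. The function names, the base-10 logarithm, and the collection statistics (D = 1000 documents, df = 10) are illustrative assumptions, and the classic weight is taken to have the usual tf * log(D / df) form.

```python
import math

def term_count_weight(tf):
    """Eq 1: the weight is simply the raw term frequency."""
    return tf

def classic_weight(tf, num_docs, doc_freq):
    """Eq 2 (assumed form): raw frequency times log10(D / df)."""
    return tf * math.log10(num_docs / doc_freq)

# Hypothetical collection statistics, chosen only for illustration.
D, DF = 1000, 10

# Repeating a term 50 times multiplies both weights by 50.
for tf in (1, 5, 50):
    print(tf, term_count_weight(tf), classic_weight(tf, D, DF))
```

Under these assumptions both weights scale linearly with the number of repetitions, which is exactly what a keyword spammer exploits.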

Normalized Document Frequencies

The normalized frequency of a term i in document j is given by

Eq 3: f_{i,j} = tf_{i,j} / max_l tf_{l,j}

where

f_{i,j} = normalized frequency of term i in document j
tf_{i,j} = frequency of term i in document j
max_l tf_{l,j} = maximum frequency of any term l in document j

For example, consider a document consisting of the following term counts

major, 1
league, 2
baseball, 4
playoffs, 5

Since playoffs occurs the most, the normalized frequencies are

major, 1/5 = 0.20
league, 2/5 = 0.40
baseball, 4/5 = 0.80
playoffs, 5/5 = 1
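
The same calculation can be sketched in a few lines of Python. The function name and the dictionary layout are assumptions made for this illustration; the counts are the ones from the example above.

```python
def normalize_doc_frequencies(term_counts):
    """Eq 3: divide each term count by the largest count in the document."""
    max_tf = max(term_counts.values())  # raises ValueError on an empty document
    return {term: tf / max_tf for term, tf in term_counts.items()}

doc = {"major": 1, "league": 2, "baseball": 4, "playoffs": 5}
normalized = normalize_doc_frequencies(doc)

for term, f in normalized.items():
    print(f"{term}, {f:.2f}")
# major, 0.20
# league, 0.40
# baseball, 0.80
# playoffs, 1.00
```

Note that repeating playoffs further would leave its normalized frequency at 1 and only push the other terms down, so the most frequent term can no longer gain weight by repetition alone.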

Normalized Query Frequencies

The normalized frequency of a term i in a query Q is given by

Eq 4: f_{Q,i} = 0.5 + 0.5 * (tf_{Q,i} / max_l tf_{Q,l})

where

f_{Q,i} = normalized frequency of term i in query Q
tf_{Q,i} = frequency of term i in query Q
max_l tf_{Q,l} = maximum frequency of any term l in query Q

For example, for the query Q = "major major league" the term frequencies are

major, 2
league, 1

Since major occurs the most in the query, the normalized frequencies are

major, (0.5 + 0.5*2/2) = 1
league, (0.5 + 0.5*1/2) = 0.75
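
A minimal Python sketch of the query-side normalization follows. The function name and the whitespace tokenization are simplifying assumptions for this illustration; a real system would also handle punctuation and stopwords.

```python
from collections import Counter

def normalize_query_frequencies(query):
    """Eq 4: 0.5 + 0.5 * tf / max tf for every term in the query."""
    counts = Counter(query.lower().split())  # naive whitespace tokenizer
    max_tf = max(counts.values())
    return {term: 0.5 + 0.5 * tf / max_tf for term, tf in counts.items()}

print(normalize_query_frequencies("major major league"))
# {'major': 1.0, 'league': 0.75}
```

Because of the additive 0.5, every query term ends up with a normalized frequency between 0.5 and 1, however often it is repeated.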

Normalized Weights

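The weight of term i in document j can be written as

Eq 5: w_{i,j} = (tf_{i,j} / max_l tf_{l,j}) * log(D / df_i)

and the weight of term i in query Q can be written as

Eq 6: w_{Q,i} = (0.5 + 0.5 * tf_{Q,i} / max_l tf_{Q,l}) * log(D / df_i)

Here log(D / df_i) is the inverse document frequency (IDF) of term i, with D the number of documents in the collection and df_i the number of documents containing term i. These weights are then used to compute document and query vectors.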

[Note: Not everyone agrees with Eq 5 and 6. In particular, the normalized query frequency receives a "free" value of 0.5 even when tf_{Q,i} = 0.]
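
As a rough sketch, the two weights can be computed as follows, assuming each is the normalized frequency multiplied by an IDF of the form log(D / df). The function names, the base-10 logarithm, and the collection statistics in NUM_DOCS and DOC_FREQS are hypothetical values chosen for illustration only.

```python
import math
from collections import Counter

def idf(num_docs, doc_freq):
    """Inverse document frequency, assumed here to be log10(D / df)."""
    return math.log10(num_docs / doc_freq)

def document_weights(term_counts, doc_freqs, num_docs):
    """Document side: normalized document frequency times IDF."""
    max_tf = max(term_counts.values())
    return {t: (tf / max_tf) * idf(num_docs, doc_freqs[t])
            for t, tf in term_counts.items()}

def query_weights(query_terms, doc_freqs, num_docs):
    """Query side: (0.5 + 0.5 * tf / max tf) times IDF."""
    counts = Counter(query_terms)
    max_tf = max(counts.values())
    return {t: (0.5 + 0.5 * tf / max_tf) * idf(num_docs, doc_freqs[t])
            for t, tf in counts.items()}

# Hypothetical collection statistics (not taken from the article).
NUM_DOCS = 1000
DOC_FREQS = {"major": 100, "league": 50, "baseball": 20, "playoffs": 10}

doc_w = document_weights(
    {"major": 1, "league": 2, "baseball": 4, "playoffs": 5}, DOC_FREQS, NUM_DOCS)
query_w = query_weights(["major", "major", "league"], DOC_FREQS, NUM_DOCS)

print(doc_w)
print(query_w)
```

The resulting dictionaries are sparse document and query vectors keyed by term.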

The Glasgow Model

Over the years, several weighting schemes have been proposed. One interesting scheme is the one proposed by Mark Sanderson and Ian Ruthven in the Report on the Glasgow IR group (glair4) submission. They suggest an expression of the form

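Eq 7: w_{i,j} = [log(tf_{i,j} + 1) / log(length_j)] * log(D / df_i)

where length_j is the length of document j, D is the number of documents in the collection, and df_i is the number of documents containing term i.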

This scheme could be applied to documents and queries by

  1. using normalized frequencies as defined in Eq 3.
  2. defining the length of documents and queries as number of terms, excluding stopwords.

The main advantage of this scheme is that overly long documents and queries can be penalized. Note that, apart from slight deviations in notation, the models we have discussed use the same global weighting scheme; i.e., IDFs. In future articles, I plan to discuss other modifications made to the Vector Space Model.
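
The length penalty can be illustrated with a short Python sketch. It assumes the Glasgow weight has the form [log(f + 1) / log(length)] * log(D / df), with f the normalized frequency of Eq 3 and length the number of non-stopword terms, as suggested in the two steps above. The function name, the base-10 logarithm, and all numeric values are hypothetical.

```python
import math

def glasgow_weight(tf, max_tf, length, num_docs, doc_freq):
    """Assumed Glasgow form: log(f + 1) / log(length) * log(D / df)."""
    if length <= 1:
        # log(length) would be zero or undefined, so the ratio breaks down.
        raise ValueError("length must be greater than 1")
    f = tf / max_tf  # normalized frequency, as in Eq 3
    return (math.log10(f + 1) / math.log10(length)
            * math.log10(num_docs / doc_freq))

# Same term statistics, two hypothetical document lengths.
short_doc = glasgow_weight(tf=5, max_tf=5, length=12, num_docs=1000, doc_freq=10)
long_doc = glasgow_weight(tf=5, max_tf=5, length=1200, num_docs=1000, doc_freq=10)

print(short_doc, long_doc)
print(short_doc > long_doc)
```

With everything else held equal, the longer document receives the smaller weight, because the weight is divided by the log of the length.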

Next: Implementation and Application of Term Weights in a MySQL Environment

Prev: The Classic Term Vector Model

References
  1. R. Baeza-Yates, B. Ribeiro-Neto; Modern Information Retrieval; Addison Wesley, 1999.
  2. G. Salton, C. Buckley; Term-weighting approaches in automatic text retrieval; Information Processing & Management 24(5):513-523, 1988.
  3. M. Sanderson, I. Ruthven; Report on the Glasgow IR group (glair4) submission.

Copyright © 2006 Mi Islita.com