Implementation and Application of Term Weights in a MySQL Environment
MySQL Implementation of Local, Global and Normalized Term Weights
Dr. E. Garcia
Mi Islita.com
Email | Last Update: 10/27/06
Article 5 of the series Term Vector Theory and Keyword Weights
Topics
Term Weights
MySQL Implementation
Local Weights
Global Weights
Normalized Weights
Negative Weights
Acknowledgements
References
Term Weights
Term weights can be described by a function of the form
Equation 1: wi, j = F(Li, j, Gi, Nj)
which can be represented as
Equation 2: 
Equation 1 and 2 account for local (Li, j), global (Gi) and normalized (Nj) information.
Here
- Li, j is the local weight for term i in document j
- Gi is the global weight for term i in the entire database
- Nj is the normalization factor for document j
Local weights are functions of how many times each term occurs in a document, global weights are functions of how many times documents containing each term appears in the collection, and the normalization factor corrects for discrepancies in the lengths of the documents.
In the classic "tf*IDF" vector space model

which reduces to the well-known tf*IDF weighting scheme
Equation 6:
where log(D/di) is the Inverse Document Frequency or IDF, D is the number of documents in the collection (the database size) and di is the number of documents containing term i. Equation 6 is just one of many term weighting schemes found in the literature. Models conforming to this equation are susceptible to keywords spam since these rely on term counts (i.e., the tfi,j term) and ignore document lengths. Spam abuse and precision-recall issues have lead to the development of other term weighting schemes. These differ in the way the L, G and N terms are defined.
Erica Chisholm and Tamara G. Kolda (Oak Ridge National Labs) have summarized many of these in New Term Weighting Formulas for the Vector Space (1). In addition, Ian Ruthven and Mounia Lalmas have written A survey on the use of relevance feedback for information access systems (2). Last but not least, Dik L. Lee, Huei Chuang and Kent Seamons (Hong Kong University) have compared the performance of several term vector models in their seminal paper Document Ranking and the Vector-Space Model (3). In the next paragraphs I describe very briefly the MySQL implementation of the Vector Space Model.
MySQL Implementation
A trademark of MySQL AB, Inc, MySQL is the world's most popular open source database. The company has published an internal manual for developers interested in deploying their database environment. If you are interested in optimizing a MySQL application or in learning more about text search applications, visit the MySQL Internals Manual :: 4.7 Full-text Search section (4).
In the MySQL implementation a typical application uses database tables consisting of N rows. Each row corresponds to a document, where
- Li, j = (log(dtf)+1)/sumdtf; i.e., local information based on logarithmic term counts
- Gi = log((N-nf)/nf); i.e., global information based on probabilistic IDF
- Nj = U/(1+0.0115*U); i.e., normalization based on pivoting
Here,
dtf = number of times the term appears in the document (row)
sumdtf = sum of (log(dtf)+1)'s for all terms in the same document (row)
U = number of unique terms in the document (row)
N = total number of document rows
nf = number of documents (rows) containing the term
After rearranging some terms Equation 2 is rewritten as
Equation 7: w = (log(dtf)+1)/sumdtf * U/(1+0.0115*U) * log((N-nf)/nf)
which is the expression described in the developer's manual under Full-text Search. The MySQL developer's manual shows an example in which Equation 7 is used to compute term weights for words appearing in the title and body of documents. This example illustrates how things work.
Local Weights
MySQL defines Li, j as (log(dtf)+1)/sumdtf, where sumdtf is the sum of (log(dtf)+1)'s for all terms in the same document. Logarithms are used to adjust for within-document occurrences and word importance since words that occur x times in a document are not necessarily x times as important as words that occur once in the same document. This is a clever alternative to the BNRY, FREQ, LOGA, LOGN, and ATF1 weight representations described in Table 3, Reference 1.
So, if the title of a document is MySQL Tutorial and its content is just the phrase DBMS stands for DataBase
- unique terms without stopwords = 5; i.e., mysql, tutorial, dbms, stands, database
- (log(dtf)+1) = (log(1)+1) = 1 for each term
- sumdtf = 1 + 1 + 1 + 1 + 1 = 5
- Li, j = 1/5 = 0.2 for each term
However, if we change the title to read MySQL DBMS Tutorial
- unique terms without stopwords = 5; i.e., mysql, tutorial, dbms, stands, database
- (log(dtf)+1) = (log(2)+1) = 1.301 for dbms and 1 for the other terms.
- sumdtf = 1 + 1 + 1.301 + 1 + 1 = 5.301
- Li, j = 1.301/5.301 = 0.2454 for dbms and 1/5.301 = 0.1886 for the other terms.
Global Weights
MySQL defines global weights, Gi, as log((N-nf)/nf) --some IR authors prefer the log((D - d)/d) notation. In Table 4, Reference 1 Gi is called probabilistic inverse document frequency (IDFP).
So, if in our previous example the document is part of a collection consisting of N = 6 documents and only two documents contain the word "tutorial", the global weight for "tutorial" in the collection would be
- if using IDFP, Gi = log((N-nf)/nf) = log((6-2)/2) = 0.6931
- if using IDF, Gi = log(N/nf) = log((6/2) = 0.4771
The whole idea of using IDFP is to make a better guess of the probability that a term will be relevant. This is done by giving a "discrimination value" to each term; i.e., the less frequently terms appear in a database, the more discriminating they are. Their discriminating power decreases as they start to appear in too many documents across the database. If you are using MySQL and want to switch to the old IDF definition (Equation 6), you may want to check the user's manual instructions:
"To go back to the old system, look in myisam/ftdefs.h for "#define GWS_IN_USE GWS_PROB" (i.e. global weights by probability) and change it to "#define GWS_IN_USE GWS_IDF" (i.e. global weights by inverse document frequency)".
Normalized Weights
In most term vector schemes document vectors are normalized to correct discrepancies in document lengths so that documents are retrieved independent of their lengths. In the classic Vector Model this is done using cosine normalization (COSN), which forces the magnitude of the weighted document vector to be one and allows one to compare the angle between the weighted vectors. However, this creates a problem: longer documents are given smaller term weights and smaller documents are favored over longer ones.
Pivoted Unique Normalization (PUQN) tries to correct for discrepancies based on document length between the probability that a document is relevant and the probability that the document will be retrieved. The point at which the resultant precision and recall curves intersect is the pivot. Documents on the left side of the pivot generally have a higher probability of being retrieved than they have of being relevant. Documents on the right side of the pivot tend to have a higher probability of being relevant than they have of being retrieved. The normalization factor can now be pivoted at the pivot and adjusted so that Nj can be increased or decreased to better match the probabilities of relevance and retrieval (1).
In the MySQL implementation, PUQN is defined as Nj = U/(1+0.0115*U), where U is the length of a document and defined as the number of unique terms present in the document. If U is shorter than the average length (computed over the entire collection of documents) the weight of the document increases, if it's equal to the average length then its weight stays the same, and if it's longer than the average length then its weight decreases.
The 0.0115 quantity is a pivot value and is defined by the user (see PIVOT_VAL in the MySQL source code header file myisam/ftdefs.h). For a technical discussion on pivoted normalization see Reference 1 or Pivoted Document Length Normalization, by Amit Singhal, Chris Buckley, and Mandar Mitra.
Thus, in the example described,
- U = 5; i.e., mysql, tutorial, dbms, stands, database
- Nj = U/(1+0.0115*U) = 5/(1+0.0115*5) = 4.7281
and from Equation 2 we obtain
wi, j = Li, j*Gi*Nj = 0.20*0.6931*4.7281 = 0.6554
Note It is important to point out that in the MySQL implementation the Li, j*Nj product is stored in the column of the articles table generated from the index. In our example,
Li, j*Nj = 0.20*4.7281 = 0.9456
Finally, global weights are computed from the rows of the articles table and all weights evaluated. Once this is done, the Rank or Relevance (R) is computed as the product of the weight and the frequency of the word in the query (qf); i.e., R = w * qf. In our example, if tutorial appears once in the query, then the rank for the document is
R = wi, j*qf = 0.6554*1 = 0.6554
Negative Weights
Some search engine marketers are familiar with expressions like "negative weights", "negative terms", and similar expressions. Here we are not talking about terms that need to be excluded from a search by using the negative ("-") query operator and that Google and other search engines refer to as "negative terms" (5).
We are actually talking about terms that affect relevance scores in a collection. Where does this effect come from? Why is that querying certain terms seem to impact negatively the retrieval of some documents? Why does this effect occurs in some collections but not in others?
This effect can be explained in terms of probabilistic inverse document frequency values. In IDFP the number of documents containing a term (nf) are substracted from the total number of documents in the collection (N); i.e., N - nf. In Figure 1, we have plotted nf (x-axis) against IDFP (y-axis) for a database collection consisting of 100 documents.

Figure 1. Behavior of a probabilistic inverse document frequency function.
Evidently,
- positive global weights are assigned when nf < N/2
- zero global weights are assigned when nf = N/2
- negative global weights are assigned when nf > N/2
Thus, when nf > N/2 negative weights are assigned to terms appearing in more than half of the documents in the collection. This explains why some terms in a query are deemed as having a "negative effect" when a given collection is queried.
This maybe of interest for marketers retrieving or positioning documents in a given search system. Understanding if a system uses IDFP or weighting systems based on MySQL or similar implementations comes handy. Knowing which terms are overused and are deemed as "negative" in a given collection can make a difference when web documents are focused or optimized for a given term, phrase or sequence of terms. It is important to point out that the nature of this effect is global and has nothing to do with the non sense fallacy known as keyword density (6). Ultimately, keyword density is a mere local ratio disconnected from document relevancy.
In addition, the negative effect I am describing is just a math behavior and not the result of using query operators or additional spam filters/penalty measures. Certainly, this doesn't mean these filters/penalties could not be in place in the target system. What we need to underscore is that negative global weights can be assigned and are assigned by systems using IDFP.
Last but not least, terms deemed as "negative" in a database collection might not be so in others for two reasons: (a) nf could still be below the N/2 mark or (b) the system does not implement IDFP.
Related Material
Some readers asked if negative weights could be avoided. The answer is yes, to a certain point. In some implementations, the use of negative weights introduce new complications during retrieval. For this reason, some arbitrarily set all negative weights to zero. On a side note, here is a nice article about the BM25 implementation, Microsoft Cambridge at TREC-12: HARD track (7), which discusses the concept of negative phrases. Also, Asymmetric Missing-data Problems: Overcoming the Lack of Negative Data in Preference Ranking, by Aleksander Kocz and Joshua Alspector (8), discusses how the EM algorithm could be used in incomplete data problems.
Acknowledgements
The author thanks MySQL AB, Inc, for referencing this series in MySQL Internals, Manual :: 4.7 Full-text Search.
Next: The Extended Boolean Model
Prev: Vector Models based on Normalized Frequencies
References
- New Term Weighting Formulas for the Vector Space; Erica Chisholm and Tamara G. Kolda; Oak Ridge National Labs (1999).
- A survey on the use of relevance feedback for information access systems; Ian Ruthven and Mounia Lalmas.
- Document Ranking and the Vector-Space Model; Huei Chuang and Kent Seamons.
- MySQL Internals Manual :: 4.7 Full-text Search; MySQL AB, Inc. See also Peter Gulutzan's great article MySQL's Full-Text Formulas.
- Google Help Center; Google, Inc.
- The Keyword Density of Non-Sense; E. Garcia.
- Microsoft Cambridge at TREC-12: HARD track
- Asymmetric Missing-data Problems: Overcoming the Lack of Negative Data in Preference Ranking

