The Extended Boolean Model
The Extended Boolean Model - Weighted Queries: Term Weights, p-Norm Queries and Multiconcept Types. Boolean OR Extended? AND that is the Query.
Dr. E. Garcia
Mi Islita.com
Email | Last Update: 10/27/06
Article 6 of the series Term Vector Theory and Keyword Weights
Topics
The OR Nature of Conventional Term Vector Models
The Extended Boolean Model
Boolean OR: Weights and Euclidean Distances
Boolean AND: Weights and Euclidean Distances
Normalized Similarity Scores
The p-Norm
AND/OR Combinations
Query Weights
Acknowledgements
References
The OR Nature of Conventional Term Vector Models
"- Boolean OR Extended? AND that is the query."
A search using the Boolean OR operator (e.g., euclidean OR distance) considers a document relevant if it contains at least one of the terms specified in the query. As Professors David Grossman and Ophir Frieder explain in the second edition of Information Retrieval: Algorithms and Heuristics (read my review of their book),
"The conventional vector space model, implicitly computes a ranking that is essentially an OR of the document terms. Any document that contains at least one of the terms in the query is ranked with a score greater than 0." (1, 2)
Evidently conventional term vector models are intrinsically OR models, where it is assumed that terms are not related. This is not reasonable since sometimes terms are related (thus, the associated DOT Products are not zero).
To top off, Boolean models are limited in their scope. For instance, the models assume that
- all terms in a query are equally weighted; thus, there is no need or provision for assigning a degree of importance to terms in a query.
- for OR (ANY) searches, documents containing at least one of the query terms are as relevant as documents containing all of the query terms.
- for AND (FINDALL) searches, documents not containing at least one of the query terms are as irrelevant as documents containing none of the query terms.
As we can see, conventional Boolean models are a kind of binary approach in which documents are considered either relevant ("1") or irrelevant ("0"). This approach for scoring relevancy is shortsighted since for users the notion of relevancy is not static but dynamic; i.e., for a given user, what is irrelevant today may be relevant tomorrow and vice versa. In addition, the degree of relevancy can be subject to synonymity associations between terms.
For example, two documents may convey the same relevant information even if they don't share the same key words or key phrases. Moreover, at the level of individual words, terms can experience a phenomenon known as in-transit relevancy. Consider two terms A and B. They may not be relevant to each other, but they may be relevant to a third term, C. Should we deem as irrelevant a document that contains all three terms? To what degree the document is relevant (or irrelevant)?
The Extended Boolean Model
To overcome these limitations, Salton, Fox and Wu introduced in 1983 the Extended Boolean Information Retrieval Model. The model is the doctoral thesis of Edward A. Fox, Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types (3, 4). Professor Fox graciously emailed me references of his outstanding work, which I am including in this article (3 - 16).
Fox has summarized the main features of his model in Extended Boolean Queries and Retrieval (5). Essentially the Extended Model
- consider all of the terms in a query.
- adjust the strictness of each AND or OR query operator with a p-value.
- proposes a general model, p-norm, that has as special cases the standard Boolean model (with fuzzy set interpretation --- when p is infinity) and the vector-space model (with inner-product similarity --- when p is one).
- gets a spectrum of models with decreasing strictness, i.e., strict AND ... soft AND ... vector ... soft OR ... strict OR:
- p-norm AND with p=infinity behaves like strict Boolean AND (i.e., MIN)
- p-norm AND with p at moderate values softens the strictness of the AND
- p-norm AND with p=1 behaves like p-norm OR with p=1 and behaves like vector space model
- p-norm OR with p at moderate values softens the strictness of the OR
- p-norm OR with p=infinity behaves like strict Boolean OR (i.e., MAX)
- uses L-p family of norms to compute similarity by measuring:
- distance from 0 point (i.e., none of query terms present) for OR;
- 1 - distance from 1 point (i.e., all of query terms present) for AND.
- visualizes all this with equi-similarity contours at fixed p-values.
Perhaps the best way of understanding the Extended Boolean Model is by visualizing its implementation with some graphics.
Boolean OR: Weights and Euclidean Distances
In Fox's Extended Boolean Model term weights are scaled between 0 and 1. This is accomplished by using tf*idf products in which both term frequencies, tf, and inverse document frequencies, idf are normalized. Thus, the weight w of term i in document j is given by

and where
wi, j = weight of term i in document j
tfnorm i, j = normalized term frequency of term i in document j
tfi, j = term frequency of term i in document j
tfmax i, j = maximum term frequency of term i in document j
idfi = inverse document frequency of term i in the collection c
idfmax g = maximum inverse document frequency of a generic term g in the collection c
idfnorm i = normalized inverse document frequency of term i in the collection c
This has important implications for multiple term queries. Consider, for example, a query Q consisting of two terms, k1 and k2. A term space representation of this query contains two extreme points:
- one at (1, 1) corresponding to documents completely relevant; i.e., containing both terms.
- one at (0, 0) corresponding to documents completely irrelevant; i.e, containing neither terms.
In this representation, the maximum Euclidean Distance or maximum displacement (dmax) between the two points is 1.41. This is illustrated in Figure 1.
Figure 1. Maximum Euclidean Distance, dmax, in a two-dimensional term space.
Consequently, for this OR query the Euclidean Distance of a document at (wk1, wk2) must be d < 1.41. See Figure 2.
Figure 2. "OR" Euclidean Distance in a two-dimensional term space.
I refer to this as an "OR" distance just to distinguish this case from a query using the Boolean AND operator. We can take this distance from origin as a similarity measure and rank documents, with those closer to the origin being the least relevant one.
Boolean AND: Weights and Euclidean Distances
For AND queries, a similarity measure can be computed by substracting the Euclidean Distance between the (1, 1) and (wk1, wk2) points from the maximum distance, dmax. See Figure 3.
Figure 3. "AND" Euclidean Distance in a two-dimensional term space.
Normalized Similarity Scores
To compare similarity scores for a variety of scenarios we need to normalize all distances by dividing by dmax. Thus, for OR and AND queries we obtain
Figure 4. Normalized similarity scores.
The p-Norm
Thus, for any number of terms, m, we obtain expressions that depend on a p-parameter. This is the p-norm.

Figure 5. The p-Norm.
The p-Norm posses interesting properties. When p = 1 a simple vector space model is obtained. The distinction between OR and AND disappears and the similarity between queries and documents can be computed by the inner product between their weights, as done by conventional term vector similarity formulas. When p = infinity, AND queries behave like strict Boolean AND and OR queries behave like strict Boolean OR.
AND/OR Combinations
By varying the p-Norm between 1 and infinity, we can vary the p-Norm ranking behavior from a vector-like to a Boolean-like ranking. Furthermore, as noted by Baeza-Yates and Ribeiro-Neto (17) the p-Norm allows for combination of AND/OR queries by recursively grouping operators. For example, for a query Q = k1 AND k2 OR k3 the similarity between a document and this query can be computed as shown in Figure 6,
Figure 6. Computing similarity scores for a query Q = k1 AND k2 OR k3.
where the k1 AND k2 part of the query is treated as a single term k* that is part of an OR query. Here we instruct the system to return documents containing k* OR k3.
Query Weights
Finally, to include query weights we multiply term weights times query weights. Thus, for a query consisting of only two terms the similarity score becomes

Figure 7. Including Query Weights
From Figure 7 and several reviews (18), when p = infinity
- this reduces to a normal Boolean system where term weights are not included. In addition, a p value of infinity gives the MIN or MAX value of the document.
- the AND operator represents a strict phrase assignment; i.e. if all of the query terms are not present, then disregard the document.
- the OR operator implements a strict thesaurus; i.e., any term being present selects the document.
Moreover, when
- p = 1, this is equivalent to the DOT Product as computed by the Vector Space Model; all terms are independent - AND's and OR's have no distinction.
- p = 2, this is equivalent to a normalized Euclidean Distance.
- p = 3, with AND's and OR's, more terms are more important than less terms.
- For q = 1 in each case, we recover the definition of similarity scores given in Figure 4. (a, b)
(a) While reviewing the 2nd Edition of Grossman and Frieder's "Information Retrieval" and writing this article I found in page 69 of the text, what appears to be a misplacement error. This is found in the equation for similarity scores using the AND operator and shown in Figure 7. The p-norm exponent is given inside parentheses; i.e. (1 - w p) when it should go outside of these; i.e., (1 - w) p.
(b) Few days ago, we informed the authors about this finding. They were forthcoming in acknowledging the error.
Acknowledgements
The author thanks Professor Edward A. Fox for providing him with references to the Extended Boolean Model. The author also want to thanks Professor Grossman for acknowledging the findings given in (a) and for additional feedback.
Prev: Implementation and Application of Term Weights in a MySQL Environment
References
- D. Grossman and O. Frieder, Information Retrieval: Algorithms and Heuristics; Chapter 2, p. 69, Springer (2004).
- Information Retrieval: Algorithms and Heuristics - A Book Review; E. Garcia (2005).
- G. Salton, E. Fox, and H. Wu. Extended Boolean Information Retrieval. Communications of the ACM, 1983, 26(11): 1022-1036.
- E. Fox. Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types. Cornell University dissertation, Aug. 1983. Available from: University Microfilms International, Ann Arbor, Michigan. See also Professor Edward A. Fox's page.
- Extended Boolean Queries and Retrieval; Edward A. Fox.
- G. Salton, E. Fox, and E. Voorhees. Advanced Feedback Methods in Information Retrieval. JASIS, 1985, 36(3): 200-210.
- G. Salton, E. Voorhees, and E. Fox. A Comparison of Two Methods for Boolean Query Relevance Feedback. IP&M, 1984, 20(5-6): 637-651.
- G. Salton, C. Buckley, and E. Fox. Automatic Query Formulations in Information Retrieval. JASIS, 1983, 34(4): 262-280.
- E. Fox and M. Koll. Practical Enhanced Boolean Retrieval: Experiences with the SMART and SIRE Systems. IP&M, 24(3): 257-267, 1988.
- E. Fox, S. Betrabet, M. Koushik, and W. Lee. Extended Boolean Models. In Information Retrieval: Data Structures & Algorithms, eds. W. Frakes & R. Baeza-Yates, Prentice-Hall, 1992, 393-418. Also on CD-ROM published by Dr. Dobb's Journal.
- G. Salton, E. Fox, and H. Wu. An Automatic Environment for Boolean Information Retrieval. In Information Processing 83 (Proc. 1983 IFIP Paris Congress), R.E.A. Mason (ed.), North-Holland, 1983, 755-762.
- W. Lee and E. Fox. Experimental Comparison of Schemes for Interpreting Boolean Queries, TR-88-27, VPI&SU Comp. Science Dept., September 1988, Blacksburg, VA.
- E. Fox and S. Sharan. A Comparison of Two Methods for Soft Boolean Operator Interpretation in Information Retrieval, TR-86-1, VPI&SU Computer Science Dept., Jan. 1986, Blacksburg, VA.
- G. Salton, E. Fox and E. Voorhees. Advanced Feedback Methods in Information Retrieval. TR 83-570. Cornell Univ. Dept. of Computer Science, Aug. 1983, Ithaca, NY.
- G. Salton, E. Fox and E. Voorhees. A Comparison of Two Methods for Boolean Query Relevance Feedback. TR 83-564, Cornell Univ. Dept. of Computer Science, July 1983, Ithaca, NY.
- G. Salton, C. Buckley and E. Fox. Boolean Query Formulation with Relevance Feedback. TR 83-539, Cornell Univ. Dept. of Computer Science, Jan. 1983, Ithaca, NY.
- R. Baeza-Yates, B. Ribeiro-Neto; Modern Information Retrieval Chapter 2, p. 40; Addison-Wesley (1999).
- Reviews of the Exended Boolean Information Retrieval Paper; Several succinct reviews.

