Mi Islita

Overlapping Patterns:
EF-Ratios, Separators, Patterns and Pitfalls

"Overlapping patterns within overlapping patterns! It takes few seconds to realize that query mode implementations can exhibit a fractal component."

Dr. E. Garcia
Mi Islita.com
Last Update: 06/16/05

Article 3 of the series The Fractal Nature of Semantics

Topics

FINDALL and EXACT Modes

EF-Ratios Defined

Venn Diagrams

Separators and EXACT Queries

Implications for Search Engine Optimization

Patterns and Overlapping Regions

Pitfalls SEMs/SEOs Should Avoid

References

FINDALL and EXACT Modes

At Jupiter Media's Search Engine Strategies 2005 Conference & Expo (Feb 28 - March 3) I formally presented the notion of C-Indices and EF-Ratios to the search engine marketing industry. I also mentioned that in most search engines (Google, Yahoo!, MSN, etc) the default query mode is FINDALL, also known as AND.

Unlike OR (ANY), in which retrieved documents must contain any of the query terms, or EXACT, in which the documents must contain a queried sequence of terms, in FINDALL the documents retrieved must contain all query terms, regardless of order and proximity (1, 2).

This means, for example, that a query (Q) consisting of two terms (k1 + k2) should retrieve documents containing all query terms (in this case, k1 and k2) in no particular order. The terms can appear anywhere in the document descriptors and identifiers (title, meta tags, url, links, body, element attributes, etc).

However, if one surrounds k1 and k2 with quotes (Q = "k1 + k2"), most search systems recognize this as an advanced search conducted in EXACT mode. In this mode order and proximity do matter. Thus, the system must return documents relevant to the exact k1 k2 sequence. Therefore, the following are different queries and should retrieve different sets of results:

Evidently,

OR (ANY) results > FINDALL (AND) results > EXACT (quoted) results
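The ordering above can be checked on a toy collection. The sketch below is only an illustration under stated assumptions, not the implementation of any search engine: the four documents, the naive whitespace tokenizer and the function names are all made up for the example.

```python
# Toy illustration of the OR (ANY), FINDALL (AND) and EXACT query modes.
# The corpus and the whitespace tokenizer are assumptions for this sketch.

def tokenize(text):
    return text.lower().split()

def match_any(doc, terms):
    tokens = set(tokenize(doc))
    return any(t in tokens for t in terms)

def match_findall(doc, terms):
    tokens = set(tokenize(doc))
    return all(t in tokens for t in terms)

def match_exact(doc, terms):
    # The terms must appear next to each other, in the queried order.
    tokens = tokenize(doc)
    n = len(terms)
    return any(tokens[i:i + n] == list(terms) for i in range(len(tokens) - n + 1))

docs = [
    "discount hotels in new york",
    "hotels offering a discount",
    "cheap hotels downtown",
    "discount airfare deals",
]
query = ("discount", "hotels")

any_hits = [d for d in docs if match_any(d, query)]
findall_hits = [d for d in docs if match_findall(d, query)]
exact_hits = [d for d in docs if match_exact(d, query)]

print(len(any_hits), len(findall_hits), len(exact_hits))  # 4 2 1
```

In this toy corpus every EXACT hit is also a FINDALL hit, and every FINDALL hit is also an OR hit. In general the counts can be equal, so the strict inequality is what one typically observes on large collections rather than a guarantee.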

EF-Ratios Defined

Results in EXACT mode are a subset of the results obtained in FINDALL mode. This allows me to define an EXACT-to-FINDALL ratio I call the EF-Ratio:

\[ \text{EF-Ratio} = \frac{n_{12}(\text{EXACT})}{n_{12}(\text{FINDALL})} \]
Figure 1. EF-Ratios Equation.

Thus, as an approximation, an EF-Ratio gives the fraction of documents from a FINDALL search that are relevant to an EXACT search. To illustrate, consider the query Q = discount hotels, where k1 = discount and k2 = hotels.

Co-occurrence statistics for discount hotels. Source: Google, 02/15/05
| Q = k1 + k2 | n1 for k1 | n2 for k2 | n12, FINDALL Mode | n12, EXACT Mode | EF-Ratio |
|---|---|---|---|---|---|
| discount hotels | 136,000,000 | 364,000,000 | 24,300,000 | 14,400,000 | 59.26% |

The table shows that

This means that EF-Ratios estimate the relative frequency of natural sequences (e.g., phrases) in a corpus. So, one can estimate how naturally 2, 3,... n terms may co-occur next to each other in a search engine collection.
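As a quick check of the arithmetic, the EF-Ratio for discount hotels can be recomputed from the two n12 counts in the table. The function name `ef_ratio` is a hypothetical helper introduced for this sketch.

```python
# EF-Ratio = EXACT count / FINDALL count, using the counts reported
# in the table for "discount hotels" (Google, 02/15/05).

def ef_ratio(n_exact, n_findall):
    if n_findall <= 0:
        raise ValueError("FINDALL count must be positive")
    return n_exact / n_findall

n12_findall = 24_300_000
n12_exact = 14_400_000

ratio = ef_ratio(n12_exact, n12_findall)
print(f"{ratio:.2%}")  # 59.26%
```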

An EF-Ratio is also an estimated probability value, P. When I first developed the EF-Ratio and c-Index metrics I defined the former as follows:

Proposed Definition: Given a query Q = k1 k2 k3...kn consisting of n terms, where each k is a single term, the probability that a search for Q in FINDALL mode would return documents with the EXACT sequence Q = k1 k2 k3...kn is its EF-Ratio.

The estimated probability that a document selected at random from the 24,300,000 returned documents is relevant to the exact sequence of queried terms is 59.26%. The odds of such an event are

0.5926/(1 - 0.5926) = 1.45 or about 3 to 2.
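The conversion from probability to odds can be reproduced in a few lines. This is a minimal sketch: the helper `odds` is a name chosen for the example, and the "about 3 to 2" figure is obtained here by asking for the closest fraction with a small denominator, which is one of several reasonable ways to round odds.

```python
from fractions import Fraction

def odds(p):
    # Odds in favor of an event with probability p.
    return p / (1 - p)

p = 0.5926
print(round(odds(p), 2))  # 1.45

# The same odds from the raw counts: EXACT hits versus the remaining FINDALL hits.
exact_odds = Fraction(14_400_000, 24_300_000 - 14_400_000)
print(exact_odds)                       # 16/11
print(exact_odds.limit_denominator(2))  # 3/2
```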

Note that I emphasize the word "estimated". With most IR systems there is a margin of error since the number of relevant documents and the number of retrieved documents can be different (precision-recall issues). With commercial search engines like Google we can only rely on the fact that the retrieved results are supposed to be relevant.

Venn Diagrams

Figure 2 shows a Venn Diagram of the EF-Ratio metric for a 2-term query.

Figure 2. EF-Ratio Venn Diagram for k1 and k2.

The circle at the left -labeled n1- represents the number of search results relevant to k1 while the circle at the right -labeled n2- represents the number of results relevant to k2. The overlapping region, n12, represents the number of results relevant to k1 and k2 in FINDALL mode. The circle in the middle of the overlapping region represents the number of results in EXACT mode relevant to k1 and k2.

Figure 2 can be used to visualize EF-Ratios of 2-term queries but is incomplete, since document retrieval is often the result of many processes taking place inside an IR architecture. One such process is document tokenization. During tokenization, text is lowercased and delimiters and stopwords are removed or ignored. Which delimiters and stopwords are removed or ignored depends on the parsing rules and filters (e.g., a library of regular expressions) used by each search system.
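The tokenization step described above can be sketched as follows. The stopword list and the delimiter pattern are assumptions made for the example; each search system applies its own rules, which is precisely the point of this section.

```python
import re

# Hypothetical stopword list and delimiter rule; real engines use their own.
STOPWORDS = {"a", "an", "in", "of", "the"}
DELIMITERS = re.compile(r"[|\-:;.,]+")

def tokenize(text, drop_stopwords=True):
    text = text.lower()                 # lowercase the text
    text = DELIMITERS.sub(" ", text)    # treat delimiters as spaces
    tokens = text.split()
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

print(tokenize("Discount: Hotels in the City"))
# ['discount', 'hotels', 'city']
print(tokenize("Discount: Hotels in the City", drop_stopwords=False))
# ['discount', 'hotels', 'in', 'the', 'city']
```

Two systems that differ only in the `drop_stopwords` setting would already disagree on which documents contain a given exact sequence.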

This means that EXACT implementations, like EF-Ratio metrics, are unique to each search engine. Consequently, combining metrics extracted from one search engine with metrics from another search engine is a highly questionable practice and should be avoided. The reasoning behind this statement is given in the next sections.

Separators and EXACT Queries

Not all EXACT mode implementations are identical. With documents, the content to be matched is often processed using different parsing and tokenization rules. Documents may contain instances in which the following separators are found between the queried terms:

  1. s1: spaces; e.g., as in phrases
  2. s2: delimiters; e.g., pipes, dashes, colons, semicolons, periods, commas, etc.
  3. s3: stopwords; e.g., a, in, of, the, an, etc.

Therefore, the subset of documents returned in EXACT mode is a composite of other subsets. See Figure 3.

Figure 3. Approximate representation of the separators inside an EXACT subset.
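To make the three separator types concrete, the sketch below looks at what sits between k1 and k2 in a piece of text and labels it with the s1, s2 and s3 names used above. The stopword list, the delimiter characters and the function name `separator_kinds` are assumptions for the example, and it inspects only the first qualifying occurrence in the text.

```python
import re

STOPWORDS = ("a", "an", "in", "of", "the")  # assumed stopword list
DELIMITERS = "|-:;.,"                        # assumed delimiter characters

def separator_kinds(text, k1, k2):
    """Label the separators between k1 and k2, or return None when the
    terms are not joined only by spaces, delimiters and stopwords."""
    stop = "|".join(map(re.escape, STOPWORDS))
    delim = re.escape(DELIMITERS)
    pattern = re.compile(
        rf"\b{re.escape(k1)}\b((?:[\s{delim}]|\b(?:{stop})\b)+)\b{re.escape(k2)}\b",
        re.IGNORECASE,
    )
    m = pattern.search(text)
    if m is None:
        return None
    gap = m.group(1)
    kinds = []
    if any(ch in DELIMITERS for ch in gap):
        kinds.append("s2")
    if re.search(r"[a-z]", gap, re.IGNORECASE):
        kinds.append("s3")  # any letters allowed in the gap are stopwords
    return kinds or ["s1"]  # only whitespace: a plain phrase

print(separator_kinds("Discount hotels in Miami", "discount", "hotels"))    # ['s1']
print(separator_kinds("Discount | Hotels", "discount", "hotels"))           # ['s2']
print(separator_kinds("a discount in hotels", "discount", "hotels"))        # ['s3']
print(separator_kinds("discount airfare and hotels", "discount", "hotels")) # None
```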

Implications for Search Engine Optimization

This has serious implications for searches and search engine optimization. It is now clear that, contrary to claims found in the literature on search engine marketing, a search in EXACT mode IS NOT necessarily a search for phrases. Furthermore, the way a system implements delimiters (e.g., pipes, hyphens, colons, semicolons, periods, commas, etc.) affects query results, retrieval and performance. For instance, a search engine that does not ignore the underscore ("_") delimiter should interpret the k1_k2 sequence as a single term, not as k1 followed by k2.
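The underscore case is easy to reproduce with regular expressions. In Python's `re` module the `\w` class matches letters, digits and the underscore, so a tokenizer built on `\w+` keeps k1_k2 whole, while one that excludes the underscore splits it. The function below is a hypothetical tokenizer written for this illustration, not the rule set of any particular engine.

```python
import re

def tokenize(text, underscore_is_delimiter):
    # \w+ keeps "k1_k2" as one token because \w includes the underscore;
    # [^\W_]+ excludes the underscore, so it acts as a delimiter.
    pattern = r"[^\W_]+" if underscore_is_delimiter else r"\w+"
    return re.findall(pattern, text.lower())

print(tokenize("k1_k2", underscore_is_delimiter=False))  # ['k1_k2']
print(tokenize("k1_k2", underscore_is_delimiter=True))   # ['k1', 'k2']
```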

A practical consequence of these findings is that if a SEM or SEO specialist knows that a search engine ignores, let's say, periods, pipes, semicolons and other delimiters or specific stopwords, he/she can mix and match keywords with the ignored characters or terms. Thus, instead of repeating a k1 + k2 +...kn sequence over and over like a broken record, the optimizer could do something like this in the title or body of a document:

Note how mixing and matching terms with delimiters and stopwords can be exploited. Unfortunately, this also works as an invitation for the adversarial practice of keyword spamming.

Understanding how search engines implement queries comes in handy. For instance, in Google the hyphenated portion of a query is interpreted as an EXACT local condition even if the query is submitted in FINDALL mode. So a search for k1-k2 + k3 is interpreted as a search in which the k1 k2 sequence can appear before or after k3, anywhere in the document and regardless of how close or far it is from k3. The localized EXACT condition will force the query to ignore documents containing k1, k2, and k3 if these do not contain the k1 k2 sequence.
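The behavior just described, a FINDALL query with a localized EXACT condition, can be mimicked on toy documents. This sketch only imitates the described outcome; it is not Google's algorithm, and the tokenizer, function names and sample documents are assumptions.

```python
import re

def tokenize(text):
    # Assumed tokenizer: lowercase, split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_sequence(tokens, seq):
    n = len(seq)
    return any(tokens[i:i + n] == seq for i in range(len(tokens) - n + 1))

def matches(doc, query):
    """FINDALL over the space-separated query parts; a hyphenated part
    such as "k1-k2" must appear as the consecutive sequence k1 k2."""
    tokens = tokenize(doc)
    for part in query.split():
        seq = tokenize(part)  # "k1-k2" -> ["k1", "k2"]; "k3" -> ["k3"]
        if not contains_sequence(tokens, seq):
            return False
    return True

query = "discount-hotels york"
print(matches("New York discount hotels", query))            # True
print(matches("Discount hotels far away from York", query))  # True
print(matches("Hotels in York with a discount", query))      # False
```

The third document contains all three terms but not the discount hotels sequence, so the local EXACT condition excludes it.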

Another thing SEMs/SEOs need to understand is that search engines may implement delimited queries in different ways. For instance, as of today a FINDALL query delimited by a pipe that is not surrounded by spaces, as in

k1|k2

is interpreted as an OR query by Google while the same query is treated as an EXACT query in MSN. However, if the pipe itself is delimited by spaces, it will be ignored and both search engines will interpret the query as a regular FINDALL query.

Patterns and Overlapping Regions

Figure 3 is just an approximate representation of the separators inside an EXACT subset. More likely, a given document contains instances of k1 and k2 with different or even combined separators. A more realistic representation describing an EXACT implementation is given in Figure 4.

Figure 4. Overlapping regions within overlapping regions.

The figure accounts for all possible combinations. It is essentially a Venn Diagram consisting of overlapping regions within overlapping regions: a self-organized, self-similar scenario. It turns out that the subsets that make up the EXACT subset actually account for documents containing instances of k1 and k2 separated by

This means that a search in EXACT mode returns an overlapping cluster of documents according to the nature of the separators. The only subsets that account exclusively for phrases are the s1 subsets, where k1 and k2 are separated by spaces.

Now if a search engine does not remove stopwords before, during or after the tokenization process, instances of k1 and k2 separated by stopwords (e.g., k1 in k2, k1 of k2, etc) will not count as exact matches. Thus, the subsets s3, s13, s23 and s123 should not exist and the overlapping pattern of the Venn Diagram becomes a bit more self-similar (self-similar patterns consist of reduced copies of themselves). This is illustrated in Figure 5.
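The subsets involved can be enumerated mechanically as the non-empty combinations of the three separator types. In the sketch below the label s12 is an assumption formed by analogy with the s13, s23 and s123 labels mentioned above, and the function name `exact_subsets` is hypothetical.

```python
from itertools import combinations

SEPARATORS = {1: "spaces", 2: "delimiters", 3: "stopwords"}

def exact_subsets(removes_stopwords=True):
    """List the subsets of an EXACT result set, one per non-empty
    combination of separator types. If the engine does not remove
    stopwords, every combination involving s3 is dropped."""
    labels = []
    for size in range(1, len(SEPARATORS) + 1):
        for combo in combinations(sorted(SEPARATORS), size):
            if not removes_stopwords and 3 in combo:
                continue
            labels.append("s" + "".join(str(i) for i in combo))
    return labels

print(exact_subsets(True))   # ['s1', 's2', 's3', 's12', 's13', 's23', 's123']
print(exact_subsets(False))  # ['s1', 's2', 's12']
```

With stopwords out of the picture, what remains is a single two-set overlap (s1, s2 and their intersection), which has the same shape as the 2-term diagram it sits inside.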

Figure 5. Self-overlapping Venn Diagram for 2-term query.

Note the obvious: an overlapping pattern within an overlapping pattern. A similar Venn Diagram can be used to describe the EXACT implementation of systems that do not ignore delimiters.

The above analysis can be applied to queries consisting of more than two terms, but the corresponding Venn Diagrams and overlapping patterns are a bit more complex. Figure 6 shows the corresponding diagram for a 3-term query. Overlapping patterns within overlapping patterns! It takes a few seconds to realize that query mode implementations can exhibit a fractal component.

Figure 6. Self-overlapping Venn Diagram for a 3-term query.

Pitfalls SEMs/SEOs Should Avoid

Since tokenization procedures, like EXACT mode implementations, are likely to vary from one search engine to another, it is clear why combining query metrics from one search engine with metrics from another search engine to come up with a new metric (or math model) is contraindicated. Results from such approaches are at best too speculative and too limited. These types of keyword research strategies can also mask trends and interactions between critical experimental variables.

Ultimately, metrics based on search results are not just affected by scoring term weights. They can be the result of tokenization, document linearization and similar procedures taking place at the level of the individual IR architectures. Overall, search engine result pages (SERPs) are the result of ranking algorithms, not necessarily of users' search behaviors (how often a term or phrase is queried). As a matter of fact, so far most ranking algorithms care little about how frequently a term was searched this or the previous month, week or day, unless of course we are talking about paid services (e.g., pay-per-click scores) or a new relevancy algorithm that considers search volumes for the final score.

Thus, combining search metrics from different search engines (i.e., search volumes from one search engine with search results from another search engine) to try to come up with a new metric is a highly questionable approach. This is like trying to combine sales numbers from one retail store with stock inventory from a competitor's store or a third-party store.

Still, from time to time some "keyword researchers" claim that combining Overture's (a Yahoo property) search volumes with Google query results or Wordtracker results can be used to produce a new keyword metric. I can think of many reasons why this is contraindicated and plain misleading.

In the particular case of Overture and Wordtracker, these provide total counts for queries using the default search engine query mode. Therefore, these numbers do not tell the user which fraction corresponds to EXACT results, transposed results (i.e., k2 + k1), or results returned for documents containing the query terms combined with ignored delimiters and stopwords. My advice: SEOs, SEMs and keyword research firms should stay away from such practices before their metrics lose credibility in the industry or with clients.


References
  1. E. Garcia, Introduction to Co-Occurrence Theory; Advanced Issues Track, Search Algorithm Research & Developments, Search Engine Strategies 2005 Conference & Expo, Feb 28 - March 3, 2005, New York City, NY.
  2. E. Garcia, Keywords Co-Occurrence and Semantic Connectivity Strategies.

Copyright © 2006 Mi Islita.com -