EF-Ratios - A Tutorial on Global and Local Term Sequences
A Tutorial on Global and Local EF-Ratios with Applications to Term Sequences, Topical Sentences, Passages, Documents and Search Engine Optimization
Dr. E. Garcia
Mi Islita.com
Last Update: 09/30/05
Article 3 of the series Information Retrieval Tutorials
Topics
On EF-Ratios
Avoiding Keyword Research Misconceptions
A Visual Example
An IR Scenario
EXACT and FINDALL Modes
EF-Ratios Defined
Global EF-Ratios
Identification of Candidate Sequences
Local EF-Ratios
Tutorial Review
References
On EF-Ratios
In two advanced series, Fractal Semantics and Keywords Co-Occurrence, and at the SES 2005 conference in New York, I introduced the EF-Ratio metric to the SEO community. Cre8asiteforums and the Search Engine Watch Forums (SEWF) also discuss this metric (1 - 5).
An EF-Ratio is a way of measuring co-occurring terms. The goal is to extract global, local and scaling information from a text in which certain terms tend to co-occur in a sequential fashion. The purpose of this tutorial is to explain EF-Ratios in layman's terms and show how SEOs could use the metric.
This tutorial is organized as follows. A visual example is presented to grasp the concept of global and local measures. This is followed by a brief discussion on query modes. Next I explain how EF-Ratios could be used in different settings. Some working examples are provided. As usual, the tutorial ends with a review in the form of practical exercises.
Avoiding Keyword Research Misconceptions
Before proceeding any further, I would like to clarify that EF-Ratio is the name I gave this metric back in 2001. The "E" stands for EXACT and the "F" for FINDALL. It is a kind of signal-to-noise and probability ratio. In this tutorial you will learn several things, one being that searching in EXACT mode is not a search for phrases, as many keyword research "gurus" and SEOs incorrectly think. Likewise, searching in Google's default mode (FINDALL) for the sequence outdoor clothing may return a lot of documents, many of which do not contain these terms one after the other.
Let me explain. Searching in EXACT mode for outdoor clothing may return documents in which the two terms are separated by delimiters or stopwords, not only by spaces (as in phrases). As of 10/25/05 Google returns 19,000,000 docs for outdoor clothing in FINDALL mode and 3,150,000 docs in EXACT mode. This does not mean that 3,150,000 docs are targeting these terms in a phrase format. This observation is important since how words are stated in documents and queries is a reflection of how terms are used or searched in a given language, market sector, industry or country.
Having a clear understanding of query mode implementations may render most current keyword search volume statistics useless. A paid or free keyword research service (Google, Yahoo, Wordtracker, etc) or SEO "guru" may tell you that some terms were searched X number of times this Y week or Z month. But if they don't tell you which fraction of X accounts for searches in EXACT mode or which searches actually involve targeting a real phrase, in a specific search engine database, then the collected search volume analytics are already contaminated with a lot of noise. Right?
Now imagine that we recombine all those contaminated results from different small collections into a single listing. Isn't this worse? And how about those "keyword research gurus" that promote the idea of combining web analytics from dissimilar databases to create a new web metric? Isn't this like combining Wal-Mart inventory with Kmart analytics to propose a new metric for Burger King? Why, then, are many marketers still buying into such services? Evidently, many have a vested interest in promoting such "keyword research" products and services and in looking for an apparent quick fix to a larger problem.
The point is that for SEOs and their clients, understanding query modes is essential if they want to draw valid conclusions from such "keyword research". Let's see how this tutorial can help you in this endeavor.
A Visual Example
Let's say I have 3 boxes, each with 4 compartments and containing apples and oranges, represented in Figure 1 by red and orange circles. The fruits tend to move freely within their compartments, so there is a lot of disorder (entropy). Now let's say that in the first and second compartments of the first box I interconnect one apple and one orange with a wood stick, so if they shift a bit the two remain in place, one in relation to the other. So the fruits are a bit more ordered within their compartments.
This degree of ordering (less entropy) can be represented by the APPLE-STICK-ORANGE sequence (see Figure 1).

Figure 1. EF-Ratio Visual Example.
The ratio of boxes containing some apples and oranges interconnected is 1/3 = 0.33 or 33%. I can use this global result to discriminate between boxes but not between compartments. To know the latter I must inspect the first box a bit closer.
The ratio of compartments in the first box containing some apples and oranges interconnected is 2/4 = 0.50 or 50%. The information is more specific or local.
In this example, I computed two ratios based on whether or not the fruits are ordered following the APPLE-STICK-ORANGE sequence. I can extract global information since the sequence allows me to discriminate between boxes. Note that I can also distinguish between compartments from the first box but not between all compartments. That is, I can extract local information and ordering pertaining to the first box.
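The two ratios from the fruit example can be reproduced in a few lines of Python. This is only a sketch: the nested-list layout and the variable names are assumptions made for illustration, not part of the original example.

```python
# Each box is a list of 4 compartments. True means the compartment holds at
# least one APPLE-STICK-ORANGE pair (this data layout is an assumption).
boxes = [
    [True, True, False, False],    # box 1: compartments 1 and 2 are linked
    [False, False, False, False],  # box 2: nothing linked
    [False, False, False, False],  # box 3: nothing linked
]

# Global ratio: fraction of boxes with at least one linked compartment.
global_ratio = sum(any(box) for box in boxes) / len(boxes)

# Local ratio: fraction of linked compartments inside the first box.
local_ratio = sum(boxes[0]) / len(boxes[0])

print(f"global = {global_ratio:.2f}, local = {local_ratio:.2f}")
```

The global figure separates box 1 from the other boxes, while the local figure only says something about the compartments inside box 1.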
I can have more boxes with fruits, make the compartments smaller or bigger to accommodate different amounts of apples and oranges. The fruits can be interconnected or not, be of different sizes, connected in groups of 2, 3, 4, etc. I can even subdivide the compartments into smaller units and compute more specific ratios using the APPLE-STICK-ORANGE sequence or other specific sequence as the determining factor. The calculations are more content-bearing and specific. I am grasping different information in each case.
For reasons that will be obvious, I am going to call the computed fractions EF-Ratios. Say "Hi!" to this metric. A technical definition is given in Reference 1.
An IR Scenario
Let's transfer the above example to an information retrieval (IR) scenario by assuming that the
- boxes represent retrieved documents.
- compartments represent passages from the documents.
- fruits represent specific terms.
- sticks represent separators (spaces, stop words or delimiters).
If a separator is a space, then a term sequence has a phrase format. It may happen that the separators are not spaces but stop words (terms ignored by the system such as in, a, of, the, etc), delimiters (periods, colons, semicolons, pipes, hyphens, and similar characters) or a combination of these.
If the IR system or search engine was programmed to ignore these separators, and I search for, let's say, discount hotels, the system may return documents containing
discount hotels
discount the hotels
discount & hotels
discount - hotels
discount | hotels
discount: hotels
... discount. Hotels ...
or similar sequences. Some documents may even contain more than one of these sequences or sequences containing multiple separators.
After ignoring the separators, the system will interpret these as similar co-occurring sequences even when these are not real phrases. So, for a search engine an "exact sequence" is not necessarily a phrase. It all depends on the parsing rules and regular expressions used to match queries to text. Consequently, search volume counts in both FINDALL and EXACT mode for discount hotels are not telling the whole story and therefore better keyword research services are needed.
What about passages? I can define passages as entire paragraphs, sentences, groups of sentences, etc. I can even define a passage as a text window of a given length, where the length is given by a number of words or characters. In each case I should be able to compute a local EF-Ratio. How I define the passages determines the amount of local information I can get.
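To make the separator idea concrete, here is a minimal sketch of a separator-tolerant matcher. The stop word list, the delimiter set and the function name `sequence_pattern` are assumptions chosen for illustration; the parsing rules of a real search engine are not public and will differ.

```python
import re

# Assumed separator rules (illustrative only).
STOP_WORDS = ["the", "a", "of", "in"]
DELIMITERS = r"[\s&|:;.,\-]"

def sequence_pattern(terms):
    """Build a regex matching the terms in order, separated only by
    delimiters and/or stop words."""
    stop = "|".join(map(re.escape, STOP_WORDS))
    # One or more delimiters, optionally followed by stop words that are
    # each followed by one or more delimiters.
    separator = rf"{DELIMITERS}+(?:(?:{stop}){DELIMITERS}+)*"
    body = separator.join(re.escape(t) for t in terms)
    return re.compile(rf"\b{body}\b", re.IGNORECASE)

pattern = sequence_pattern(["discount", "hotels"])
samples = [
    "discount hotels",
    "discount the hotels",
    "discount & hotels",
    "discount - hotels",
    "discount | hotels",
    "discount: hotels",
    "... discount. Hotels ...",
    "hotels discount",
    "discount rates on hotels",
]
for text in samples:
    print(f"{text!r}: {bool(pattern.search(text))}")
```

Under these assumed rules the first seven samples match and the last two do not: the transposed sequence is rejected, and so is a sequence interrupted by ordinary (non-stop) words.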
Before using this metric with a search, I need to know about the two infamous query modes: FINDALL and EXACT.
EXACT and FINDALL Modes
Searching in EXACT mode instructs a search engine to return documents containing the exact sequence of terms as specified in the query. To tell the system to search in EXACT mode we can use its advanced search features. In most search engines, you can also specify this by double or single quoting query terms in the search box.
So, if I search in EXACT mode for "discount hotels", documents containing the intended sequence will be returned whether the separators are spaces, stop words or delimiters. However, documents containing only the transposed sequence "hotels discount", or either term scattered anywhere in the document, will be ignored unless the intended sequence also appears in them.
In most search engines, "appears" means that query terms can be found in the content visible or invisible to the reader; i.e. in the source code, including the document url. For instance, in an HTML document query terms can appear in the head, body or url.
Searching in FINDALL mode is a lot easier since in most search engines this is the default mode, also known as AND. In this mode, the system returns documents containing all query terms in no particular order. Terms can appear anywhere in the documents and be separated by any number of arbitrary terms or characters. This is the mode most searchers use.
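The difference between the two modes can be sketched on a toy collection. The tokenizer, the stop word list and the function names below are assumptions for illustration; they are not how any particular search engine implements its query modes.

```python
import re

STOP_WORDS = ("the", "a", "of", "in")  # assumed list, for illustration

def tokens(text):
    """Lowercase word tokens; punctuation (delimiters) is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_findall(query, doc):
    """FINDALL (AND): every query term occurs somewhere, in any order."""
    return set(tokens(query)) <= set(tokens(doc))

def matches_exact(query, doc):
    """EXACT: query terms occur in order once delimiters and the assumed
    stop words are ignored."""
    q = [t for t in tokens(query) if t not in STOP_WORDS]
    d = [t for t in tokens(doc) if t not in STOP_WORDS]
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

docs = [
    "Discount hotels in Paris",         # sequence separated by a space
    "Hotels: discount rates all year",  # both terms, transposed order
    "Discount the hotels you dislike",  # sequence separated by a stop word
    "Cheap hotels downtown",            # only one of the two terms
]
findall_hits = [d for d in docs if matches_findall("discount hotels", d)]
exact_hits = [d for d in docs if matches_exact("discount hotels", d)]
print(len(findall_hits), len(exact_hits))
```

In this toy collection three documents satisfy FINDALL and two of those also satisfy EXACT, so every EXACT hit is also a FINDALL hit.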
EF-Ratios Defined
Evidently, searching in FINDALL should also return the documents one would retrieve in EXACT mode. This is why we say that results in EXACT mode are a subset of the results obtained in FINDALL mode. Thus, if I take the ratio of documents retrieved by the two modes I get
EF-Ratio = EXACT results / FINDALL results
Figure 2. EF-Ratio Equation.
That is, an EF-Ratio is the fraction (or percent) of documents retrieved in FINDALL containing the EXACT sequence of terms as specified by a query.
Here is an example. On 10/01/05 a search in Google for discount hotels returned 54,900,000 results. When I double quoted the terms and pressed the 'Search' button Google returned 13,200,000 results for "discount hotels". Evidently,
EXACT results = 13,200,000
FINDALL results = 54,900,000
EF-Ratio = 13,200,000/54,900,000 = 0.24 or 24%
That is, 24 out of 100 (roughly 1 out of 4) documents target the exact sequence "discount hotels".
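The same arithmetic as a small helper, using the two Google counts quoted above. The function name `ef_ratio` is just a label chosen for this sketch.

```python
def ef_ratio(exact_results, findall_results):
    """Fraction of FINDALL results that also match in EXACT mode."""
    if findall_results <= 0:
        raise ValueError("FINDALL result count must be positive")
    return exact_results / findall_results

# Counts reported above for discount hotels (Google, 10/01/05).
ratio = ef_ratio(13_200_000, 54_900_000)
print(f"EF-Ratio = {ratio:.2f} or {ratio:.0%}")
```

This prints `EF-Ratio = 0.24 or 24%`, matching the hand calculation.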
Global EF-Ratios
This is quite useful global information. I can use global EF-Ratios to find out how natural or unnatural, popular or unpopular, targeted or untargeted a sequence may be in a given search engine. For instance, I computed in Google an EF-Ratio for the reverse sequence "hotels discount" and this is what I got
EXACT results = 2,950,000
FINDALL results = 54,900,000
EF-Ratio = 2,950,000/54,900,000 = 0.05 or 5%
Clearly in Google, only 5% of documents relevant to discount hotels target the reverse sequence. Thus, a global EF-Ratio can be used to elucidate:
- which sequence is more targeted or popular in a collection
- the relative competitiveness among documents for a sequence
- which sequence is most likely harder to rank for
For example, early this year I tested two natural sequences and one unnatural sequence in Yahoo! and obtained the following:

Figure 3. Natural and Unnatural Sequences.
Note that sequences that flow naturally, are not forced, or that are proper due to common usage tend to exhibit higher EF-Ratios than those that are unnatural, forced or uncommon. I can do a similar analysis in other languages. For instance, let's assume that I do not know Spanish but I need to make an educated guess between the candidate sequences "vacaciones de verano" and "vacaciones verano de". Computing EF-Ratios may help since in most cases incorrect sequences are rarely targeted.
Of course, when using EF-Ratios in this way I need additional information, since an unnatural sequence can be targeted by different means or can be accepted because it has a specific valid meaning in a given language or because it is a popular brand or part of a popular slogan. To illustrate, try to compute EF-Ratios for "hot pizza" and "pizza hot". The lesson here is that we cannot compute EF-Ratios disconnected from common usage in a given sector or demographic, as the metric is what it is: a research tool, not a silver bullet or an "oracle". This applies to English and Spanish sequences and to any language for that matter.
We also need to keep in mind that EF-Ratios are database-specific. For instance, early this year I wanted to identify the most targeted sequences about a given topic in Google and MSN. My goal was to determine how well targeted several on-topic sequences were so I could decide for which sequence I should optimize documents indexed by these search engines. This is what I got:

Figure 4. On-Topic Sequences.
So, back on 02/15/05, in Google more documents targeted "discount hotels" than "budget hotels" while in MSN the reverse was the case. Most likely, it was easier to rank for "discount hotels" in MSN than in Google since in the latter too many documents, almost 60 out of 100, were competing for visibility using this sequence. I also noticed from the graph that it was most likely easier to rank for "economy hotels" than for "discount hotels" in Google since fewer than 3 out of 100 documents were targeting "economy hotels". Thus, the global EF-Ratio tells whether there was too much competition between documents for a given sequence. I often refer to this as text popularity or text competition.
That was then. If I want to determine all this today, I would need to repeat the analysis and check for any change in text popularity.
Thus, EF-Ratios can be used to monitor term sequences over time. I call this type of research temporal co-occurrence analysis (TCA). The goal of TCA is to determine temporal trends and patterns such as amount of noise in answer sets, seasonal trends and how external world events influence search results.
Identification of Candidate Sequences
A question that often arises in discussion forums relates to FINDALL queries. This is a mode where the answer set should consist of documents containing all query terms in no particular order or proximity. Does this mean that a search for k1 and k2 is the same as searching for k2 and k1? Absolutely not. Sometimes the total number of results differs. Sometimes the totals agree but the results are sorted differently. There are many reasons why a query in this mode can produce different results upon term transpositions. I have explained this on several occasions at SEO discussion forums and at this site (1 - 5). However, this may raise some valid questions when it is time to elucidate candidate sequences.
There are two ways of elucidating candidate sequences. One method consists of conducting a TCA study to monitor over time several sequences from a pool of terms. This approach works well, especially when we deal with long-term correlations over time as is the case of seasonal trends and sudden search results triggered by external events.
Another method I use consists of making a matrix of EF-Ratios by assuming all possible combinations for the ratios. I use this method when I need to determine the current state of text popularity for a sequence.
If I search in EXACT mode for "discount hotels", documents containing the sequence may also be contained in the discount hotels or hotels discount FINDALL sets or in both sets. I'm going to call these SET 1 and SET 2. Similarly, if I search in EXACT mode for "hotels discount" documents containing this sequence can be part of the discount hotels or hotels discount FINDALL sets or in both sets. For a query consisting of two terms, I can inspect the likelihood of a combination by computing four different EF-Ratios.
Figure 5 shows recent search results in Google (10/06/05), where the four EF-Ratios were computed.

Figure 5. Identification of Candidate Sequences by the EF-Ratio Matrix Method.
According to these results, "discount hotels" seems to be more targeted. This was also true on 02/15/05. Back then the relative error, defined as the relative deviation (deviation/mean) between the two FINDALL sets, was negligible, even considering that Google returned far more results, about 25,000,000 to 30,000,000 more. Where are those documents? I don't know. Ask Google. A seasonal reason, a filter or database upgrades/purging could account for these.
In Figure 5, I computed the EF-Ratios by assuming that documents containing the exact sequences belong 100% to either SET 1 or SET 2. If the documents were distributed 50:50, I would need to divide the computed EF-Ratios by two. This would not affect the fact that "discount hotels" is more targeted than "hotels discount".
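The matrix method for a two-term query can be sketched as follows. The function name `ef_matrix`, the `share` parameter and, above all, the counts are assumptions: the counts are made up for illustration and are not the values behind Figure 5.

```python
def ef_matrix(exact_counts, findall_counts, share=1.0):
    """Return {(exact_sequence, findall_query): EF-Ratio} for every pairing.

    share is the assumed fraction of the exact-sequence documents that
    belongs to each FINDALL set: 1.0 for the 100% assumption, 0.5 for a
    50:50 distribution.
    """
    return {
        (seq, query): share * exact / findall
        for seq, exact in exact_counts.items()
        for query, findall in findall_counts.items()
    }

# Made-up counts, for illustration only.
exact_counts = {'"k1 k2"': 200, '"k2 k1"': 50}
findall_counts = {"k1 k2": 1000, "k2 k1": 800}  # SET 1 and SET 2

matrix = ef_matrix(exact_counts, findall_counts)
halved = ef_matrix(exact_counts, findall_counts, share=0.5)
for (seq, query), ratio in matrix.items():
    print(f"{seq} within FINDALL {query}: {ratio:.4f}")
```

Because `share` scales all four ratios by the same factor, switching from the 100% assumption to 50:50 halves every entry without changing which sequence comes out as more targeted.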
Local EF-Ratios
Global ratios tell me a lot, but they do not tell me WHERE in a document a term sequence occurs. If I want to determine this I need to compute a local EF-Ratio across document passages. I see many applications of EF-Ratios in the area of document segmentation analysis. Let me explain.
A passage is a text window. The length of a passage can be defined in terms of characters, words, sentences or paragraphs. Depending on what I want to analyze or conduct text summarization on, I can define passages as a 30-word window, a 100-character window, as every other sentence, as a group of sentences, and the like.
Initially, I defined a passage as a sentence. The problem with this definition is that: (a) large documents most likely consist of more sentences than short documents, (b) not all sentences are of the same length and (c) a sequence can appear more than once in a sentence. Even so, local information can be extracted from sentence passages.
By a sentence passage I mean text streams ending in periods, semicolons, question marks or exclamation points. If you ask OKAPI users, they will tell you why sentence and passage summarization methods are so important. Many text analysis and ranking algorithms use OKAPI-related measures. Many readability researchers have combined OKAPI/Dale-Chall-based measures in sentence passage extraction. Just do a search in Google for "okapi sentence analysis" (6).
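Fixed-size windows are the simplest passage definition to code. The sketch below assumes non-overlapping windows and whitespace-delimited words; both are choices made for this example, and overlapping windows would be an equally valid convention.

```python
def word_windows(text, size=30):
    """Split text into consecutive, non-overlapping windows of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def char_windows(text, size=100):
    """Split text into consecutive, non-overlapping windows of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# A synthetic 65-word text, just to show the window sizes.
sample = " ".join(f"w{n}" for n in range(65))
passages = word_windows(sample, size=30)
print([len(p.split()) for p in passages])
```

The last window is usually shorter than the rest, which is worth remembering when comparing ratios across documents of different length.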
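A rough splitter for sentence passages as defined above might look like this. It is a naive sketch: it assumes the terminator is followed by whitespace, and it will wrongly split on abbreviations such as "Dr." that a production tokenizer would need to handle.

```python
import re

def sentence_passages(text):
    """Split text after periods, semicolons, question marks or exclamation
    points that are followed by whitespace."""
    parts = re.split(r"(?<=[.;?!])\s+", text.strip())
    return [p for p in parts if p]

print(sentence_passages("Book now. Rates vary; call us! Any questions?"))
```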
Passages are a flexible, scalable concept. For instance, in topic analysis and readability studies we look at topical sentences. Depending on writers' intentions, topical sentences can be defined as the first or last sentence of a paragraph. So, having a local co-occurrence measure (local EF-Ratios or C-Indices) that examines topical sentences is a practical way of analyzing authors' intentions, a text or a discourse.
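Extracting topical sentences can be sketched in the same spirit. The example assumes that paragraphs are separated by blank lines and that sentences end in a period, question mark or exclamation point followed by whitespace; the function name and the `position` argument are hypothetical.

```python
import re

def topical_sentences(text, position="first"):
    """Return the first (or last) sentence of every paragraph."""
    index = 0 if position == "first" else -1
    result = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        sentences = [s for s in re.split(r"(?<=[.?!])\s+", paragraph.strip()) if s]
        if sentences:
            result.append(sentences[index])
    return result

text = (
    "Rome has many hotels. Some are cheap. Book early.\n\n"
    "Flights vary. Compare fares first."
)
print(topical_sentences(text))
print(topical_sentences(text, position="last"))
```

The resulting list of sentences can then be treated as the set of passages over which a local co-occurrence measure is computed.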
Topical sentences are important to editors of scientific publications, press releases and news stories. Indeed, some journals, scientific publications and organizations such as the Astronomical Society of the Pacific ask authors to use topical sentences (7). These types of sentences are so important that they are recommended in test taking, essay examination and other settings (8).
In addition to topical sentences, passage analysis can be applied to abstracts, portions of documents or entire discourses. I can even apply the concept to books by defining passages as the first X number of words or paragraphs in a chapter. To some extent, with this convention I can do text summarization or examine if chapters of a book are related to each other in a topical manner. The point is that term co-occurrence metrics (EF-Ratios or C-Indices) allow me to extract local information to a given level of granularity.
Once I define a passage I should be able to conduct all sorts of co-occurrence analyses, either C-Index based or EF-Ratio based. To illustrate, let's say that I want to optimize a document for two terms k1 and k2 for a given search engine, that I define a passage as a sentence and that I define separators as interpreted by the target search engine. To compute a local EF-Ratio for the document I just need to use the following convention:
- count the number of sentences containing k1 and k2
- count the number of sentences containing the exact sequence "k1 k2" (delimited by a separator).
- calculate an EF-Ratio
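The three steps above can be sketched as a single function. The sentence splitter, the stop word list and the name `local_ef_ratio` are assumptions for illustration; in practice the separator rules should mirror those of the target search engine.

```python
import re

def local_ef_ratio(text, k1, k2, stop_words=("the", "a", "of", "in")):
    """Local EF-Ratio over sentence passages for the sequence "k1 k2"."""
    k1, k2 = k1.lower(), k2.lower()
    sentences = [s for s in re.split(r"(?<=[.;?!])\s+", text.strip()) if s]
    findall = exact = 0
    for sentence in sentences:
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        if k1 in words and k2 in words:
            findall += 1  # step 1: sentence contains both terms
            kept = [w for w in words if w not in stop_words]
            if any(a == k1 and b == k2 for a, b in zip(kept, kept[1:])):
                exact += 1  # step 2: k1 immediately followed by k2
    # Step 3: the ratio (0.0 when no sentence contains both terms).
    return exact / findall if findall else 0.0

doc = (
    "We list discount hotels in Rome. "
    "Hotels with a discount sell out fast. "
    "Find a discount in hotels nearby! "
    "Flights are extra."
)
ratio = local_ef_ratio(doc, "discount", "hotels")
print(f"local EF-Ratio = {ratio:.2f}")
```

In this toy document three sentences contain both terms and two of them contain the sequence in order, so the local EF-Ratio is 2/3.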
If I want to compare results with prospective competitive documents, what should I do? Well, it would make sense to query my target search engine for the intended sequence and compute EF-Ratios from the top N ranked documents. Right?
Nice exercise. I have more exercises in the next section. Some are applications to popular and natural sequences, on-topic combinations, text popularity, search volume, site traffic, topical sentences and discourse summarization. Have fun!
Tutorial Review
Compute EF-Ratios to answer the following questions:
1. Which of the following sequences have a higher EF-Ratio in Google and Yahoo?
(a) paris hilton or hilton paris
(b) bart simpson or jessica simpson
2. Which of the following sequences are more targeted in MSN? Justify your answer.
(a) economy cars
(b) budget cars
(c) affordable cars
3. Repeat exercise 2 but this time in Google. Compare results. For which sequence do you think it may be easier or more difficult to rank in MSN and Google, and why?
4. Which of the following sequences in Spanish are more natural and why?
(a) seguro de autos or de autos seguro
(b) hotel de lujo or lujo hotel de
5. Check your search log files. Assuming your logs can distinguish between FINDALL and EXACT queries, construct highly searched sequences. Sort in decreasing order of EF-Ratios.
6. Do a search in Google for a sequence you want to optimize a document for. Compute EF-Ratios for the document ranked #1. Repeat calculations, computing EF-Ratios for the top 10 ranked documents. Compare results with documents in positions 21 to 30 or 91 to 100.
7. In exercise 6, how many documents listed in the top 10 results in EXACT are also listed in the top 10 in FINDALL?
8. Compute an EF-Ratio for the body of a document you optimized or want to optimize, first defining passages as sentences and second as paragraphs. Compare with the EF-Ratio of a document ranked in the top 10 in Google for similar terms. Add meta data, title and other on-page factors. Recompute the EF-Ratio for the entire document.
9. Define a passage as a topical sentence. Compute an EF-Ratio for a large document. Compare with a shorter document about the same topic. For the same document, define a passage as a text window of 30 words. Compute and compare EF-Ratios.
10. Define a passage as the first 150, 250 and 400 words of a chapter. In each case, compute an EF-Ratio for a book.
References
1. Overlapping Patterns: EF-Ratios, Separators, Patterns and Pitfalls; E. Garcia (2005).
2. Keywords Co-Occurrence and Semantic Connectivity; E. Garcia (2005).
3. Introduction to Co-Occurrence Theory; E. Garcia, Advanced Issues Track, Search Algorithms and Research; Search Engine Strategies Conference, New York City - Feb 28 - March 3, 2005.
4. Search Engine Watch Forums; Jupitermedia, Inc.
5. Cre8asiteforums; Cre8pc.com.
6. Okapi Sentence Analysis; Google Search.
7. ASP: Guidelines for Mercury Contributors; The Astronomical Society of the Pacific.
8. Test Anxiety; University of Florida Counseling Center (2003).

