Targeting Documents and Terms Through Co-Occurrence Data
Targeting Documents and Terms: Using Co-Occurrence Data, Answer Sets and Probability Theory
Dr. E. Garcia
Mi Islita.com
Last Update: 12/13/05
Article 3 of the series Keywords Co-Occurrence and Semantic Connectivity
Topics
Co-Occurrence and Term Targeting
On Targeting Strategies
An Idealized Example
Targeting the n1 Answer Set
Prior Knowledge
Targeting the n2 Answer Set
Odds and Odds Ratios from Search Results
Why Should We Care About This?
Applications to SEO Copywriting
Applications to Multi-Term Queries
References
Co-Occurrence and Term Targeting
In Articles 1 and 2 of this series we discussed term relatedness and co-occurrence and presented a mathematical background for first and higher order co-occurrence (1, 2). We also covered several association measures and advised against the blind calculation of c-indices from answer sets (retrieved results).
In this article, we examine the retrieval behavior of co-occurring terms. In particular, we want to answer the following questions.
Given a set of pre-targeted terms, for instance a pool of terms A, B, C, D..., what is the probability that a user searching for a given term or sequence from this pool (e.g., B or B + D) would find within the answer set (search results) a document containing
- all terms from the pool?
- some terms from the pool?
Also,
- what fraction of an answer set is contained in another set?
- what is the likelihood of selecting a document containing all pre-targeted terms by querying individual terms?
- which query term is more discriminatory?
To answer these questions we need to query the system for all pre-targeted terms, individually and combined, and then conduct a probability analysis. To simplify the discussion, let's resort to some idealized conditions. First, let's consider the case of individual query terms. In future articles, I will discuss the case of phrase searches and more realistic search conditions.
On Targeting Strategies
Before proceeding any further, I want to point out that this article is about targeting terms and documents, not about determining term importance and keyword competitiveness, as some SEOs may think. While term importance in a document is a semantic, content- and context-bearing property, keyword competitiveness is strongly influenced by supply-and-demand and ROI dynamics. Term and document targeting, on the other hand, as treated in this article, is about the likelihood of finding specific terms and documents through queries. It is more about selectivity; i.e., about the discriminatory power of a term when queried.
Since these three components (term importance, competitiveness, and selectivity) are related and affect each other, they also affect site traffic. Thus, an optimal term condition should be a function of at least these three; i.e.
Optimal Term Condition = f(importance, competitiveness, selectivity)
An optimum term, then, is not just a semantically valuable or competitive term. It needs to be a term with discriminating power. Other variables may influence the optimal condition of a term. I'm limiting the discussion to these three since they are the most relevant to SEOs and users.
As we will see, targeting techniques are useful when we need to make educated guesses but have no clear prior knowledge about the importance and usage of a term (or terms) within a context, or about the ROI performance or other marketing metrics associated with the term. This is not an extraordinary scenario; it is frequently encountered when we need to optimize documents for new terms and combinations of terms. The following scenarios come to mind:
- SEO work for a recent product or service
- SEO work for documents in other languages
- Business intelligence work
- Researching Competitors
An Idealized Example
For illustration purposes, let's consider the following scenario:
- a user wants to estimate the query behavior of the pre-targeted terms car insurance
- the user submits to a search engine the following queries (Q)
Q1 = k1 = car
Q2 = k2 = insurance
Q12 = k1 + k2 = car insurance
- the queries are submitted using the default FINDALL (AND) mode.
- the search engine retrieves documents by matching query terms to terms found in documents.
- the retrieved documents contain all query terms, anywhere in the text and regardless of order and proximity.
- the following answer sets are retrieved
for Q1, n1 = 8
for Q2, n2 = 7
for Q12, n12 = 3
In terms of c-indices and Venn diagrams, these results can be summarized as follows:
Eq 1: c12 = n12/(n1 + n2 - n12)
thus, c12 = 3/(8 + 7 - 3) = 0.250 or 250 ppt (parts per thousand).

Figure 1. Discretized Venn Diagram.
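The counts in this example can be reproduced with ordinary set operations. The following is a minimal sketch, not part of the original method: the document IDs are hypothetical, chosen only so that the sets have sizes n1 = 8, n2 = 7 and n12 = 3, and the function name c_index is an assumption. The ratio it computes has the same form as the Jaccard coefficient of the two answer sets.

```python
# Hypothetical document IDs, chosen only to reproduce the counts in the example.
car_docs = {1, 2, 3, 4, 5, 6, 7, 8}          # n1 = 8 results for "car"
insurance_docs = {6, 7, 8, 9, 10, 11, 12}    # n2 = 7 results for "insurance"


def c_index(set_a, set_b):
    """Return |A intersect B| / |A union B|, i.e. n12 / (n1 + n2 - n12)."""
    union = set_a | set_b
    if not union:
        return 0.0  # two empty answer sets: treat the index as zero
    return len(set_a & set_b) / len(union)


both = car_docs & insurance_docs              # documents containing both terms
c12 = c_index(car_docs, insurance_docs)

print("n12 =", len(both))
print("c12 =", c12, "or", round(c12 * 1000), "ppt")
```

Working with sets rather than bare counts makes the Venn diagram regions explicit: car_docs - insurance_docs is the "car only" region, and so on.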
Targeting the n1 Answer Set
From Figure 1 it is clear that 3 out of 8 documents from the n1 set contain k1 = car and k2 = insurance. Thus, the probability that a user querying car would retrieve from this set (find) a document containing both terms is
P1 = n12/n1 = 3/8 = 0.375 or 37.5%
and the probability of randomly selecting from this set a document containing k1 = car without the term k2 = insurance is
1 - P1 = 1 - 0.375 = 0.625 or 62.5%
Evidently, it is also true that 1 - P1 = (n1 - n12)/n1 = (8 - 3)/8 = 5/8 = 0.625
That is, 5 documents contain the term car but not insurance, which is the same as stating that 5 out of 8 documents do not contain all pre-targeted terms.
Looking back at the P1 = n12/n1 = 0.375 ratio, this is a containment measure and a conditional probability. As a containment measure, it gives the fraction of the n1 set contained in the n2 set. Thus, 37.5% of n1 is contained in n2.
As a conditional probability this can be stated as
P(k2|k1) = P(k1,k2)/P(k1)
which is the probability of the k1,k2 pair conditioned on the appearance of k1.
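To see why the containment ratio and the conditional probability coincide, the step can be written out as below. This is only a sketch: N, the total number of documents in the collection, is an assumption introduced for illustration and cancels out, so its value does not need to be known.

```latex
P(k_2 \mid k_1)
  = \frac{P(k_1, k_2)}{P(k_1)}
  = \frac{n_{12}/N}{n_1/N}
  = \frac{n_{12}}{n_1}
  = \frac{3}{8}
  = 0.375
```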
Prior Knowledge
Let's put all of this in plain English by considering the case of three users, Paul, Pete and Mary. To simplify, let's stick to the above example.
Suppose that Paul searches for car, Pete searches for insurance and Mary searches for car insurance. The probability that they would select (find) a document containing car and insurance from the above search results is:
37.5% in the case of Paul, by querying car
100% in the case of Mary, by querying car insurance
These are unconstrained probabilities: we are not imposing any external conditions. But what if we do, for instance, by considering users' search behaviors?
Let's assume that Paul, Pete and Mary only care about randomly selecting documents from the top 2 results. Most likely, only Mary, and neither Paul nor Pete, can be certain that the top two documents contain the terms car and insurance. Thus, in her case, the probability of selecting a document with both terms is still 100%.
This would not be the case for Paul or Pete. For instance, in the case of Paul, we don't know a priori if the top two documents mention car only, car insurance only, or if one mentions car and the other mentions car insurance. We simply don't know a priori how the documents will rank by just searching for car.
On the flip side, if I know a priori by monitoring results that consistently 50% of the top N documents mention car and insurance, then the maximum probability of Paul selecting a document with both terms from the top N results by just querying car would be 50% since the rest of the answer set "does not exist" for him.
What about Pete? To determine the probability that he would select (find) a document containing car and insurance by querying insurance we need to inspect the n2 answer set.
Targeting the n2 Answer Set
For the n2 answer set, the probability that Pete would select from this set a document containing the pre-targeted terms k1 = car and k2 = insurance by querying insurance is
P2 = n12/n2 = 3/7 = 0.429 or 42.9%
and the probability of selecting from this set a document containing k2 = insurance without k1 = car is
1 - P2 = 1 - 0.429 = 0.571 or 57.1%
Again, 1 - P2 = (n2 - n12)/n2 = (7 - 3)/7 = 4/7 = 0.571
That is, in the n2 set, 3 out of 7 documents contain all pre-targeted terms k1 = car and k2 = insurance while 4 out of 7 do not. In addition, 42.9% of the n2 set is contained in the n1 set.
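The two containment calculations can be checked side by side with a short script. This is a minimal sketch using the counts from the idealized example; the function name containment is an assumption, and exact fractions are used so that 3/8 and 3/7 are not blurred by rounding.

```python
from fractions import Fraction


def containment(n_joint, n_set):
    """Fraction of an answer set whose documents contain all pre-targeted terms."""
    if n_set <= 0:
        raise ValueError("the answer set must contain at least one document")
    return Fraction(n_joint, n_set)


n1, n2, n12 = 8, 7, 3          # counts from the idealized example

p1 = containment(n12, n1)      # querying "car"
p2 = containment(n12, n2)      # querying "insurance"

print(f"P1     = {p1} = {float(p1):.3f}")
print(f"1 - P1 = {1 - p1} = {float(1 - p1):.3f}")
print(f"P2     = {p2} = {float(p2):.3f}")
print(f"1 - P2 = {1 - p2} = {float(1 - p2):.3f}")
```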
Odds and Odds Ratios from Search Results
The odds of selecting from the n1 set a document containing k1 = car and k2 = insurance are
P1/(1 - P1) = 0.375/0.625 = 0.60 or 3 to 5.
while the odds of selecting such a document from the n2 set are
P2/(1 - P2) = 0.429/0.571 = 0.75 or 3 to 4.
To compare whether the probability of a certain event is the same for two groups (A and B) we compute an odds ratio of the form (Odds in B)/(Odds in A). If this is 1 then the event is equally likely in both groups. An odds ratio greater than one implies that the event is more likely in group B.
Taking A = n1 and B = n2, the odds ratio is
Odds Ratio = 0.75/0.60 = 1.25
That is, the event of finding and selecting a document containing both k1 = car and k2 = insurance is more likely in the n2 set.
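The odds and the odds ratio follow directly from P1 and P2. The sketch below is illustrative only: the helper name odds is an assumption, and the probabilities are the ones from the idealized example, kept as exact fractions.

```python
from fractions import Fraction


def odds(p):
    """Convert a probability into odds in favor of the event: p / (1 - p)."""
    if p == 1:
        raise ValueError("odds are undefined when the probability is 1")
    return p / (1 - p)


p1 = Fraction(3, 8)   # P1 = n12/n1, querying "car"
p2 = Fraction(3, 7)   # P2 = n12/n2, querying "insurance"

odds_n1 = odds(p1)                 # 3 to 5
odds_n2 = odds(p2)                 # 3 to 4
odds_ratio = odds_n2 / odds_n1     # (odds in B) / (odds in A), with A = n1, B = n2

print(f"odds in n1 = {odds_n1} = {float(odds_n1):.2f}")
print(f"odds in n2 = {odds_n2} = {float(odds_n2):.2f}")
print(f"odds ratio = {odds_ratio} = {float(odds_ratio):.2f}")
```

An odds ratio above 1 here means the event is more likely in the n2 set, which matches the conclusion in the text.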
Why Should We Care About This?
The objective of this article is to present a simple technique for targeting terms and documents through queries; i.e., to determine the odds of finding documents containing a group of terms by querying some of the terms. The arguments and reasoning are presented in terms of co-occurrence theory. An idealized case was used for illustration purposes and to introduce important probability concepts. In practice, most searchers are more likely to use multiple terms when querying search engines.
If users consistently select and visit a document URL (a "click through" visit), then the calculations herein described may be relevant to both document discovery through searches and site traffic from search result pages (organic traffic).
In our example, the event of randomly selecting a document containing car and insurance is more likely to occur in n2 than in n1. However, these sets are obtained by querying Q1 = k1 = car and Q2 = k2 = insurance separately. The likelihood of such an event (of selecting a document containing k1 and k2) is greater when querying insurance since this term, appearing in fewer results, is more discriminatory than car.
We can refer to insurance as having more "juice" than car when it comes to finding a document containing both car and insurance in no particular order. This is not surprising. The discriminatory power of a term is a function of the number of documents containing the term. The more documents in which a term appears, the less likely it is to be a discriminating term (3).
Applications to SEO Copywriting
From the optimization standpoint, when optimizing a document for a given search engine you may want to identify the discriminant terms in the target engine (e.g., Google, Yahoo!, ASK or MSN) and then emphasize these more. This makes more sense than trying to overemphasize all candidate key terms equally in a document (over-optimization, keyword spamming) or emphasizing a term just because it is the first term in a phrase.
How about SEO copywriting work?
Let me make an anecdotal reference to a Newsweek article (Dec, 2005) in which the writer interviewed my friend Rand Fishkin (aka randfish) from Seomoz.org. In addition to several compounded misconstructions of facts regarding SEOs, the article has a confusing title: Hotwiring your Search Engine. As Rand states in this SEWF thread:
"I'm very confused as to why the title is "hotwiring your search engine". It sounds like they meant to have it say "hotwiring your search engine rankings", but left it out... Problem is, that title means something entirely different."
Indeed it does. It shows copywriting disconnected from semantics. On-topic analysis applied to Wikipedia's definition of hotwiring reveals the following action descriptors:
bypassing
starting (as in relation to without)
detaching
crossing (as in relation to wires)
smashing
stealing (as in relation to cars)
It is not a surprise that Wikipedia suggests a semantic association between hotwiring and stealing (emphasis added):
"Although hotwiring might be legal if it is performed with the consent of the car's owner, it is usually associated with the crime of stealing the car."
Last time I did the corresponding co-occurrence analysis (12/12/05) Google returned:
428,000,000 results for k1 = car
18,300 results for k2 = hotwiring
9,470 results for k12 = car hotwiring
That is, an estimated P2 = n12/n2 = 9,470/18,300 = 0.52 or 52% of the results for hotwiring are relevant to car hotwiring.
Thus, the probability that someone searching for hotwiring will find documents targeting car hotwiring is indeed very high. In contrast, the probability that someone searching for car will find documents targeting car hotwiring is very small: P1 = n12/n1 = 9,470/428,000,000 = 2.2 x 10^-5 or just 0.0022%. So, if I'm optimizing a document for car hotwiring I would emphasize hotwiring more than car.
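The same comparison can be scripted for any pair of terms once the result counts are known. The following is a minimal sketch using the counts reported above (search engine result counts are themselves estimates); the names term_counts, joint_count and containment_by_term are assumptions introduced for illustration.

```python
# Result counts reported in the text for 12/12/05 (estimates from the engine).
term_counts = {"car": 428_000_000, "hotwiring": 18_300}
joint_count = 9_470   # results for the combined query "car hotwiring"


def containment_by_term(counts, n_joint):
    """For each term, the fraction of its results that also match the combined query."""
    return {term: n_joint / n for term, n in counts.items()}


probs = containment_by_term(term_counts, joint_count)

# The term with the highest containment is the more discriminatory one
# for reaching documents that contain all pre-targeted terms.
most_discriminatory = max(probs, key=probs.get)

for term, p in probs.items():
    print(f"{term}: {p:.2e} ({p:.4%})")
print("emphasize:", most_discriminatory)
```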
As we can see, SEO copywriters and editors can use co-occurrence theory to assess creatives, write-ups and copy while writing with search results in mind. In this way they can target both readers and search engines and get double benefits "for the price of one", so to speak.
Applications to Multi-Term Queries
Co-occurrence theory can also be used when k1 and k2 are sequences of terms (e.g., k1 = San Diego, k2 = insurance quotes). This is important since, as mentioned before, users tend to search using multiple terms.
In this case, we use an approach similar to the one described in our idealized example, but now we think in terms of content-bearing sequences, whose discriminatory power is based on context counts instead of document counts or word counts. Here we need to resort to FINDALL and EXACT searches and apply the corresponding methods and procedures. This, along with real queries, will be covered in future articles.
References
1. Keywords Co-Occurrence and Semantic Connectivity; E. Garcia (2005).
2. C-Indices and Measures of Associations; E. Garcia (2005).
3. A survey on the use of relevance feedback for information access systems; Knowledge Engineering Review, 18(1); Ian Ruthven and Mounia Lalmas (2003).

