Mi Islita

Co-Occurrence and the Scope of Terms

"There are many reasons why co-occurrence values between two terms can be low or high: term associations, the nature of the terms, external factors and the scope of the terms."

Dr. E. Garcia
Mi Islita.com
Last Update: 12/19/05

Article 4 of the series Keywords Co-Occurrence and Semantic Connectivity

Topics

Review

C-Indices and Probabilities

Scenario I

Scenario II

Scenario III

From Non Linear to Linear Regimes

An Example

What does this mean?

The Scope Effect: Handling of Outliers

Deviations due to Editorial Guidelines

Conclusion

References

Review

In Article 3 of this series we discussed the targeting of documents and queries through co-occurrence theory (1 - 3). The discussion was limited to pairwise co-occurrence under ideal conditions; i.e., we assumed that

where k12 stands for k1 followed by k2 (i.e., k12 = k1 + k2).

Without losing generality, we learned that:

Figure 1 summarizes these ratios.

Figure 1. Probabilities, Odds and Odds Ratios in pairwise queries.

However, in Article 3 we did not explain how these ratios relate to co-occurrence indices.

This article serves two purposes. One is to show readers the connection between co-occurrence and probability measures. The other is to explain the emergence of apparent outliers in c-index calculations and how these can be handled. The material is organized as follows.

Explicit approximations for three co-occurrence scenarios are derived, and from these, extreme cases are considered. We then discuss the case of low probability values, in which c12-indices and conditional probabilities are linearly related. This is followed by several examples where outliers are inspected. We end with some advice for the proper handling of these outliers and suggestions relevant to the optimization of documents ("search engine optimization"). Since the material expands on Article 3, you may want to revisit that work first.

C-Indices and Probabilities

Starting from the definition for a pairwise co-occurrence index

Equation 1:  c12-index = n12/(n1 + n2 - n12)


Dividing by n12 gives

Equation 2:  c12-index = 1/(n1/n12 + n2/n12 - 1)

From Figure 1 this is the same as writing

Equation 3:  c12-index = 1/(1/P1 + 1/P2 - 1)

where we formally define P1 as P(k2|k1) = P(k1,k2)/P(k1). This is the probability of the k1,k2 pair conditioned on the appearance of k1; P2 is defined analogously as P(k1|k2) = P(k1,k2)/P(k2). Equation 3 shows that the c12-index is non linearly related to both P1 and P2.
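As a quick numerical check, the following Python sketch computes the index both from raw counts and from the two conditional probabilities. It assumes c12-index = n12/(n1 + n2 - n12), P1 = n12/n1 and P2 = n12/n2, which agree with the worked example later in this article. The function names and the counts are hypothetical and serve only as an illustration.

```python
def c_index_from_counts(n1, n2, n12):
    """c12-index from result counts, assuming c12 = n12 / (n1 + n2 - n12)."""
    return n12 / (n1 + n2 - n12)


def c_index_from_probs(p1, p2):
    """Same index written with P1 = n12/n1 and P2 = n12/n2 (assumed definitions)."""
    return 1.0 / (1.0 / p1 + 1.0 / p2 - 1.0)


# Hypothetical counts, not taken from any search engine.
n1, n2, n12 = 1000, 400, 100
p1, p2 = n12 / n1, n12 / n2

print(c_index_from_counts(n1, n2, n12))
print(c_index_from_probs(p1, p2))
```

Both calls should print the same value up to floating-point rounding, since the second form is just the first one divided through by n12.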

From the P1/P2 ratio three scenarios are possible:

  1. Scenario I: that P1/P2 = 1, when P1 = P2 and n1 = n2
  2. Scenario II: that P1/P2 > 1, when P1 > P2 and n1 < n2
  3. Scenario III: that P1/P2 < 1, when P1 < P2 and n1 > n2

These scenarios give rise to several extreme cases.
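Because P1/P2 = n2/n1 when P1 = n12/n1 and P2 = n12/n2, the scenario can be read directly from the two single-term counts. The helper below is a minimal sketch of that classification; the function name is made up for illustration.

```python
def scenario(n1, n2):
    """Classify a term pair as Scenario I, II or III from its result counts.

    Assumes P1 = n12/n1 and P2 = n12/n2, so that P1/P2 = n2/n1 and
    n12 cancels out of the comparison.
    """
    if n1 == n2:
        return "I"
    return "II" if n1 < n2 else "III"


# Counts for car (n1) and hotwiring (n2) quoted later in this article.
print(scenario(428_000_000, 18_300))  # prints "III"
```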

Scenario I

If P1 = P2, the fraction of documents in n1 relevant to k1 targeting k12 and the fraction of documents in n2 relevant to k2 targeting k12 are the same. Thus, there are equal chances of finding a document that targets k12 by querying either term.

Equation 3 reduces to

Equation 4:  c12-index = P1/(2 - P1)

The relationship between the c12-index and the conditional probabilities is still non linear. However, when P1 = P2 and both are very small, we can safely drop the negligible terms and Equation 4 reduces to

Equation 5:  c12-index = P1/2

which now describes a linear relationship between the c12-index and the probabilities. Although uncommon, Scenario I and this extreme case are not theoretically impossible. However, Scenarios II and III are more common than Scenario I. These can be used to explain the emergence of apparent outliers in routine c12-index calculations.
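To see how quickly the linear form takes over in Scenario I, here is a small sketch. It assumes the index can be written as 1/(1/P1 + 1/P2 - 1), which for P1 = P2 = p becomes p/(2 - p), and compares it with the linear form p/2. The function names are hypothetical.

```python
def c_index_equal_probs(p):
    """c12-index when P1 = P2 = p, i.e. 1/(1/p + 1/p - 1) = p/(2 - p)."""
    return p / (2.0 - p)


def c_index_linear(p):
    """Linear approximation for very small p."""
    return p / 2.0


for p in (0.5, 0.1, 0.01, 0.001):
    exact = c_index_equal_probs(p)
    approx = c_index_linear(p)
    rel_err = (exact - approx) / exact  # algebraically equal to p/2
    print(f"p={p:<6} exact={exact:.6f} linear={approx:.6f} rel_err={rel_err:.4f}")
```

Under these assumptions the relative error of the linear form is p/2, so it shrinks in direct proportion to the probability.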

Scenario II

In Scenario II, an extreme case is obtained when P1/P2 >> 1. This reduces Equation 3 to

Equation 6:  c12-index = P2/(1 - P2)

This occurs when n1 << n2; i.e., k1 returns far fewer results than k2. This happens either because k1 is far more discriminatory than k2 or because there are few documents relevant to k1 in the database. k2, on the other hand, is either a broader term or a term with far more relevant documents in the database, which reduces the value of P2. If P2 is very small we can safely drop it from the denominator of this expression.

Equation 6 reduces to

Equation 7:  c12-index = P2 = n12/n2

From the on-topic analysis standpoint, these extreme c12-indices are observed in two-term queries when the first term, k1, is a very restrictive, narrower term, while k2 is a broader term or a term with poor discriminating power (e.g., a common term, a stopword).

Scenario III

Scenario III is similar to Scenario II but with roles reversed.

In Scenario III the following extreme case is obtained when P1/P2 << 1:

Equation 8:  c12-index = P1/(1 - P1)

This occurs when n1 >> n2; i.e., k1 returns far more results than k2. This happens because k1 is either a broader term or a term with more relevant documents in the database. k2, on the other hand, is either a much narrower term or a term with far fewer relevant documents in the database than k1. The value of P1 should therefore be small. If P1 is very small, we can drop it from the denominator of this equation and the c12-index reduces to

Equation 9:  c12-index = P1 = n12/n1

Again, from the on-topic analysis standpoint, an example of this would be the case in which k2 is a much narrower term while k1 is a broader or very common term.
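The extreme cases of Scenarios II and III can be checked numerically. The sketch below uses made-up counts with very different scopes and compares the index computed from counts, assumed here to be n12/(n1 + n2 - n12), with the odds P/(1 - P) and with the plain conditional probability P of the broader term. All names and numbers are hypothetical.

```python
def c_index(n1, n2, n12):
    """c12-index from counts, assuming c12 = n12 / (n1 + n2 - n12)."""
    return n12 / (n1 + n2 - n12)


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


# Hypothetical Scenario III counts: k1 is broad, k2 is narrow.
n1, n2, n12 = 1_000_000, 1_000, 200
p1 = n12 / n1
print("Scenario III:", c_index(n1, n2, n12), odds(p1), p1)

# Swapping the roles of the two terms gives a Scenario II pair.
m1, m2 = n2, n1
p2 = n12 / m2
print("Scenario II: ", c_index(m1, m2, n12), odds(p2), p2)
```

With counts this far apart, the three quantities on each line differ by well under one percent.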

From Non Linear to Linear Regimes

From Figure 1 and Equations 6 and 8 it is clear that in Scenarios II and III the c12-index reduces to odds, for which the relationship between c-indices and probabilities is still non linear. As these scenarios approach their extreme cases (Equations 7 and 9), the c-indices reduce to mere conditional probabilities, exhibiting an almost linear relationship with these. Figure 2 shows the transition from non linearity to linearity in terms of P1.

Figure 2. Linear and Non Linear Regimes in c12-index.

Note that when P1 is equal to or less than 0.10 an almost linear regime is achieved. A similar plot with roles and arguments reversed can be constructed for P2. This is not surprising since expressions of the form x/(y - x) with y > x behave almost linearly as x vanishes.
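The behavior of x/(y - x) for small x can be tabulated with a few lines of Python. This sketch takes y = 1, so the expression is the odds P/(1 - P), and reports how far the odds depart from P itself; the probability values are arbitrary.

```python
def odds(p):
    """Odds p / (1 - p) for a probability p."""
    return p / (1.0 - p)


for p in (0.01, 0.05, 0.10, 0.25, 0.50):
    excess = (odds(p) - p) / p  # algebraically equal to p / (1 - p)
    print(f"P={p:.2f}  odds={odds(p):.4f}  excess over P={excess:.1%}")
```

At P = 0.10 the odds exceed P by about 11%, while at P = 0.50 they are twice P, which is why the 0.10 mark is a reasonable place to draw the boundary of the almost linear regime.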

An Example

To illustrate, let's revisit Article 3 and the car hotwiring example, where the following data was obtained from Google:

428,000,000 results for k1 = car
18,300 results for k2 = hotwiring
9,470 results for k12 = car hotwiring

This gives a c12-index value of

9,470/(428,000,000 + 18,300 - 9,470) = 2.2 x 10^-5 or 0.022 ppt (parts per thousand)

By just looking at this result one may think that the terms are too poorly associated to be used in the copy. After all, the co-occurrence index is very small.

The fact is that this is a small number, not because the terms are not semantically related but because the first term k1 = car, being too broad, masks the magnitudes of the other two quantities, k2 = hotwiring and k12 = car hotwiring, effectively introducing a linearity that is not present in Equation 3. Under these circumstances the computed index behaves as a mere conditional probability. Indeed, from Equation 9 of Scenario III we obtain

P1 = n12/n1 = 9,470/428,000,000 = 2.2 x 10^-5

What does this mean?

What does this mean for an SEO? Well, the fraction of documents relevant to car targeting car hotwiring is small in comparison with the fraction of documents relevant to hotwiring targeting car hotwiring.

The probability that someone searching for car will retrieve (find) documents targeting car hotwiring is very small: 2.2 x 10^-5 or 0.0022%. In contrast, the probability that someone searching for hotwiring will retrieve (find) documents targeting car hotwiring is very high: P2 = n12/n2 = 9,470/18,300 = 0.5174 or almost 52%. One would expect this to affect site traffic in some way.
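The arithmetic for the car hotwiring example can be reproduced with a short script. It uses only the three counts quoted above and the definitions P1 = n12/n1 and P2 = n12/n2; the variable names are mine.

```python
# Result counts quoted in this article for car, hotwiring and car hotwiring.
n1, n2, n12 = 428_000_000, 18_300, 9_470

c12 = n12 / (n1 + n2 - n12)  # pairwise co-occurrence index
p1 = n12 / n1                # fraction of "car" results targeting the pair
p2 = n12 / n2                # fraction of "hotwiring" results targeting the pair

print(f"c12-index = {c12:.2e} ({c12 * 1000:.3f} ppt)")
print(f"P1 = {p1:.2e}")
print(f"P2 = {p2:.1%}")
```

The index and P1 agree to the precision quoted in the text, which is the numerical signature of a pair sitting in the linear regime.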

From the optimization standpoint, if I want to optimize a document for car hotwiring I would probably emphasize hotwiring more than car in the copy. This is sounder than trying to equally overemphasize both terms (keyword spam, overoptimization).

Revisiting Equation 3, this expression tells us that when k1 and k2 are similar in scope or return results of similar magnitudes, the computed c12-index is not a conditional probability but depends on the relative fraction of documents in n1 and n2 containing both terms. This dependency is not present when k1 and k2 are very dissimilar in scope or return very different numbers of results.

Since c12-indices can behave as probabilities, as odds, or as prescribed by Equation 3, when comparing co-occurrence data one must be certain that comparisons are not made between c12-indices from different regimes (linear vs non linear). In this sense, trying to compare c12-indices by merely using a "c-index calculator" or a sliding scale is not a wise thing to do. One needs to understand when and why some combinations give rise to apparent outliers.

The Scope Effect: Handling of Outliers

Since terms which co-occur more frequently tend to be related, we can inspect how related two terms are by measuring n12, the number of documents relevant to both terms. Poor co-occurrence means that n12 should be a small value. But n12 affects both c12-indices and the probabilities, P1 and P2. We expect that as n12 decreases all these ratios should decrease.

In practice, we observe special cases in which the c12-index is small, yet one probability is small while the other is high. This occurs when there is a large difference between n1 and n2; i.e., when they differ by one order of magnitude or more. These cases give rise to apparent outliers that cannot be rejected on the grounds of their c12-indices alone.

A high probability in one of the answer sets (n1 or n2) means that the set contains a considerable fraction of documents targeting both k1 and k2. From the optimization standpoint (e.g., phrase selection, document targeting), this cannot be overlooked and the apparent outliers should be retained. To illustrate this point, I have collected co-occurrence data for some apparent outliers and displayed these in Figure 3. Probability values are given up to 3 decimal places to simplify presentation. For comparison purposes, I've also included cases with high c12-indices that are not outliers. As usual, c-index values are given in parts per thousand (ppt).

Figure 3. Pairwise Co-Occurrence Data.

Here I have separated the cases into two blocks based on the relative values of n1 and n2.

Note, for the n1 > n2 block, the presence of the apparent outliers american airlines and american idol. Compared with the rest in this block they exhibit low c12-indices. However, almost 42% of documents relevant to airlines target american airlines while almost 66% of documents relevant to idol target american idol. Still, their c12-indices are small since american is a very broad term. Their |log(n1/n2)| values are indeed greater than 1. The "scope effect" clearly affects the notion of relatedness that one would otherwise estimate from the c12-index.

The n1 < n2 block includes more outliers. In particular, note that tivo online, reggaeton music (a hot Latino dance genre that originated in "mi islita" Puerto Rico by combining reggae and rap) and sierpinski triangle all exhibit low c-indices. However, almost 50% of the documents relevant to tivo target tivo online, 74% of the documents relevant to reggaeton target reggaeton music, and 22% of the documents relevant to sierpinski target sierpinski triangle. The "scope effect" is clear, as can be seen from the corresponding columns for n2 and |log(n1/n2)|. In this case these outliers should not be rejected by just taking their c12-indices at face value.

The last column, the odds ratio, gives the likelihood of finding a document containing k12 in n1 or n2. As defined in this article, when an odds ratio is greater than 1 the event is more likely in the n2 set. As expected, this occurs when n1 > n2. For american idol this ratio is 170. Thus, documents targeting american idol are more likely to be found by users searching for idol than for american.
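For readers who want to compute this column themselves, here is a hedged sketch. It assumes the odds ratio is the odds of k12 in the n2 set divided by the odds of k12 in the n1 set, an orientation chosen because it is greater than 1 exactly when n1 > n2, as stated above. The function names are hypothetical, and the counts are those of the car hotwiring example rather than values from Figure 3.

```python
def odds(p):
    """Odds p / (1 - p) for a probability p."""
    return p / (1.0 - p)


def odds_ratio(n1, n2, n12):
    """Odds of k12 in the n2 set over the odds of k12 in the n1 set.

    Assumed orientation; requires n12 < n1 and n12 < n2.
    """
    return odds(n12 / n2) / odds(n12 / n1)


ratio = odds_ratio(428_000_000, 18_300, 9_470)
print(f"odds ratio = {ratio:,.0f}")
```

With this orientation the ratio simplifies to (n1 - n12)/(n2 - n12), so swapping the two terms simply inverts it.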

When can we safely reject outliers? Here we may need to make a decision based on empirical judgements. As a rule of thumb, I generally reject an outlier when two conditions are fulfilled:

  1. the absolute log of the n1/n2 ratio is 1 or greater than 1; i.e., |log(n1/n2)| >= 1
  2. both P1 and P2 are less than 0.10; i.e., P1 < 0.1 and P2 < 0.1

The first condition accounts for any divergence in magnitudes between n1 and n2 and the second is derived from Figure 2. Based on these criteria, the only outlier from Figure 3 I would reject (that is, not optimize copy for) is aloha indiana.
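The two conditions translate into a small predicate. This is only a sketch of my rule of thumb, assuming a base-10 logarithm and P1 = n12/n1, P2 = n12/n2; the function name, parameter names and the second set of counts are hypothetical.

```python
import math


def reject_outlier(n1, n2, n12, log_gap=1.0, p_max=0.10):
    """Return True when both rejection conditions hold.

    1. |log10(n1/n2)| >= log_gap   (base 10 is an assumption)
    2. P1 < p_max and P2 < p_max
    """
    p1, p2 = n12 / n1, n12 / n2
    scope_gap = abs(math.log10(n1 / n2))
    return scope_gap >= log_gap and p1 < p_max and p2 < p_max


# car hotwiring: large scope gap, but P2 is about 0.52, so it is retained.
print(reject_outlier(428_000_000, 18_300, 9_470))

# Made-up pair: large scope gap and both probabilities below 0.10.
print(reject_outlier(1_000_000, 1_000, 50))
```

The thresholds are exposed as parameters because they are empirical; a different collection may call for different cut-offs.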

Deviations due to Editorial Guidelines

The discussion provided herein is aimed at understanding how the scope of terms affects pairwise co-occurrence calculations from commercial search engines. With IR systems and directories dedicated to specialty topics or reviewed by indexers our observations may not apply, since editorial guidelines may favor specific terms. For instance, the Medline Tutorial from the McGill University Health Sciences Library (4, 5) advises its users to consider the following (emphasis added):

"Remember that indexers have been mandated to index articles with the most specific MeSH heading available. Therefore, very specific, narrow terms indented below broad subject headings often generate higher retrievals since most authors tend to publish papers on precise, well defined areas of medicine."

"Examine the Tree structure above. Note that the broadest term, Colorectal Neoplasms, yields 9492 references while the narrower term, Colonic Neoplasms, retrieves 9740 references. This is not surprising since most authors write about specific medical conditions rather than more general ones."

Searching these systems is not the same as searching in Google. In fact, as of 12/17/05 and unlike Medline, a search in Google returned more results for colorectal neoplasms (682,000) than for colonic neoplasms (576,000). As expected, the broader term returned more results. Note also that in the Medline system the two-word phrase colorectal neoplasms is treated as a single broader term.

Conclusion

There are many reasons why co-occurrence values between any two terms can be either low or high. One reason is the presence or lack of semantic associations between terms.

From Figure 3 it is clear that aloha is more associated with hawaii than with indiana. This can be seen by comparing their c-indices: aloha hawaii c12-index = 18.30 ppt while aloha indiana c12-index = 3.33 ppt. Their c-indices are far apart, not because of scope differences (Hawaii and Indiana are both states of the U.S.A.) but because one thinks of Hawaii, not of Indiana, when hearing the term "aloha". Thus, one would expect more documents co-citing "aloha" with "Hawaii". The difference in c-indices in this case is due to a semantic association effect.

Other reasons that might affect co-occurrence are external factors, such as natural disasters, world events, editorial guidelines, the launching of a new product or service, or even seasonal trends. All these can influence the number of document entries found by querying a database. For blogosphere studies, add "memes" and other dynamic factors like virtual community behaviors.

Another reason for obtaining low or high c12-index values is ontology; i.e., the nature of the terms involved. Whether we are dealing with nouns, verbs, or articles does matter. Whether we are dealing with specific combinations of terms (noun-noun, noun-verb, verb-noun, synonyms, or homographs, which share a spelling but differ in meaning) is also important.

Yet another reason is term scope. I wanted to single out this topic, since the scope of terms in co-occurrence analysis and association studies is often overlooked.

Next: Temporal Co-Occurrence: How does a Developing Event Affects Search Results?

Prev: Targeting Documents and Terms Through Co-Occurrence Data

References
  1. Keywords Co-Occurrence and Semantic Connectivity; E. Garcia (2005).
  2. C-Indices and Measures of Associations; E. Garcia (2005).
  3. Targeting Documents and Terms Through Co-Occurrence Data; E. Garcia (2005).
  4. Medline Tutorial: Broader/Narrower; McGill University.
  5. McGill University Health Sciences Library.

Copyright © 2006 Mi Islita.com