Keywords Co-Occurrence and Semantic Connectivity
An Introductory Series on Co-Occurrence Theory for Information Retrieval Students and Search Engine Marketers
Dr. E. Garcia
Mi Islita.com
Last Update: 06/28/05
Article 1 of the series Keywords Co-Occurrence and Semantic Connectivity
Topics
About this Series
Why Co-Occurrence is Important
Understanding Co-Occurrence
Broader and Narrower Terms
Global Co-Occurrence
Normalized Co-Occurrence
Avoiding Blindfold Calculations
References
About this Series
Note. I am currently updating and expanding this series of articles to include additional material and new advances in co-occurrence theory. So far, only Articles 1 to 3 have been updated. Enjoy the experience.
This is the original series that started all the buzz about the use of keywords co-occurrence and semantic connectivity concepts in the search engine marketing community. Before delving into these concepts I would like to explain the origins of this series.
After the dotcom bust and a few unexpected misfortunes in the U.S., I returned to Puerto Rico, my little island ("mi islita", in Spanish) and safe haven. It was the Spring of 2001. As part of the healing process I spent my time working out of my garage on a query routing system that executes remote searches. Recall, precision and self-similar (fractal) issues encountered during the project forced me to think about the concept of co-occurrence and word patterns in repositories that reside not in an IR system under controlled lab conditions but in an environment full of commercial noise: the Web. C-indices and EF-Ratios were born.
I formally presented a detailed explanation of pairwise co-occurrence at two search engine conferences I co-organized with local universities during the summers of 2002 and 2003 and at lectures given at the Graduate School of Business of the University of Turabo, Gurabo, PR. During the summer of 2004 I introduced these concepts to the search engine marketing community in a thread titled Keywords Co-Occurrence and Semantic Connectivity (1). Since then, many have incorporated co-occurrence and word pattern concepts into their optimization mix. I decided to write, expand, and rewrite this series of articles so that more search engine marketers can understand co-occurrence concepts and incorporate them into their marketing projects.
Why Co-Occurrence is Important
In a nutshell, terms, stems, and concepts that co-occur more frequently tend to be related. For instance, when we hear the term "aloha" we immediately think of Hawaii, not of Montana or Indiana. This is a semantic association between two terms.
Why should we care about co-occurrence? Some compelling reasons are:
- Keyword-brand associations.
- Brand visibility across search engines.
- Co-citation of products and services.
- Search volume co-occurrence (Co-Volume).
- Positioning of documents in search results pages (SERPs).
- Keywords research and terms discovery.
- Analysis of seasonal trends.
- Design of thematic sites.
In Computation of Word Associations Based on the Co-Occurrences of Words in Large Corpora, Manfred Wettler and Reinhard Rapp (University of Paderborn) found that norm associations based on word co-occurrence can be used to (2):
- generate search terms for document retrieval in bibliographic databases.
- predict what effects word usage in advertisements has on people.
- find the correct translations for semantically ambiguous words.
Clearly, search engine marketers can generate important Web analytics from co-occurrence data.
Co-occurrence concepts are useful in other fields. For instance, co-occurrence measures facilitate the work of developers interested in building thesaurus-based applications and query expansion models. Word co-citation is also important to business intelligence and homeland security analysts interested in monitoring word patterns that evolve in time or that fluctuate in a given Web community. There are dozens of reasons for measuring co-occurrence and word patterns in other fields.
Understanding Co-Occurrence
Depending on the source, co-occurrence can be
- Global; extracted from databases
- Local; extracted from individual documents
- Fractal; extracted from self-similar, scaled distributions
The theoretical framework is different in each case. In addition, co-occurrence data can be query-sensitive, as found in commercial search engine databases. This series focuses on this type of global co-occurrence.
Co-occurrence data can be used to extract lists of related terms or lists of synonyms. I must emphasize that the scope and nature of the discovered terms, the relationships between them, and their environment all affect the type of information that one can extract from co-occurrence sources.
Among others, the following factors matter when working with co-occurrence data:
- scope; i.e., whether the words behave as broader or narrower terms in a given context.
- type; i.e., whether we are dealing with nouns, verbs, adjectives, stems, etc.
- synonymity; i.e., whether we are dealing with synonyms.
- architecture; i.e., whether the documents reside in a horizontal, topic-specific vertical, or regional directory.
- seasonality; i.e., whether we are dealing with repositories containing seasonal trends and periodic fluctuations.
- sequencing; i.e., the order in which terms are queried or appear in documents.
- polysemy; i.e., whether we are dealing with terms with multiple meanings.
- cognates; i.e., whether we are dealing with different terms with the same meaning in different languages.
- query modes; i.e., the retrieval modes used.
- other reasons not listed here.
As we can see, a comprehensive understanding of the co-occurrence phenomenon is necessary. Thus, the practice of blindly computing co-occurrence values from "c-index calculators" or extracting correlations from "c-index tables" without a clear understanding of the nature and relationship between words and their environment is misleading.
To illustrate, word association information for terms that are not synonyms can be extracted by applying first-order statistics to the frequencies of word co-occurrences in documents. However, discovering and generating lists of synonyms requires second-order statistics, since synonyms rarely occur together but appear in similar contexts. In some cases, second- and higher-order statistics may also be required to deal with a phenomenon known as transitivity.
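The distinction between first- and second-order statistics can be sketched with a toy example. The corpus below is hypothetical, and using cosine similarity between co-occurrence profiles is only one assumed way of computing a second-order measure; it is not presented as the method used in this series.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Hypothetical toy corpus: "dog" and "canine" never share a document,
# but they appear next to the same neighboring words.
docs = [
    "dog barks at the mailman",
    "canine barks at the mailman",
    "dog chews bone",
    "canine chews bone",
    "cat chases mouse",
]
tokenized = [set(d.split()) for d in docs]

# First-order statistic: number of documents in which two terms co-occur.
pair_counts = Counter()
for terms in tokenized:
    for a, b in combinations(sorted(terms), 2):
        pair_counts[(a, b)] += 1

def first_order(a, b):
    return pair_counts[tuple(sorted((a, b)))]

def context_vector(term):
    """Count how often every other word shares a document with `term`."""
    vec = Counter()
    for terms in tokenized:
        if term in terms:
            for other in terms - {term}:
                vec[other] += 1
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Second-order statistic: similarity between the co-occurrence profiles.
def second_order(a, b):
    return cosine(context_vector(a), context_vector(b))

print(first_order("dog", "canine"))             # 0: the terms never co-occur
print(round(second_order("dog", "canine"), 3))  # 1.0: identical contexts
print(round(second_order("dog", "cat"), 3))     # 0.0: no shared contexts
```

In this toy data the two synonyms have a first-order count of zero yet identical context profiles, which is the pattern the paragraph above describes.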
Since contextuality affects co-citation frequency, it is not strange to obtain low co-occurrence data when querying a search engine for combinations of
- synonyms.
- invented terms.
- dictionary terms.
- broader and narrower terms.
- certain groups of terms (NV, VN, etc.).
To sum up, contextuality does matter.
Broader and Narrower Terms
Extracting co-occurrence data without a clear understanding of the underlying theory can induce some to draw wrong conclusions. A simple example illustrates this point. Suppose we want to do a query expansion for the noun term "dog" using two other nouns, "pet" and "canine". To identify keyword terms, let's use the letter k and numbered subscripts. Let's now construct two possible co-occurrence scenarios and for clarity let's ignore for a moment other possible scenarios, term combinations and sequences. Thus,
- scenario 1: k1 = dog, k2 = pet
- scenario 2: k1 = dog, k2 = canine
As of 06/16/05, searches in Google for these terms return
- 53,400,000 results for dog
- 55,800,000 results for pet
- 3,570,000 results for canine
"dog" and "pet" return far more results than "canine". This is not surprising and has a lot to do with the scope of the terms. Unlike "canine", "dog" and "pet" are broad in scope. We call these broader terms. By contrast, "canine" is more restrictive and specific. Terms limited in their scope in relation to other terms or contexts are called narrower terms. Note that
- there is a synonymity relationship between "canine" and "dog" but not between "canine" and "pet" or "pet" and "dog".
- "canine" has different meanings (polysemy). According to WordNet, "canine" can be used as a noun or adjective, each having different meanings (3).
- "canine" is one of those terms that possess a meaning within a meaning. To illustrate, we can talk about the "canine of a canine", the "radio of a radio", and similar expressions. Used in this manner, the terms behave as having a scope within a scope (or a context within a context, i.e., fractality).
Thus, the above scenarios can be viewed as combinations of
- broader noun term + broader noun term = NbNb
- broader noun term + narrower noun term = NbNn
Let's now examine how the nature of these terms affects their global co-occurrence in Google.
Global Co-Occurrence
In Google and most search engines, the default query mode is FINDALL, also known as AND. In this mode searches are conducted regardless of term order and proximity. This means that the system tends to return results in which ALL the queried terms appear in the documents, anywhere and in no particular order.
As of 06/16/05 searches in Google for these terms return
- scenario 1: 12,800,000 for the query, k12 = k1 + k2 = dog pet
- scenario 2: 1,710,000 for the query, k12 = k1 + k2 = dog canine
Now both queries return fewer documents. This is not surprising, since the new set of results n12, containing both k1 and k2, must be a subset of both n1 and n2; i.e., of the sets returned for k1 alone and for k2 alone.
In scenarios 1 and 2, each n12 set can be taken as an estimate of the overall number of documents in Google containing all the queried terms. The expression "global co-occurrence" is therefore appropriate.
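The FINDALL behavior and the subset relationship can be illustrated with a minimal sketch. The five-document collection and the `find_all` helper are hypothetical stand-ins for a search engine index, not a description of how Google works internally.

```python
# Hypothetical mini-collection standing in for a search engine index.
docs = {
    1: "my dog is a friendly pet",
    2: "canine teeth of a dog",
    3: "pet food for cats",
    4: "dog training tips",
    5: "the pet dog slept",
}
index = {doc_id: set(text.split()) for doc_id, text in docs.items()}

def find_all(query):
    """Return ids of documents containing ALL query terms, in any order."""
    terms = query.split()
    return {doc_id for doc_id, words in index.items()
            if all(t in words for t in terms)}

n1 = find_all("dog")       # documents containing k1
n2 = find_all("pet")       # documents containing k2
n12 = find_all("dog pet")  # documents containing both k1 and k2

print(sorted(n1), sorted(n2), sorted(n12))  # [1, 2, 4, 5] [1, 3, 5] [1, 5]
print(n12 <= n1 and n12 <= n2)              # True: n12 is a subset of both
print(find_all("pet dog") == n12)           # True: term order is irrelevant
```

Because every document in n12 must also satisfy each single-term query, the combined query can never return more documents than the smaller of n1 and n2.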
Let's now interpret these results. The term "dog" is more frequently co-cited with "pet" than with "canine" since:
- in scenario 1 we are combining two broader terms.
- in scenario 1 the terms are not synonyms.
- in scenario 2 we are combining a broader term with a narrower term.
- in scenario 2 the terms are synonyms and synonyms rarely occur together but appear in similar contexts.
Normalized Co-Occurrence
The global co-occurrence I have described is an absolute or unnormalized metric. Note that different queries (k1, k2, and k12) retrieve different sets of documents (n1, n2, and n12). To compare term co-occurrences across different queries and sets of retrieved documents, one needs to normalize these counts to a practical scale, for instance one running from 0 to 1, so that relative, normalized co-occurrence values can be compared.
This normalized co-occurrence is what I define as the "Co-Occurrence Index" or C-index. In the case of pairwise co-occurrence, i.e., co-citation frequency between two and only two terms k1 and k2, the C-index is given by
Eq 1: $c_{12} = \dfrac{n_{12}}{n_1 + n_2 - n_{12}}$
where
- c12 = 0 when n12 = 0; i.e., k1 and k2 do not co-occur (terms are mutually exclusive).
- c12 > 0 when n12 > 0; i.e., k1 and k2 co-occur (terms are not mutually exclusive).
- c12 = 1 when n12 = n1 = n2; i.e., k1 and k2 co-occur whenever either term occurs.
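As a minimal sketch, the pairwise C-index can be written as a small function. The form n12 / (n1 + n2 - n12) is taken from the worked examples later in this article; the function name `c_index` and the input validation are illustrative assumptions.

```python
def c_index(n1, n2, n12):
    """Pairwise co-occurrence index: c12 = n12 / (n1 + n2 - n12).

    n1, n2 -- number of documents containing k1 and k2, respectively
    n12    -- number of documents containing both terms
    """
    # n12 counts a subset of both result sets, so it cannot exceed either.
    if n12 < 0 or n12 > min(n1, n2):
        raise ValueError("n12 must lie between 0 and min(n1, n2)")
    union = n1 + n2 - n12  # documents containing k1 or k2 (or both)
    return n12 / union if union else 0.0

print(c_index(10, 20, 0))  # 0.0: terms never co-occur
print(c_index(10, 20, 5))  # 0.2: partial co-occurrence
print(c_index(7, 7, 7))    # 1.0: terms always occur together
```

The three calls correspond to the three boundary conditions listed above.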
Since c-index values tend to be very small quantities, I like to multiply Equation 1 by a thousand and express the final result in parts per thousand (ppt). So the working scale runs from 0 to 1000 ppt. This is merely done to have an expanded scale where small values are easier to compare and contrast. In a future article of this series I will present a complete derivation of Equation 1, along with c-indices for high-order co-occurrence (co-occurrence involving multiple terms). I will also explain the difference between c-indices and other types of similarity measures (Jaccard's, Dice's, and Salton's coefficients, among others). For now this is all we need to know.
Applying Equation 1 to our previous examples gives
- scenario 1: (12,800,000/(53,400,000 + 55,800,000 - 12,800,000))*1000 = 132.7801 = 133 ppt
- scenario 2: (1,710,000/(53,400,000 + 3,570,000 - 1,710,000))*1000 = 30.9446 = 31 ppt
Taking ratios reveals that "dog" co-occurs about seven times more often with "pet" than with "canine" in raw counts (12,800,000/1,710,000 ≈ 7.5), or about four times more often in terms of c-indices (133/31 ≈ 4.3).
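The arithmetic above can be reproduced with a short script. The counts are the Google figures quoted in this article; the helper name `c_index_ppt` and the dictionary layout are illustrative choices.

```python
def c_index_ppt(n1, n2, n12):
    """C-index expressed in parts per thousand (ppt)."""
    return 1000 * n12 / (n1 + n2 - n12)

# Result counts quoted in the article (Google, 06/16/05).
counts = {"dog": 53_400_000, "pet": 55_800_000, "canine": 3_570_000}
pair_counts = {("dog", "pet"): 12_800_000, ("dog", "canine"): 1_710_000}

results = {
    pair: c_index_ppt(counts[pair[0]], counts[pair[1]], n12)
    for pair, n12 in pair_counts.items()
}
for pair, ppt in results.items():
    print(pair, f"{ppt:.4f} ppt")  # 132.7801 ppt and 30.9446 ppt

# Ratio of raw co-occurrence counts versus ratio of c-indices.
raw_ratio = pair_counts[("dog", "pet")] / pair_counts[("dog", "canine")]
c_ratio = results[("dog", "pet")] / results[("dog", "canine")]
print(f"raw ratio: {raw_ratio:.2f}, c-index ratio: {c_ratio:.2f}")
```

The two ratios differ because normalization accounts for the very different sizes of the "pet" and "canine" result sets.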
Avoiding Blindfold Calculations
I want to finish this first article by highlighting some pitfalls we should avoid when computing co-occurrence. Scenario 2 involves a broader and a narrower term, where large and small quantities are added and subtracted. In these cases, large quantities tend to mask the contribution of small quantities. This in turn results in small c-index values. Also, as mentioned before, scenario 2 involves synonyms. Synonyms tend to have low co-citation frequency due to contextuality issues.
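The masking effect can be made concrete with a short calculation. Assuming the c-index form used in the worked examples, n12 / (n1 + n2 - n12), the best case for a narrower term is that every document containing it also contains the broader term, so that n12 = n2 and the index reduces to n2 / n1. The variable names below are illustrative.

```python
def c_index_ppt(n1, n2, n12):
    """C-index expressed in parts per thousand (ppt)."""
    return 1000 * n12 / (n1 + n2 - n12)

# Counts quoted in the article for scenario 2 (dog, canine).
n_dog, n_canine, n_both = 53_400_000, 3_570_000, 1_710_000

observed = c_index_ppt(n_dog, n_canine, n_both)

# Best case for the narrower term: every "canine" document also
# contains "dog", so n12 = n2 and the c-index reduces to n2 / n1.
ceiling = c_index_ppt(n_dog, n_canine, n_canine)

print(f"observed: {observed:.1f} ppt, ceiling: {ceiling:.1f} ppt")
```

Even in that best case the value stays below 70 ppt, so a low c-index for a broader and narrower pair says more about the difference in scope than about a lack of semantic relatedness.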
These observations illustrate that the blindfold calculation of c-index values without considering, among other things, the nature of the terms involved could lead to wrong conclusions. Certainly two synonyms are semantically related, even when they exhibit low co-citation frequency.
Note. This phenomenon can be explained in terms of Syntagmatic and Paradigmatic Association theory (4). Syntagmatic associations are terms that frequently occur together. Paradigmatic associations are terms with high semantic similarity. These types of associations allow us to understand why synonyms do not tend to co-occur. This has a lot to do with contextuality or lexical neighborhoods. I prefer to delay the discussion of this phenomenon, for now.
The reverse case is that of terms that co-occur very often but do not add any semantic value to documents. For instance, in English the terms "it" and "is" co-occur very often. As of 06/19/05 searches in Google for these terms return
- 1,610,000,000 results for k1 = it
- 2,460,000,000 results for k2 = is
- 1,370,000,000 results for k12 = it is
- c12-index = (1,370,000,000/(1,610,000,000 + 2,460,000,000 - 1,370,000,000)) = 0.507 or 507 ppt
While this is a high c-index value, the terms do not add any semantic value to documents. They are indeed stopwords. The lesson here is that before developing or interpreting Web analytics, tools (e.g., calculators, reference tables, etc.) or marketing programs based on c-indices or co-citation metrics, the search engine marketer must have a clear understanding of co-occurrence theory. This is precisely the purpose of this series of articles: to provide some guidelines on the proper use of co-occurrence theory and semantic concepts in a marketing environment.
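One simple guard against this pitfall is to flag stopword pairs before interpreting a c-index. The sketch below is only an illustration: the `STOPWORDS` set is a small hypothetical sample, and `assess_pair` is an assumed helper, not part of any existing tool.

```python
# Small illustrative stopword sample; a real list would be much longer.
STOPWORDS = {"it", "is", "the", "a", "of", "and"}

def c_index_ppt(n1, n2, n12):
    """C-index expressed in parts per thousand (ppt)."""
    return 1000 * n12 / (n1 + n2 - n12)

def assess_pair(k1, k2, n1, n2, n12):
    """Compute the c-index and flag pairs made up only of stopwords."""
    return {
        "pair": (k1, k2),
        "c_index_ppt": round(c_index_ppt(n1, n2, n12)),
        "stopword_pair": k1 in STOPWORDS and k2 in STOPWORDS,
    }

# Counts quoted in the article (Google, 06/19/05).
report = assess_pair("it", "is", 1_610_000_000, 2_460_000_000, 1_370_000_000)
print(report)
```

A high value with the stopword flag set is a reminder that the number alone carries no semantic information.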
In the next article I elaborate a bit on the mathematical basis of the co-occurrence measure herein introduced and on the fundamental differences between c-indices and other types of association measures.
Knowledge is Power.
Next: C-Indices and Other Measures of Associations
References
1. E. Garcia (2005). Keywords Co-Occurrence and Semantic Connectivity [Available at http://forums.searchenginewatch.com/showthread.php?t=48&page=1&pp=20]
2. Manfred Wettler and Reinhard Rapp. Computation of Word Associations Based on the Co-Occurrences of Words in Large Corpora [Available at http://citeseer.ist.psu.edu/cache/papers/cs/11567/http:zSzzSzwww.fask.uni-mainz.dezSzuserzSzrappzSzpaperszSzwvlc93zSzwvlc93.pdf/wettler93computation.pdf]
3. Answers.com. WordNet [Available at http://www.answers.com/canine&r=67]
4. Reinhard Rapp. The Computation of Word Associations: Comparing Syntagmatic and Paradigmatic Approaches [Available at http://acl.ldc.upenn.edu/C/C02/C02-1007.pdf]

