C-Indices and Measures of Associations
Understanding C-Indices, Association Measures and Co-Occurrence Patterns
Dr. E. Garcia
Mi Islita.com
Email | Last Update: 07/24/05
Article 2 of the series Keywords Co-Occurrence and Semantic Connectivity
Topics
Events and Probabilities
Co-Occurrence Patterns
Derivation of C-Indices
Working Expressions
Measures of Associations
C-Indices and Jaccard Coefficients
Assumptions and Limitations
References
Events and Probabilities
Events that co-occur are called non mutually exclusive while those that do not co-occur are called mutually exclusive (1, 2). For n number of non mutually exclusive events E1, E2,E3...En, their union probability is given by
Eq 1: 
where
- P1 is the ocurrence probability of n events.
- P2 is the co-occurrence probability of two events.
- P3 is the co-occurrence probability of three events.
- P4 is the co-occurrence probability of four events.
'
'
' - Pn is the co-occurrence probability of n events.
For three non mutually exclusive events (n = 3)
Eq 2:
Eq 3:
Eq 4:
and Eq 1 reduces to
Eq 5:
However, if E3 does not co-occur with E1,
Eq 6:
Similarly, if E3 does not co-occur with E1 and E2,
Eq 7:
which describes the union probability of two non mutually exclusive events (n = 2).
Co-Occurence Patterns
Table 1 shows the corresponding Venn Diagrams for these co-occurrence scenarios. The clustered scenario consists of three cases of pairwise co-occurrence and one case where three events co-occur. The table also shows second- and first-order pairwise co-occurrences. Note that E1 and E3 do not co-occur, but are present while "in transit" with E2. This is an example of pairwise, second-order co-occurrence.
We can go from a second- to a third-order co-occurrence scenario by considering a fourth event E4 co-occurring with either E3 or E1, but not with E2. For example, if E4 co-occurs with E3 then E1 and E4 are transitively related, exhibing third-order co-occurrence. In general high-order co-occurrence scenarios are obtained by considerig additional events. The expressions in transit co-occurrence or transitive co-occurrence are normally used in reference to second- and high-order co-occurrences.
The last scenario shown in the table has been fairly studied in IR and is an example of pairwise, first-order co-occurrence.
![]() |
We can make some approximations based on the degree of overlap between events.
For example, three non mutually exclusive events in which two events are poorly co-occuring can be approximated to a second-order co-occurrence scenario. An example of this would be the case of three co-occurring terms in which two are poorly co-occurring synonyms. Since synonyms rarely co-occur in documents but are found in the same context, this explains why these tend to exhibit poor first-order co-occurrence. In general second and high-order co-occurrence models are recommended in disambiguation studies and with terms that have a synonymity or polysemy nature.
Kontostathis and Pottenger (Lehigh University) have discussed the role of second-, third-, and higher-order co-occurrence in the paper Detecting Patterns in the LSI Term-Term Matrix (3). The authors found that patterns of transitive co-occurrence can indeed show up in LSI term-term matrices. Their findings suggest that long-range patterns of transitive co-occurrence play an important role in topic disambiguation and contextual analysis. Table 2 shows several examples exhibing such type of transitive co-occurrence. For clarity, I have removed all labels.
![]() |
In Towards Linguistic Steganography (4), Bergmair uses one of these pattern to discuss word disambiguation, lexical semantics and contextual meaning.
While Venn Diagrams allow users to understand transitive co-occurrence, there are better tools and projects designed to visualize co-occurrence relationships. One of such projects is DOPE, Drug Ontology Project for Elsevier (5). This project is aimed at investigating the possibility of providing access to multiple information sources in the area of life science through a single interface, the DOPE browser. DOPE generates co-occurrence patterns that resemble nerve cell connections. The patterns facilitate the identification of term-term relationships as well as concept discovery.
Derivation of C-Indices
Now that we have discussed the probabilistic nature of non mutually exclusive events and the transitive nature of high-order co-occurrences, we need to find some practical uses for these concepts. In order to work within a reference environment we need to normalize the co-occurrences.
This can be accomplished by taking intersection/union ratios; i.e., by dividing co-occurrence probabilities by the union probability (Eq 1),
Eq 8(a, b, c):
Thus, for three non mutually exclusive events the following co-occurrence indices ("c-indices") are obtained
Eq 9:
Eq 10:
where the c-indices are given by
Eq 11(a, b, c, d):
However, if E3 does not co-occur with E1, dividing Eq 6 by Eq 1 gives
Eq 12(a, b):
Similarly, if E3 does not co-occur with E1 and E2, dividing Eq 7 by Eq 1 gives
Eq 13:
Note from Eq 9 the additive property of pairwise co-occurrences,
c12 + c23 + c13
c12 + c23
Working Expressions
If we define an event E as the set of documents retrieved by querying a search engine for a term k, the following working expressions can be proposed
Eq 14:
Eq 15:
Eq 16:
Eq 17:
Eq 18:
Eq 19:
Eq 20:
where
n1 = set of documents retrieved by querying k1
n2 = set of documents retrieved by querying k2
n3 = set of documents retrieved by querying k3
n12 = set of documents retrieved by querying k12
n23 = set of documents retrieved by querying k23
n123 = set of documents retrieved by querying k123
Before proceeding any further I would like to revisit some measures of associations used in the IR literature.
Measures of Associations
According to van Rijsbergen (6) similarity is a measure of the association or relatedness between objects characterised by discrete-state attributes. Similarity scores quantify the likeness between objects assuming these can be grouped into clusters with similar attributes.
Some measures of associations are:
- Inner Product =
- Dice's Coefficient =
- Cosine Coefficient =
- Jaccard Coefficient =
The first three association measures have been used extensively with term weight and term vector models (7). The last measure, the Jaccard Coefficient (JC),
- characterizes the association strengths of paired data (6).
- has been used as a normalized correlation factor ci, j between pairs of terms (or stems), ki and kj (8).
- in molecular similarity studies, has been referred to as the Tanimoto Coefficient (9).
Although similar in form (see Eq 13 and Eq 20), c-indices are not Jaccard Coefficients. In the next section I discuss the main differences between these two metrics.
C-Indices and Jaccard Coefficients
The fact that c-indices and Jaccard Coefficients are of the same form may lead some to assume that these are identical measures. This is not the case since
- Jaccard Coefficients are extracted from pairwise co-occurrences, while c-indices can be computed for any number of non mutually exclusive events (pairwise, second-order and high-order co-occurrence).
- Jaccard Coefficients, extracted for instance from two-dimensional matrices, are computed assuming that event transposition does not matter; i.e.
- often, data extracted from online repositories (e.g.,search engine results) is query-dependent,
Consequently, event transposition does matter. This means that the sequence (ordering) used to search for the terms often affects the semantics (meaning) of the submitted query and, therefore, the set of retrieved results.
To illustrate, on 07/24/05 a search in Google for this work returned 812,000,000 results while a search for work this returned 651,000,000 results (try also with Google game and game Google). True that some search engines tend to return similar number of results with queries in which term transposition affects semantics. However, an inspection of those results reveal drastic differences. This is the case of querying Google for junior college and for college junior. The lesson here is that extracting Jaccard's Coefficients -or any association measure, for that matter- from query-dependent data while overlooking at term transposition is a highly questionable practice.
A c-index as defined by Eq 13 and Eq 20 reduces to a Jaccard Coefficient only when single-order co-occurrences are computed and when results are not affected by ordering; i.e., when c12 = c21. Such restrictive scenario is very rare with query sensitive co-occurrence data extracted from search engines since
- commercial systems are constantly updating their databases, adding and removing documents.
- ordering affects the semantics (meaning) of a query.
- the meaning of a query affects the set of retrieved results.
Precisely, these are the very same reasons of why sometimes searches conducted in FINDALL mode -in which order and proximity should not matter- produce different set of results. Far more different set of results are obtained when the searches are conducted in EXACT mode, where order and proximity does matter.
Unlike Jaccard coefficients, c-indices can be used to examine query-insensitive and query sensitive co-occurrence data. A query-insensitive scenario would be the case of co-occurrence data extracted from a document or from an IR system that is not affected by term transposition.
A query-sensitive scenario would be the case of co-occurrence data extracted from an ever changing corpus or universe. This makes the proposed c-index metric suitable for discovering studies in which co-occurence data must be extracted from search engine result pages (SERPs).
Assumptions and Limitations
C-indices can be used to discover query-sensitive co-occurrence data from commercial search engines. This means that results would be as good as permitted by the queried system.
If the target system is affected by precision and recall issues or if we are dealing with a concept-driven system -a machine that matches queries to concepts rather than to terms- the assumption that
Eq 23: set of documents retrieved = set of documents containing the queried term k
introduces an experimental error. The magnitude of this error could be minimized by querying a system of acceptable precision/recall performance and by instructing the system to return only documents with the queried terms. This could be accomplished with advanced search commands and when required by conducting all searches in EXACT mode.
Next: Targeting Documents and Terms Through Co-Occurrence Data
Prev: Keywords Co-Occurrence and Semantic Connectivity
References
- Rozanov, Y.A., Probability Theory: A Concise Course, Chapter 4; Dover, 1977.
- Kurtz, Max; Handbook of Applied Mathematics for Engineers and Scientists, Chapter 9; McGraw Hill, 1991.
- Detecting Patterns in the LSI Term-Term Matrix, April Kontostathis and William M. Pottenger.
- Towards Linguistic Steganography, Richard Bergmair (2005).
- DOPE, Drug Ontology Project for Elsevier
- Information Retrieval; C.J. van Rijsbergen, Chapter 3, Second Edition, Online Version, London: Butterworths (1979).
- The Term Count Model, E. Garcia
- Similarity Searching Overview
- Baeza-Yates, R., Ribeiro-Neto, B; Modern Information Retrieval, Chapters 2 and 5; Addison Wesley, 1999.



