C-Indices and Measures of Associations
Understanding C-Indices, Association Measures and Co-Occurrence Patterns
Dr. E. Garcia
Mi Islita.com
Email | Last Update: 06/24/08
Article 2 of the series Keywords Co-Occurrence and Semantic Connectivity
Topics
Events and Probabilities
Co-Occurrence Patterns
Derivation of C-Indices
Working Expressions
Measures of Associations
C-Indices and Jaccard's Coefficients
Assumptions and Limitations
References
Events and Probabilities
Co-occurring events are called non mutually exclusive and those that do not co-occur are called mutually exclusive (1, 2). For n number of non mutually exclusive events E1, E2,E3...En, their union probability is given by
Eq 1: 
where
- P1 is the ocurrence probability of n events.
- P2 is the co-occurrence probability of two events.
- P3 is the co-occurrence probability of three events.
- P4 is the co-occurrence probability of four events.
'
'
' - Pn is the co-occurrence probability of n events.
For three non mutually exclusive events (n = 3)
Eq 2:
Eq 3:
Eq 4:
and Eq 1 reduces to
Eq 5:
However, if E3 does not co-occur with E1,
Eq 6:
Likewise, if E3 does not co-occur with E1 and E2,
Eq 7:
which describes the union probability of two non mutually exclusive events (n = 2).
Co-Occurence Patterns
Table 1 shows the corresponding Venn Diagrams for these scenarios. The clustered scenario consists of three cases of pairwise co-occurrence and one case wherein three events co-occur. The table also shows second- and first-order pairwise co-occurrences. For instance in the in-transit scenario E1 and E3 do not co-occur, but are present while "in transit" with E2. This is an example of second-order co-occurrence.
We can go from a second- to a third-order co-occurrence scenario by considering a fourth event E4, not shown in the table, co-occurring with either E3 or E1, but not with E2. For example, if E4 co-occurs with E3 then E1 and E4 are transitively related, exhibing third-order co-occurrence. The expressions in transit co-occurrence or transitive co-occurrence are normally used in reference to second- and high-order co-occurrences.
The first-order scenario shown in the table has been fairly studied in IR and is an example of first-order co-occurrence.
![]() |
We can make some approximations based on the degree of overlapping between events.
For example, three non mutually exclusive events in which two events are poorly co-occuring can be approximated as a second-order co-occurrence scenario. An example of this would be the case of three co-occurring terms wherein two are poorly co-occurring synonyms. Since synonyms rarely co-occur in documents, but are found in similar contexts, this might explains why they tend to exhibit poor first-order co-occurrence. In general second and high-order co-occurrence models are recommended in disambiguation studies.
Kontostathis and Pottenger (Lehigh University) have discussed the role of second-, third-, and higher-order co-occurrence in the paper Detecting Patterns in the LSI Term-Term Matrix (3). The authors found that patterns of transitive co-occurrence can indeed be present in LSI term-term matrices. Their findings suggest that long-range patterns of transitive co-occurrence might play an important role in topic disambiguation and contextual analysis. Table 2 shows several examples exhibing such type of transitive co-occurrence. For clarity, we have removed all labels.
![]() |
In Towards Linguistic Steganography (4), Bergmair uses one of these pattern to discuss word disambiguation, lexical semantics and contextual meaning.
Whereas basic Venn Diagrams are acceptable tools for visualizing transitive co-occurrence, better tools and projects do exist. One of such projects is DOPE, Drug Ontology Project for Elsevier (5). This project is aimed at investigating the possibility of providing access to multiple information sources in the area of life science through a single interface, the DOPE browser. DOPE generates co-occurrence patterns that resemble nerve cell connections. The patterns facilitate the identification of term-term relationships as well as concept discovery.
Derivation of C-Indices
So far we have discussed the probabilistic nature of non mutually exclusive events, the transitive nature of high-order co-occurrences, and few tools or vehicles for visualizing these. We now need a normalization framework and find some practical uses for these concepts.
Taking intersection/union ratios; i.e., dividing co-occurrence probabilities by the union probability (Eq 1),
Eq 8(a, b, c):
Thus, for three non mutually exclusive events the following intersection/union ratios, herein termed co-occurrence indices ("c-indices"), are obtained
Eq 9:
Eq 10:
where the c-indices are given by
Eq 11(a, b, c, d):
However, if E3 does not co-occur with E1, dividing Eq 6 by Eq 1 gives
Eq 12(a, b):
Similarly, if E3 does not co-occur with E1 and E2, dividing Eq 7 by Eq 1 gives
Eq 13:
Note from Eq 9 the additive property of pairwise co-occurrences,
c12 + c23 + c13
c12 + c23
Working Expressions
If we define an event E as the set of documents retrieved by querying a search engine for a given term(s), the following working expressions can be proposed
Eq 14:
Eq 15:
Eq 16:
Eq 17:
Eq 18:
Eq 19:
Eq 20:
where
n1 = set of documents retrieved by querying k1
n2 = set of documents retrieved by querying k2
n3 = set of documents retrieved by querying k3
n12 = set of documents retrieved by querying k12
n23 = set of documents retrieved by querying k23
n123 = set of documents retrieved by querying k123
Before proceeding any further I would like to revisit some measures of associations used in the IR literature.
Measures of Associations
According to van Rijsbergen (6) similarity is a measure of the association or relatedness between objects characterised by discrete-state attributes. Similarity scores quantify the likeness between objects assuming these can be grouped into clusters with similar attributes.
Some measures of associations are:
- Inner Product =
- Dice's Coefficient =
- Cosine Coefficient =
- Jaccard's Coefficient =
The first three association measures have been used extensively with term weight and term vector models (7). The last measure, the Jaccard's Coefficient (JC),
- characterizes the association strengths of paired data (6).
- has been used as a normalized correlation factor ci, j between pairs of terms (or stems), ki and kj (8).
- in molecular similarity studies, has been referred to as the Tanimoto Coefficient (9).
C-indices are not Jaccard's Coefficients. In the next section we discuss their main differences.
C-Indices and Jaccard's Coefficients
In spite of the fact that for a two-term scenario (Eq 20) a c-index resembles a Jaccard's Coefficient, these are not equivalent measures.
- Jaccard's Coefficients are computed for two non mutually exclusive events. By contrast, c-indices can be computed for any number of non mutually exclusive events.
- Jaccard's Coefficients are similarity measures, whereas c-indices are not.
- Jaccard's Coefficients extracted for instance from two-dimensional matrices, are computed assuming that event transposition does not matter (event symmetry); i.e.
Regarding the last point, usually data extracted from online repositories (e.g.,search engine results) is query-dependent. Therefore,
Consequently, event transposition does matter. The order in which terms are declared in a query affects the semantics (meaning) of the information need sought and, therefore, the set of retrieved results.
To illustrate, on 07/24/05 a search in Google for this work returned 812,000,000 results while a search for work this returned 651,000,000 results (try also with Google game and game Google). True that some search engines tend to return similar number of results with queries in which term transposition affects semantics. However, an inspection of those results reveal drastic differences. This is the case of querying Google for junior college and for college junior. The lesson here is that extracting Jaccard's Coefficients -or any association measure, for that matter- from query-dependent data while overlooking at term transposition is a highly questionable practice.
Eqs 13 and 20 resemble a Jaccard's Coefficient only when single-order co-occurrences are computed and when results are not affected by ordering; i.e., when c12 = c21 --and even so this does not make a c-index a similarity measure. Furthermore, such restrictive scenario (c12 = c21) is very rare with query sensitive co-occurrence data extracted from a search system since
- commercial systems are constantly updating their databases, adding and removing documents.
- word order affects the semantics (meaning) of a query.
- the meaning of a query affects the set of retrieved results.
Precisely, these are the very same reasons of why sometimes searches conducted in FINDALL mode -in which order and proximity should not matter- produce different set of results. Far more different set of results are obtained when the searches are conducted in EXACT mode, where order and proximity does matter.
Unlike Jaccard coefficients, c-indices can be used to examine query-insensitive and query sensitive co-occurrence data. A query-insensitive scenario would be the case of co-occurrence data extracted from a document or from an IR system that is not affected by term transposition.
A query-sensitive scenario would be the case of co-occurrence data extracted from an ever changing corpus or universe. This makes the proposed c-index metric suitable for discovering studies in which co-occurence data must be extracted from search engine result pages (SERPs).
Assumptions and Limitations
C-indices can be used to discover query-sensitive co-occurrence data from experimental retrieval systems and commercial search engines. The results can be as good as permitted by the queried system. If a system reformulates queries and answer sets in the background or returns partial answer sets this can impact co-occurrence estimates and the end user must be aware of that.
If the target system is affected by precision and recall issues or if we are dealing with a concept-driven system -a machine that matches queries to concepts rather than to terms- the assumption that
Eq 23: set of documents retrieved = set of documents containing the queried term k
introduces an experimental error. The magnitude of this error could be minimized by querying a system at an acceptable precision/recall performance level and by instructing the system to return only documents with the queried terms. This could be accomplished with advanced search commands, and when required, by conducting searches in EXACT mode.
Next: Targeting Documents and Terms Through Co-Occurrence Data
Prev: Keywords Co-Occurrence and Semantic Connectivity
References
- Rozanov, Y.A., Probability Theory: A Concise Course, Chapter 4; Dover, 1977.
- Kurtz, Max; Handbook of Applied Mathematics for Engineers and Scientists, Chapter 9; McGraw Hill, 1991.
- Detecting Patterns in the LSI Term-Term Matrix, April Kontostathis and William M. Pottenger.
- Towards Linguistic Steganography, Richard Bergmair (2004).
- DOPE, Drug Ontology Project for Elsevier
- Information Retrieval; C.J. van Rijsbergen, Chapter 3, Second Edition, Online Version, London: Butterworths (1979).
- The Term Count Model, E. Garcia
- Similarity Searching Overview
- Baeza-Yates, R., Ribeiro-Neto, B; Modern Information Retrieval, Chapters 2 and 5; Addison Wesley, 1999.



