Home - Contacts - Terms -

Mi Islita

C-Indices and Measures of Associations

Understanding C-Indices, Association Measures and Co-Occurrence Patterns

Dr. E. Garcia
Mi Islita.com
Email | Last Update: 06/24/08

Article 2 of the series Keywords Co-Occurrence and Semantic Connectivity

Topics

Events and Probabilities

Co-Occurrence Patterns

Derivation of C-Indices

Working Expressions

Measures of Associations

C-Indices and Jaccard's Coefficients

Assumptions and Limitations

References

Events and Probabilities

Co-occurring events are called non mutually exclusive and those that do not co-occur are called mutually exclusive (1, 2). For n number of non mutually exclusive events E1, E2,E3...En, their union probability is given by

Eq 1:  Co-Occurrence

where


For three non mutually exclusive events (n = 3)

Eq 2:  Co-Occurrence

Eq 3:  Co-Occurrence

Eq 4:  Co-Occurrence

and Eq 1 reduces to

Eq 5:  Co-Occurrence

However, if E3 does not co-occur with E1,

Eq 6:  Co-Occurrence

Likewise, if E3 does not co-occur with E1 and E2,

Eq 7:  Co-Occurrence

which describes the union probability of two non mutually exclusive events (n = 2).

Co-Occurence Patterns

Table 1 shows the corresponding Venn Diagrams for these scenarios. The clustered scenario consists of three cases of pairwise co-occurrence and one case wherein three events co-occur. The table also shows second- and first-order pairwise co-occurrences. For instance in the in-transit scenario E1 and E3 do not co-occur, but are present while "in transit" with E2. This is an example of second-order co-occurrence.

We can go from a second- to a third-order co-occurrence scenario by considering a fourth event E4, not shown in the table, co-occurring with either E3 or E1, but not with E2. For example, if E4 co-occurs with E3 then E1 and E4 are transitively related, exhibing third-order co-occurrence. The expressions in transit co-occurrence or transitive co-occurrence are normally used in reference to second- and high-order co-occurrences.

The first-order scenario shown in the table has been fairly studied in IR and is an example of first-order co-occurrence.

Table 1. Types of Event Co-Occurrences
 Co-Occurrence

We can make some approximations based on the degree of overlapping between events.

For example, three non mutually exclusive events in which two events are poorly co-occuring can be approximated as a second-order co-occurrence scenario. An example of this would be the case of three co-occurring terms wherein two are poorly co-occurring synonyms. Since synonyms rarely co-occur in documents, but are found in similar contexts, this might explains why they tend to exhibit poor first-order co-occurrence. In general second and high-order co-occurrence models are recommended in disambiguation studies.

Kontostathis and Pottenger (Lehigh University) have discussed the role of second-, third-, and higher-order co-occurrence in the paper Detecting Patterns in the LSI Term-Term Matrix (3). The authors found that patterns of transitive co-occurrence can indeed be present in LSI term-term matrices. Their findings suggest that long-range patterns of transitive co-occurrence might play an important role in topic disambiguation and contextual analysis. Table 2 shows several examples exhibing such type of transitive co-occurrence. For clarity, we have removed all labels.

Table 2. Examples of Long-Range Patterns Exhibing Transitive Co-Occurrence
 Long-Range Patterns

In Towards Linguistic Steganography (4), Bergmair uses one of these pattern to discuss word disambiguation, lexical semantics and contextual meaning.

Whereas basic Venn Diagrams are acceptable tools for visualizing transitive co-occurrence, better tools and projects do exist. One of such projects is DOPE, Drug Ontology Project for Elsevier (5). This project is aimed at investigating the possibility of providing access to multiple information sources in the area of life science through a single interface, the DOPE browser. DOPE generates co-occurrence patterns that resemble nerve cell connections. The patterns facilitate the identification of term-term relationships as well as concept discovery.

Derivation of C-Indices

So far we have discussed the probabilistic nature of non mutually exclusive events, the transitive nature of high-order co-occurrences, and few tools or vehicles for visualizing these. We now need a normalization framework and find some practical uses for these concepts.

Taking intersection/union ratios; i.e., dividing co-occurrence probabilities by the union probability (Eq 1),

Eq 8(a, b, c):   Co-Occurrence

Thus, for three non mutually exclusive events the following intersection/union ratios, herein termed co-occurrence indices ("c-indices"), are obtained

Eq 9:  Co-Occurrence

Eq 10:  Co-Occurrence

where the c-indices are given by

Eq 11(a, b, c, d):  Co-Occurrence

However, if E3 does not co-occur with E1, dividing Eq 6 by Eq 1 gives

Eq 12(a, b):  Co-Occurrence

Similarly, if E3 does not co-occur with E1 and E2, dividing Eq 7 by Eq 1 gives

Eq 13:  Co-Occurrence

Note from Eq 9 the additive property of pairwise co-occurrences,

c12 + c23 + c13
c12 + c23

Working Expressions

If we define an event E as the set of documents retrieved by querying a search engine for a given term(s), the following working expressions can be proposed

Eq 14: Co-Occurrence

Eq 15: Co-Occurrence

Eq 16: Co-Occurrence

Eq 17: Co-Occurrence

Eq 18: Co-Occurrence

Eq 19: Co-Occurrence

Eq 20: Co-Occurrence

where

n1 = set of documents retrieved by querying k1
n2 = set of documents retrieved by querying k2
n3 = set of documents retrieved by querying k3
n12 = set of documents retrieved by querying k12
n23 = set of documents retrieved by querying k23
n123 = set of documents retrieved by querying k123


Before proceeding any further I would like to revisit some measures of associations used in the IR literature.

Measures of Associations

According to van Rijsbergen (6) similarity is a measure of the association or relatedness between objects characterised by discrete-state attributes. Similarity scores quantify the likeness between objects assuming these can be grouped into clusters with similar attributes.

Some measures of associations are:

The first three association measures have been used extensively with term weight and term vector models (7). The last measure, the Jaccard's Coefficient (JC),

  1. characterizes the association strengths of paired data (6).
  2. has been used as a normalized correlation factor ci, j between pairs of terms (or stems), ki and kj (8).
  3. in molecular similarity studies, has been referred to as the Tanimoto Coefficient (9).

C-indices are not Jaccard's Coefficients. In the next section we discuss their main differences.

C-Indices and Jaccard's Coefficients

In spite of the fact that for a two-term scenario (Eq 20) a c-index resembles a Jaccard's Coefficient, these are not equivalent measures.

Regarding the last point, usually data extracted from online repositories (e.g.,search engine results) is query-dependent. Therefore,
 Inequality
Consequently, event transposition does matter. The order in which terms are declared in a query affects the semantics (meaning) of the information need sought and, therefore, the set of retrieved results.

To illustrate, on 07/24/05 a search in Google for this work returned 812,000,000 results while a search for work this returned 651,000,000 results (try also with Google game and game Google). True that some search engines tend to return similar number of results with queries in which term transposition affects semantics. However, an inspection of those results reveal drastic differences. This is the case of querying Google for junior college and for college junior. The lesson here is that extracting Jaccard's Coefficients -or any association measure, for that matter- from query-dependent data while overlooking at term transposition is a highly questionable practice.

Eqs 13 and 20 resemble a Jaccard's Coefficient only when single-order co-occurrences are computed and when results are not affected by ordering; i.e., when c12 = c21 --and even so this does not make a c-index a similarity measure. Furthermore, such restrictive scenario (c12 = c21) is very rare with query sensitive co-occurrence data extracted from a search system since

Precisely, these are the very same reasons of why sometimes searches conducted in FINDALL mode -in which order and proximity should not matter- produce different set of results. Far more different set of results are obtained when the searches are conducted in EXACT mode, where order and proximity does matter.

Unlike Jaccard coefficients, c-indices can be used to examine query-insensitive and query sensitive co-occurrence data. A query-insensitive scenario would be the case of co-occurrence data extracted from a document or from an IR system that is not affected by term transposition.

A query-sensitive scenario would be the case of co-occurrence data extracted from an ever changing corpus or universe. This makes the proposed c-index metric suitable for discovering studies in which co-occurence data must be extracted from search engine result pages (SERPs).

Assumptions and Limitations

C-indices can be used to discover query-sensitive co-occurrence data from experimental retrieval systems and commercial search engines. The results can be as good as permitted by the queried system. If a system reformulates queries and answer sets in the background or returns partial answer sets this can impact co-occurrence estimates and the end user must be aware of that.

If the target system is affected by precision and recall issues or if we are dealing with a concept-driven system -a machine that matches queries to concepts rather than to terms- the assumption that

Eq 23: set of documents retrieved = set of documents containing the queried term k

introduces an experimental error. The magnitude of this error could be minimized by querying a system at an acceptable precision/recall performance level and by instructing the system to return only documents with the queried terms. This could be accomplished with advanced search commands, and when required, by conducting searches in EXACT mode.

Next: Targeting Documents and Terms Through Co-Occurrence Data

Prev: Keywords Co-Occurrence and Semantic Connectivity

References
  1. Rozanov, Y.A., Probability Theory: A Concise Course, Chapter 4; Dover, 1977.
  2. Kurtz, Max; Handbook of Applied Mathematics for Engineers and Scientists, Chapter 9; McGraw Hill, 1991.
  3. Detecting Patterns in the LSI Term-Term Matrix, April Kontostathis and William M. Pottenger.
  4. Towards Linguistic Steganography, Richard Bergmair (2004).
  5. DOPE, Drug Ontology Project for Elsevier
  6. Information Retrieval; C.J. van Rijsbergen, Chapter 3, Second Edition, Online Version, London: Butterworths (1979).
  7. The Term Count Model, E. Garcia
  8. Similarity Searching Overview
  9. Baeza-Yates, R., Ribeiro-Neto, B; Modern Information Retrieval, Chapters 2 and 5; Addison Wesley, 1999.

Thank you for using this site.
Status of the Current Document 
W3C CSS Validation  W3C XHTML Validation
Copyright © 2006 Mi Islita.com -