Home - Contacts - Terms -

Mi Islita

C-Indices and Measures of Associations

Understanding C-Indices, Association Measures and Co-Occurrence Patterns

Dr. E. Garcia
Mi Islita.com
Email | Last Update: 07/24/05

Article 2 of the series Keywords Co-Occurrence and Semantic Connectivity

Topics

Events and Probabilities

Co-Occurrence Patterns

Derivation of C-Indices

Working Expressions

Measures of Associations

C-Indices and Jaccard Coefficients

Assumptions and Limitations

References

Events and Probabilities

Events that co-occur are called non mutually exclusive while those that do not co-occur are called mutually exclusive (1, 2). For n number of non mutually exclusive events E1, E2,E3...En, their union probability is given by

Eq 1:  Co-Occurrence

where


For three non mutually exclusive events (n = 3)

Eq 2:  Co-Occurrence

Eq 3:  Co-Occurrence

Eq 4:  Co-Occurrence

and Eq 1 reduces to

Eq 5:  Co-Occurrence

However, if E3 does not co-occur with E1,

Eq 6:  Co-Occurrence

Similarly, if E3 does not co-occur with E1 and E2,

Eq 7:  Co-Occurrence

which describes the union probability of two non mutually exclusive events (n = 2).

Co-Occurence Patterns

Table 1 shows the corresponding Venn Diagrams for these co-occurrence scenarios. The clustered scenario consists of three cases of pairwise co-occurrence and one case where three events co-occur. The table also shows second- and first-order pairwise co-occurrences. Note that E1 and E3 do not co-occur, but are present while "in transit" with E2. This is an example of pairwise, second-order co-occurrence.

We can go from a second- to a third-order co-occurrence scenario by considering a fourth event E4 co-occurring with either E3 or E1, but not with E2. For example, if E4 co-occurs with E3 then E1 and E4 are transitively related, exhibing third-order co-occurrence. In general high-order co-occurrence scenarios are obtained by considerig additional events. The expressions in transit co-occurrence or transitive co-occurrence are normally used in reference to second- and high-order co-occurrences.

The last scenario shown in the table has been fairly studied in IR and is an example of pairwise, first-order co-occurrence.

Table 1. Types of Event Co-Occurrences
 Co-Occurrence

We can make some approximations based on the degree of overlap between events.

For example, three non mutually exclusive events in which two events are poorly co-occuring can be approximated to a second-order co-occurrence scenario. An example of this would be the case of three co-occurring terms in which two are poorly co-occurring synonyms. Since synonyms rarely co-occur in documents but are found in the same context, this explains why these tend to exhibit poor first-order co-occurrence. In general second and high-order co-occurrence models are recommended in disambiguation studies and with terms that have a synonymity or polysemy nature.

Kontostathis and Pottenger (Lehigh University) have discussed the role of second-, third-, and higher-order co-occurrence in the paper Detecting Patterns in the LSI Term-Term Matrix (3). The authors found that patterns of transitive co-occurrence can indeed show up in LSI term-term matrices. Their findings suggest that long-range patterns of transitive co-occurrence play an important role in topic disambiguation and contextual analysis. Table 2 shows several examples exhibing such type of transitive co-occurrence. For clarity, I have removed all labels.

Table 2. Examples of Long-Range Patterns Exhibing Transitive Co-Occurrence
 Long-Range Patterns

In Towards Linguistic Steganography (4), Bergmair uses one of these pattern to discuss word disambiguation, lexical semantics and contextual meaning.

While Venn Diagrams allow users to understand transitive co-occurrence, there are better tools and projects designed to visualize co-occurrence relationships. One of such projects is DOPE, Drug Ontology Project for Elsevier (5). This project is aimed at investigating the possibility of providing access to multiple information sources in the area of life science through a single interface, the DOPE browser. DOPE generates co-occurrence patterns that resemble nerve cell connections. The patterns facilitate the identification of term-term relationships as well as concept discovery.

Derivation of C-Indices

Now that we have discussed the probabilistic nature of non mutually exclusive events and the transitive nature of high-order co-occurrences, we need to find some practical uses for these concepts. In order to work within a reference environment we need to normalize the co-occurrences.

This can be accomplished by taking intersection/union ratios; i.e., by dividing co-occurrence probabilities by the union probability (Eq 1),

Eq 8(a, b, c):   Co-Occurrence

Thus, for three non mutually exclusive events the following co-occurrence indices ("c-indices") are obtained

Eq 9:  Co-Occurrence

Eq 10:  Co-Occurrence

where the c-indices are given by

Eq 11(a, b, c, d):  Co-Occurrence

However, if E3 does not co-occur with E1, dividing Eq 6 by Eq 1 gives

Eq 12(a, b):  Co-Occurrence

Similarly, if E3 does not co-occur with E1 and E2, dividing Eq 7 by Eq 1 gives

Eq 13:  Co-Occurrence

Note from Eq 9 the additive property of pairwise co-occurrences,

c12 + c23 + c13
c12 + c23

Working Expressions

If we define an event E as the set of documents retrieved by querying a search engine for a term k, the following working expressions can be proposed

Eq 14: Co-Occurrence

Eq 15: Co-Occurrence

Eq 16: Co-Occurrence

Eq 17: Co-Occurrence

Eq 18: Co-Occurrence

Eq 19: Co-Occurrence

Eq 20: Co-Occurrence

where

n1 = set of documents retrieved by querying k1
n2 = set of documents retrieved by querying k2
n3 = set of documents retrieved by querying k3
n12 = set of documents retrieved by querying k12
n23 = set of documents retrieved by querying k23
n123 = set of documents retrieved by querying k123


Before proceeding any further I would like to revisit some measures of associations used in the IR literature.

Measures of Associations

According to van Rijsbergen (6) similarity is a measure of the association or relatedness between objects characterised by discrete-state attributes. Similarity scores quantify the likeness between objects assuming these can be grouped into clusters with similar attributes.

Some measures of associations are:

The first three association measures have been used extensively with term weight and term vector models (7). The last measure, the Jaccard Coefficient (JC),

  1. characterizes the association strengths of paired data (6).
  2. has been used as a normalized correlation factor ci, j between pairs of terms (or stems), ki and kj (8).
  3. in molecular similarity studies, has been referred to as the Tanimoto Coefficient (9).

Although similar in form (see Eq 13 and Eq 20), c-indices are not Jaccard Coefficients. In the next section I discuss the main differences between these two metrics.

C-Indices and Jaccard Coefficients

The fact that c-indices and Jaccard Coefficients are of the same form may lead some to assume that these are identical measures. This is not the case since

A c-index as defined by Eq 13 and Eq 20 reduces to a Jaccard Coefficient only when single-order co-occurrences are computed and when results are not affected by ordering; i.e., when c12 = c21. Such restrictive scenario is very rare with query sensitive co-occurrence data extracted from search engines since

Precisely, these are the very same reasons of why sometimes searches conducted in FINDALL mode -in which order and proximity should not matter- produce different set of results. Far more different set of results are obtained when the searches are conducted in EXACT mode, where order and proximity does matter.

Unlike Jaccard coefficients, c-indices can be used to examine query-insensitive and query sensitive co-occurrence data. A query-insensitive scenario would be the case of co-occurrence data extracted from a document or from an IR system that is not affected by term transposition.

A query-sensitive scenario would be the case of co-occurrence data extracted from an ever changing corpus or universe. This makes the proposed c-index metric suitable for discovering studies in which co-occurence data must be extracted from search engine result pages (SERPs).

Assumptions and Limitations

C-indices can be used to discover query-sensitive co-occurrence data from commercial search engines. This means that results would be as good as permitted by the queried system.

If the target system is affected by precision and recall issues or if we are dealing with a concept-driven system -a machine that matches queries to concepts rather than to terms- the assumption that

Eq 23: set of documents retrieved = set of documents containing the queried term k

introduces an experimental error. The magnitude of this error could be minimized by querying a system of acceptable precision/recall performance and by instructing the system to return only documents with the queried terms. This could be accomplished with advanced search commands and when required by conducting all searches in EXACT mode.

Next: Targeting Documents and Terms Through Co-Occurrence Data

Prev: Keywords Co-Occurrence and Semantic Connectivity

References
  1. Rozanov, Y.A., Probability Theory: A Concise Course, Chapter 4; Dover, 1977.
  2. Kurtz, Max; Handbook of Applied Mathematics for Engineers and Scientists, Chapter 9; McGraw Hill, 1991.
  3. Detecting Patterns in the LSI Term-Term Matrix, April Kontostathis and William M. Pottenger.
  4. Towards Linguistic Steganography, Richard Bergmair (2005).
  5. DOPE, Drug Ontology Project for Elsevier
  6. Information Retrieval; C.J. van Rijsbergen, Chapter 3, Second Edition, Online Version, London: Butterworths (1979).
  7. The Term Count Model, E. Garcia
  8. Similarity Searching Overview
  9. Baeza-Yates, R., Ribeiro-Neto, B; Modern Information Retrieval, Chapters 2 and 5; Addison Wesley, 1999.

Thank you for using this site.
Status of the Current Document 
W3C CSS Validation  W3C XHTML Validation
Copyright © 2006 Mi Islita.com -