Mi Islita

On-Topic Analysis: Online Discovery of On-Topic Terms

"On-Topic Analysis is a procedure for identifying top, broader, narrower and optimum terms. These terms are then used to improve the semantics of themes, focused document, and keyword-driven marketing campaigns."

Dr. E. Garcia
Mi Islita.com
Last Update: 05/13/05 | Written on: 10/06/04

Topics

Abstract

Introduction

Background

Relevance Feedback, Terms Clustering, and Local Feedback | LCA

Passage Segmentation | Thesaurus Data Structures | Keyword Lists

Procedure

Query Conditions | Sampling | Data Acquisition | Data Processing | Results

Implementation | Top Terms | Terms Extraction | N and Pi values

Broader Terms | Narrower Terms | Off-Topic Terms | Comparative Work

Discussion

Applications | Optimum Terms | Specific Applications

Patent Resources | Electronic Resources | Implications

Visualization of Term Distances | Disambiguation | Limitations

Conclusion

Acknowledgements

Appendix

References

Abstract

This experiment discusses a procedure for the online discovery of on-topic terms. Discovery is based on topic analysis and co-occurrence theory. It is demonstrated that on-topic analysis is a valuable tool for enabling users to enhance the semantics of theme sites and concept-focused documents. Specific applications for search engine marketing strategies and information retrieval systems are presented.

1. INTRODUCTION

Single-theme websites consist of a main concept and topic-specific pages. These pages tend to point to other subject-specific pages. The thematic architecture of these sites is

Theme > Topics > Subjects,...

Larger sites tend to have multiple themes, each with its corresponding hierarchical structure. In general, theme sites are designed in such a way that users are presented with more focused documents as they browse "from top to bottom". The structure of a site consisting of one theme, a few broader topics and some specific subjects can be represented by a data structure of the form

Top > Broader > Narrower...

where Top are theme terms, Broader are topic terms and Narrower are subject-specific terms, all assumed to be on-topic terms. On-topic terms are therefore key terms associated with a document or architecture of documents. For example, if the theme of a site is mexican food, topic-focused pages could be about mexican recipes, mexican cooking, mexican ingredients, etc., while more specific subject pages could be about things like tortillas and burritos. A portion of this site can be represented as in Figure 1 and by the sequence

mexican food > mexican recipes > tortillas, burritos...

Figure 1. Partial structure of a theme site.


The figure does not exclude the possibility that a given subject could be reached through different topics or that a subject-specific page could lead to narrower subject-specific pages. Evidently, a theme structure restricts word usage throughout pages since candidate words must qualify locally and globally; i.e., broader and narrower terms must be on-topic without blurring the theme. Thus, discovery and selection of on-topic terms is a non-trivial problem.

In this experiment, we present a Web-based procedure for the online discovery of on-topic terms. Our approach is based on term occurrence and co-occurrence information and is inspired by discovery procedures used by well-established IR techniques.

2. BACKGROUND

There are several IR techniques for discovering terms (1 - 9). However, not all of these techniques are accessible to Web users or are suitable for online discovery of terms. Some techniques perform well under controlled IR lab conditions but not on the Web, a commercial environment with all sorts of vested interests and content alliances.

2.1 Relevance Feedback, Terms Clustering, and Local Feedback

Relevance feedback, developed in the 70's and 80's, is hard for average Web users to implement (1). In relevance feedback, the user submits a query and examines the top N ranked documents (typically the top 1 - 30 results), marks those considered relevant and non-relevant, and selects terms considered important based on term occurrences (frequencies) in the relevant documents and in the non-relevant documents (1). The discovered terms are then added to the query and the cycle is repeated. Baeza-Yates and Ribeiro-Neto (5) recommend that "any experimentation involving relevance feedback strategies should always evaluate recall-precision figures relative to the residual collection." Thus, implementation is a formidable task for average Web users, who at each cycle must decide which documents are relevant or non-relevant.

Terms clustering cannot help either. This is a technique pioneered in the 60's and 70's by Sparck Jones. Terms are discovered and grouped into clusters based on their co-occurrences and the clusters are used for query expansion (2 - 4). However, this approach cannot discriminate between ambiguous terms or terms with several meanings. Thus, trying to use plain terms clustering in query expansion or as an online discovery procedure may lead to ambiguous results.

Another technique similar to relevance feedback is local feedback. In this technique, the top N ranked documents are considered relevant. A term is discovered and then added to the query based on the occurrence of the term in the top ranked documents. This criterion fails to refine a query if most of the top N ranked documents are in fact non-relevant (3, 4).

2.2 Local Context Analysis

J. Xu and W. B. Croft (4) have developed an excellent query expansion technique called local context analysis (LCA). LCA is based on the use of expansion concepts. Baeza-Yates and Ribeiro-Neto prefer to use the expression document concepts (5). Expansion concepts or document concepts are noun phrases; that is, noun groups consisting of one, two or three adjacent nouns. For instance, the phrases food recipe and California car insurance are expansion concepts but cheap hotel and exciting hot vacations are not.

LCA discovers expansion concepts as follows. Instead of examining entire documents, concepts are extracted from passages of fixed length L; typically from a fixed-length window of 300 words. Concepts are ranked by their co-occurrence with the query terms in the top N ranked documents. The highest ranked concepts are then used for query expansion. The model effectively applies a novel tf*idf treatment and co-occurrence theory to a subset that is part of a global set (4, 5). This combination of local and global techniques produces more efficient query expansion. Nouns are used because research suggests that these are more informative and provide more possibilities for expanding queries than other types of terms (6).

Xu and Croft have found that the larger the number of co-occurrences, the less likely that an expansion concept co-occurs with the query by chance. They also studied the effect of L and N and found that LCA is less sensitive to the number of passages/documents used than local feedback. One limitation of LCA is that its implementation is based on fixed-length document segmentation, typically using passages of 300 words. This is done to overcome the difficulty of length normalization and to improve retrieval performance with documents of different lengths.
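The fixed-length segmentation step described above can be illustrated with a minimal sketch. This is not Xu and Croft's implementation; the function name `fixed_length_passages` and the whitespace tokenization are assumptions made only for illustration. The sketch shows why cuts fall every L words regardless of sentence or topic boundaries.

```python
def fixed_length_passages(text, length=300):
    """Split text into consecutive passages of at most `length` words.

    Words are separated on whitespace (an assumption); the last passage
    may be shorter than `length`.
    """
    words = text.split()
    return [words[i:i + length] for i in range(0, len(words), length)]


# Usage with a tiny window so the clean cuts are easy to see.
passages = fixed_length_passages("mexican food recipes for tortillas and burritos", length=3)
for passage in passages:
    print(" ".join(passage))
```

Note that the second cut separates "tortillas and" from "burritos", which is the kind of semantic break that fixed-length windows cannot avoid.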

2.3 Document and Passage Segmentation

However, according to a recent paper by a Microsoft research group (7), the main shortcoming of fixed-length segmentation techniques is that no semantic information is taken into account in the segmentation process. Clean cuts every L words and without transition margins can affect the flow and presentation of semantics. Because of this, some prefer to use discourse passage techniques or semantic segmentation techniques. Discourse passage segmentation takes into account document punctuation. Semantic passage segmentation consists of partitioning a document into topics or sub-topics according to its semantic structure and flow (7, 8). However, these techniques must deal with the problem of length normalization.

Before using any query expansion or segmentation technique for term discovery, one must pay attention to the following. First, query expansion requires the use of long queries, while on the Web we normally use short queries, typically consisting of 2 to 3 terms. Second, the goal of query expansion is to find terms for describing relevant documents, not to find terms for refining theme structures. However, it could be conjectured that terms discovered through query expansion could reinforce the hierarchical structure of a theme site if the initial query represents a theme or part of a theme. To illustrate, compare Figure 1 with the following query expansion

Input query: mexican
Expanded query 1: mexican food
Expanded query 2: mexican food recipes
Expanded query 3: mexican food recipes tortillas
Output query: mexican food recipes tortillas burritos

With regard to passage segmentation, it should be pointed out that Web documents often exhibit different types of discourse, flows of ideas, lengths or DOM structures. Unlike IR systems under lab conditions, the Web is highly susceptible to commercial interests and all sorts of content aggregation alliances. Web content is frequently updated, linked to other content, or even manipulated for the sole purpose of positioning the documents at the top of the search results. Thus, terms extracted from the top N ranked Web documents or top document passages do not necessarily convey more precise information than terms extracted from the top N titles.

2.4 Thesaurus Data Structures

French, Olsen and Martin have proposed an interesting thesaurus approach to the problem of discovering and suggesting terms to searchers (9). Their model is based on a thesaurus data structure that conceptually links terms with appropriate relationships using the general term organization

Top > Broader > Narrower > Related (Arbitrary)

Although applied to searches, this data structure strongly resembles the hierarchical structure found in theme sites. According to their model, Top are terms at the top of the hierarchy. These are the broadest terms containing a concept. Broader are immediate predecessor terms in the hierarchy. These are followed by narrower terms. Narrower are specific terms and are followed by related terms. Related are arbitrary terms in the thesaurus structure and are used to improve navigation. For example, a search for global change could guide searchers through the sequence

global change > air pollution > air quality > sulfur dioxide.

Ideally, Web users would like to be guided through well-resolved data structures for any given top term (e.g., a theme). However, word associations on the Web are not just the result of dictionary definitions or thesaurus data structures. Word associations are also the result of common usage, regionalisms, marketing strategies, or language peculiarities. Thus, trying to guide average Web searchers through thesaurus data structures or trying to extract well-resolved hierarchical structures from Web documents is a formidable challenge.

2.5 Keyword Lists

An alternative to relevance feedback, query expansion, segmentation, or thesaurus-based structures could be the use of pre-qualified lists of terms. Also known as keyword lists, these are free and paid services offered by search engines and search marketers. Terms are typically qualified through a given metric (popularity, most frequently searched terms, most frequently clicked terms, search logs and similar metrics) (10 - 17). Wisely used, these lists are excellent term discovery and theme building tools. However, before using these lists, online searchers must keep in mind that terms that have been pre-qualified by such metrics are not necessarily on-topic and do not necessarily qualify for the intended theme.

In this experiment, we present a procedure for the online discovery of on-topic terms. We refer to our procedure as on-topic analysis. We also investigate if well-resolved hierarchical data structures can be extracted from online searches. The rest of this paper is organized as follows: Section 3 explains the experimental conditions, sampling, and data acquisition software. Section 4 describes test results. Section 5 discusses applications and limitations, the implications of the experiment, and suggests future work. Section 6 summarizes and draws conclusions.

3. PROCEDURE

3.1 Query Conditions

Considering that average users tend to search in default modes, all searches were conducted in Google using its default query mode. With Google and most search engines, this mode is FINDALL (also known as AND). Terms specified in this query mode must appear in the retrieved documents. The terms can be anywhere in the documents and without regard for order and proximity. However, since search engines tend to assign high importance to terms placed in titles, queried terms are usually found in the title of the top N ranked documents (and at the beginning of these). In this experiment, we define the top N most relevant documents as the top 30 ranked results. This definition is consistent with previous relevance feedback and LCA (5, 6) research.

3.2 Sampling

In addition to pre-selected queries, we requested candidate queries from five search engine optimization specialists (SEOs). To establish equal query conditions, the specialists were asked to submit key phrases describing a theme of the form Q = k1 + k2, where each k was a single, space-delimited, English term. To ensure that k1 and k2 shared some degree of co-occurrence in the Google database, we accepted key phrases with a co-occurrence index (c-index) equal to or greater than an empirical threshold. Appendix A and reference 17 discuss term co-occurrence and c-index theory. In the absence of a standard reference point, we qualified queries based on our experiences with Google. We have found that many competitive two-word queries have c-index values around 25 parts per thousand (ppt) in Google. Accordingly, we used key phrases with c-index values equal to or greater than 25 ppt.

3.3 Data Acquisition

All experimental data was acquired and processed with our Semantic Analyzer (SA). This tool includes a parser, a word counter, a stop list, a library of regular expressions, and several similarity calculators (Cosine, Dice, Salton's Index, etc). The tool also has a tf*idf, Zipf rank, c-index and EF-ratio calculator. SA can be programmed to exclude or include stop words and sort results according to occurrence probabilities.

In a typical analysis we query Google with a theme phrase, retrieve the top N ranked documents, and extract a passage from each document. In this experiment, a passage is defined as the title of a document. Since not all titles are of the same length, we must deal with the problem of selecting optimum title lengths.

Fortunately, Google limits the display of titles to fewer than 68 characters (including spaces). Some search engines use other lengths. The W3C suggests keeping titles to less than 64 characters (18). Many SEOs prefer to keep titles within a 50- to 60-character limit to ensure they display in their entirety in all engines. The consensus is that short titles convey more precise information than long titles, especially when relevant terms are placed at the beginning of the title. For our on-topic analysis, we limit each title to the visible title displayed by the retrieval system, whatever that length may be.

All passages are imported into our software and converted into a stream of lower case terms. The stream is tokenized and stop words are removed. We do not use any stemming method. Next, unique terms are counted and stored in a term array as terms are discovered. Finally, unique terms are sorted according to their occurrence probabilities Pi, where

Pi = Fi/Ftotal

Fi is the number of occurrences of term i and Ftotal is the sum of all the occurrences. Terms with the same probability are listed according to their positions in the stream of terms; thus, terms listed at the very bottom of the Pi lists occur in the least relevant titles. For each set of extracted terms, we compute the number of unique terms per title and the total number of terms per title. To improve presentation, terms are uppercased and listed in HTML-generated tables. The entire process is seamless and transparent to end-users.
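The title-processing steps just described (lower-casing, tokenizing, stop word removal, counting, and sorting by Pi = Fi/Ftotal) can be sketched as follows. This is a minimal illustration, not the Semantic Analyzer itself: the stop list, the tokenizing regular expression, and the sample titles are all assumptions, and Pi is returned as a fraction rather than a percent.

```python
import re
from collections import Counter

# Hypothetical, very small stop list; the actual list used by the tool is not given.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "by", "on", "at"}


def occurrence_probabilities(titles, stop_words=STOP_WORDS):
    """Return (term, Pi) pairs sorted by Pi, highest first.

    Terms with equal Pi keep the order in which they first appear in the
    stream of terms. No stemming is applied.
    """
    stream = []
    for title in titles:
        # Assumed tokenization: runs of lower-case letters and digits.
        tokens = re.findall(r"[a-z0-9]+", title.lower())
        stream.extend(t for t in tokens if t not in stop_words)
    counts = Counter(stream)          # keeps order of first appearance
    total = sum(counts.values())      # Ftotal
    ranked = sorted(counts.items(), key=lambda item: -item[1])  # stable sort
    return [(term, freq / total) for term, freq in ranked]


# Usage with hypothetical titles (not actual search results).
sample_titles = [
    "Mexican Food Recipes",
    "Authentic Mexican Recipes",
    "The Food of Mexico",
]
for term, p in occurrence_probabilities(sample_titles):
    print(f"{term.upper():10s} {100 * p:.1f}%")
```

Because Python's sort is stable and `Counter` preserves insertion order, ties are broken by stream position, which mirrors the tie rule stated above.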

3.4 Data Processing

We now discuss computation overhead and processing speed. Retrieval speed is determined by the queried system. In a typical analysis, text processing, stop word removal, Pi computations, and generation of sorted HTML tables take a few seconds. Since the tool is browser-based, processing speed is limited by the ability of the browser to process large N values and interpret long HTML tables.

Our software handles on-demand discovery of terms for any practical N value and theme. Currently, the c-index calculator computes c-index values for two and three co-occurring terms. We plan to implement a matrix-based calculator that would handle any number of c-index calculations for any number of terms.

4. RESULTS

4.1 Implementation

We now discuss the implementation of our on-topic analysis. First, we present results for top terms and discuss the effect of N on Pi. Next, we present results for broader, narrower, and off-topic terms. Finally, we compare results with a third-party tool.

4.2 Top Terms

Figure 2 shows c-index data for previously selected themes (Tables 2 - 4, 10) and themes submitted by SEOs (1, 5 - 9). Occurrence and co-occurrence results change over time since Google is constantly upgrading its index. However, c-index trends can be extracted from time series charts.

Figure 2. c-index values for several themes.


In Figure 2, the n1 and n2 quantities are the numbers of documents containing the k1 and k2 terms, respectively, and are taken as term occurrences. The n12 quantities are the numbers of documents containing both k1 and k2 and are taken as absolute term co-occurrences. The last column of the table shows the c12-indices, in parts per thousand (ppt). Figure 3 shows a Venn Diagram representation of occurrences and co-occurrences.

Figure 3. Venn Diagram for two non-mutually exclusive sets of results, n1 and n2.


The expression for the c12-index discussed in Appendix A can be derived directly from the intersection/union ratio of the Venn Diagram. Note that a c-index is a normalized co-occurrence. For the purpose of comparing term co-occurrences between different queries and sets of documents, this is a better measure than absolute co-occurrence values since different queries (k1, k2, and k12) retrieve different sets of documents (n1, n2, and n12). To illustrate, note from Figure 2 that car insurance returns 12,500,000 documents and football odds returns 2,790,000 documents. However, the normalized co-occurrences are 39.49 and 56.71 ppt, respectively.
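The intersection/union ratio mentioned above can be written as a short sketch. It assumes the form c12 = n12/(n1 + n2 - n12), scaled to parts per thousand, which agrees with the distance expression given in Section 5.4.1; the function names and the document counts in the usage lines are hypothetical and are not taken from Figure 2.

```python
def c_index_ppt(n1, n2, n12):
    """Normalized co-occurrence of two terms, in parts per thousand (ppt).

    n1, n2: number of documents containing k1 and k2, respectively.
    n12:    number of documents containing both k1 and k2.
    """
    union = n1 + n2 - n12  # documents containing k1 or k2
    if union <= 0:
        raise ValueError("n1 + n2 - n12 must be positive")
    return 1000.0 * n12 / union


def qualifies(n1, n2, n12, threshold_ppt=25.0):
    """True if the pair meets the empirical 25 ppt threshold used in Section 3.2."""
    return c_index_ppt(n1, n2, n12) >= threshold_ppt


# Usage with hypothetical counts.
print(c_index_ppt(300, 200, 100))       # 100 / (300 + 200 - 100) as ppt
print(qualifies(1_000_000, 500_000, 10_000))
```

Normalizing by the union is what makes values from different queries comparable: a large absolute n12 can still yield a small c-index when n1 and n2 are much larger.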

4.3 Terms Extraction

For all queries, we computed results for the top 30 ranked titles. However, for comparison purposes, we also computed results for the top 100 and 200 titles. The objective was to determine the impact of N on the relative distribution of Pi values and to observe if off-topic terms and narrower on-topic terms can be resolved based solely on Pi values.

To test our procedure, we first collected results for the phrase mexican food (i.e., k1 = mexican and k2 = food). These are shown in Table 1. Ri is the position of term i in the array of unique terms and Pi is expressed as a percent. The table shows that

These results suggest that terms such as recipes, cuisine, cooking are on-topic broader terms and could be used to improve the semantics of a theme site or Web document about mexican food.

4.4 N and Pi values

We now check the impact of N on the distribution of terms. As N is increased from 30 to 100 and to 200

We have consistently observed similar results with other queries. This is illustrated with previously qualified queries (see Tables 2, 3, and 4), and with queries submitted by SEOs (see Tables 5, 6, 7, 8, and 9).

4.5 Broader Terms

We now discuss some facts about broader terms. Table 4 (auto insurance) and Table 5 (car insurance) show results for two queries describing the same concept; i.e., automobile insurance. The tables reveal several similarities and differences. First, the terms car, insurance, auto, quotes, and quote are found at the top of the lists. Thus, these are broader terms for the corresponding theme.

For N = 30, the two tables list many dissimilar terms. Moreover, note that the term UK, consistently found at the top of the lists in Table 5 (car insurance), is not found in the top 30 and 100 results of Table 4 (auto insurance). The fact that UK is consistently found in the top 30, 100, and 200 titles relevant to car insurance could be the result of a well-organized geo-targeting strategy.

This illustrates that a term found at the top of the Pi lists is not always on-topic. Thus, one must discriminate between on-topic and off-topic terms with high Pi values. Indeed, a query consisting of loosely connected terms could produce probability lists with off-topic terms at the top of the lists.

For example, most searchers associate the word aloha with Hawaii rather than with Indiana or Montana. This is a semantic association due to common usage, not a synonymy association based on dictionary definitions or on thesaurus data structures. As expected, c-index calculations for aloha hawaii, aloha indiana, and aloha montana agree with the general notion that aloha is more "connected" with Hawaii than with Indiana or Montana; i.e.,

c12-index (aloha hawaii) = 29.67 ppt
c12-index (aloha indiana) = 3.16 ppt
c12-index (aloha montana) = 3.99 ppt

Table 10 shows Pi values for the top 30 ranked titles for these queries. The query aloha hawaii discovers many on-topic terms while the aloha indiana and aloha montana queries return too many off-topic terms at the top of the lists. In general, loosely connected, off-topic terms tend to return more off-topic terms.

4.6 Narrower Terms

We now address the question of qualifying and discovering narrower terms. In this experiment, we have used the c-index concept as a qualifying metric for terms that are part of a theme. Broader terms were combined with theme terms and qualified with the c-index metric.

However, this approach rarely works with on-topic narrower terms. These terms tend to produce low c-index values when combined with theme or broader terms. This is not surprising. Compared with theme and broader terms, narrower terms tend to have lower occurrences. Thus, a c-index calculation would involve addition, subtraction, and division between relatively large and relatively small quantities.

The simplest way to qualify on-topic narrower terms consists of combining them with other on-topic narrower terms and checking their c-index values. This approach helps with the qualification of narrower terms and leads to the discovery of new on-topic narrower terms. To illustrate, consider the queries tortillas burritos (c12-index = 36.47 ppt) and tamales fajitas (c12-index = 30.01 ppt). Based on their c-indices, these are qualified on-topic narrower terms. Table 11 shows results for these queries. Note that new narrower on-topic terms are discovered. It can be observed that these terms "gravitate" around terms such as recipes, mexican, restaurant, and similar terms.

4.7 Off-Topic Terms

We now address the problem of discriminating between off-topic and on-topic terms found at the top of the Pi lists. For this purpose, we present three different approaches, herein referred to as Method I, Method II, and Method III. In Method I, we compute a c-index value between the intended two-word theme and a candidate term. If the result is below a given threshold value the term is rejected. The working expression for calculating a c-index for three co-occurring terms is described in Appendix A. This expression can be derived directly from the Venn Diagram shown in Figure 4.

Figure 4. Venn Diagram for three non-mutually exclusive sets of results, n1, n2, and n3. Note that the normalized co-occurrence (given in terms of a c-index value) is affected by a cluster of occurrences and co-occurrences.
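Method I can be sketched under one explicit assumption: that the three-term c-index is the same intersection/union ratio used for two terms, with the union obtained by inclusion-exclusion over the regions of the Venn Diagram. The working expression in Appendix A is not reproduced here, so this form, the function names, and the counts below are illustrative only.

```python
def c_index3_ppt(n1, n2, n3, n12, n13, n23, n123):
    """Assumed three-term c-index (intersection/union), in parts per thousand.

    n1, n2, n3:    documents containing each single term.
    n12, n13, n23: documents containing each pair of terms.
    n123:          documents containing all three terms.
    """
    # Inclusion-exclusion gives the size of the union of the three result sets.
    union = n1 + n2 + n3 - n12 - n13 - n23 + n123
    if union <= 0:
        raise ValueError("union of the three result sets must be positive")
    return 1000.0 * n123 / union


def passes_method_i(counts, threshold_ppt):
    """Reject the candidate term when its c-index falls below the threshold."""
    return c_index3_ppt(*counts) >= threshold_ppt


# Usage with hypothetical counts for a two-word theme plus one candidate term.
counts = (4, 3, 3, 2, 1, 2, 1)
print(round(c_index3_ppt(*counts), 2))
print(passes_method_i(counts, threshold_ppt=25.0))
```

The denominator shows why the caption speaks of a cluster of occurrences and co-occurrences: all seven counts enter the normalization.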


We now discuss Method II. In this method we redefine the two-word theme as a single query term and calculate a pairwise c12-index value. This is accomplished by enclosing the two-word theme in quotes. For example, to examine the pairwise c-index for UK and car insurance we define the following terms

k1=UK
k2="car insurance"
k12=UK "car insurance"

However, this method imposes EXACT query conditions on the car insurance terms since now these must be found in the documents with regard for order and proximity.

In Method III, we combine theme terms with a candidate term and calculate pairwise c-indices. For multiple terms and long themes, the use of an n x n co-occurrence matrix ensures that all possible k12 combinations are checked. However, this method often leads to the calculation of c-index values for combinations that are clearly off-topic or trivial. This is usually the case with combinations involving non-nouns. For example, in Table 7 (used cars) the term new is found at the top of the Pi lists. Computing a c-index for new used is a futile exercise and supports Xu and Croft's assertion that nouns are more informative than other types of terms (4).
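The query strings needed for Methods II and III can be generated mechanically. The sketch below is only an illustration of that bookkeeping: the function names are hypothetical, the theme is split on whitespace, and the reading of Method III as "every unordered pair among the theme terms and the candidate" is an assumption based on the n x n matrix description.

```python
from itertools import combinations


def method_ii_queries(theme, candidate):
    """Method II: quote the theme so it behaves as a single (EXACT) term."""
    k1 = candidate
    k2 = f'"{theme}"'
    return {"k1": k1, "k2": k2, "k12": f"{k1} {k2}"}


def method_iii_pairs(theme, candidate):
    """Method III: all unordered term pairs among theme terms and the candidate."""
    terms = theme.split() + [candidate]
    return list(combinations(terms, 2))


# Usage with the UK / car insurance example from the text.
print(method_ii_queries("car insurance", "UK"))
for k1, k2 in method_iii_pairs("car insurance", "UK"):
    print(k1, k2)
```

For a theme of m terms plus one candidate, Method III produces (m + 1)m/2 pairs, which is why trivial combinations such as new used appear as themes grow.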

Methods I, II, and III are described in Figure 5. For illustrative purposes, the term UK and the car insurance theme are inspected.

Figure 5. Discriminating methods.


Methods I and II reveal that UK is loosely connected with the car insurance theme and thus must be disqualified as a topic term. To compare results, we have checked other candidate terms found in Table 5. Method III produces c-index values that are closer to, but still below, the threshold value.

4.8 Comparative Work

We now compare our results with results obtained with the Google AdWords Keyword Tool (10), an online tool available from Google. Before proceeding any further, we must point out that any comparison must be placed in the right perspective for two reasons. First, the Keyword Tool discovers terms based on popular searches while our procedure discovers terms from top titles.

Second, our procedure was designed to discover terms from passages from any search engine, while the Keyword Tool was designed to discover terms somewhat associated with ads placed in Google; thus, discovery is conditioned by search behaviors and ad activities in Google. Therefore, any comparison in terms of performance or c-indices would be unsustainable. Rather, we want to compare which terms are actually discovered by the two tools.

The Keyword Tool provides two types of results, labeled as More Specific Keywords and Similar Keywords. According to Google, More Specific Keywords returns "popular queries" that include the specified term(s). Similar Keywords lists expanded broad matches and additional keywords. According to Google, users who search for the specified query also search for the terms listed as Similar Keywords.

Table 12 shows results for the mexican food query using the Keyword Tool. It is hard for a first-time user to tell whether the results are unsorted or sorted (according to a given parameter). A comparison between Table 12 and Table 1 shows that many of the same terms listed by the Keyword Tool are also listed and sorted by our tool.

Despite its popularity, the Keyword Tool has several limitations. The first obvious one is that term discovery is somehow conditioned by search behaviors and ads in Google. Second, some combinations of terms produce partial results or no results at all. This is understandable since the Keyword Tool was designed to improve the relevance of terms associated with ads, not to work as a general-purpose term discovery tool.

Table 13 shows results using the Keyword Tool. At the time of writing this paper, the tool did not return any result for the query nigritude ultramarine. A search for integrated optimization returned results in Similar Keywords but not in More Specific Keywords. We have observed similar results with other queries. Thus, online discovery of on-topic terms is not always possible with this tool. Table 14 (nigritude ultramarine; c12-index = 382.80 ppt) and Table 15 (integrated optimization; c12-index = 43.81 ppt) show results obtained with our tool. Note which on/off-topic terms are discovered.

5. DISCUSSION

5.1 Applications

In this section we introduce the concept of optimum terms and term distances. Next, we discuss several possible applications for our on-topic analysis. After that, we elaborate on the limitations of our procedure and discuss future work.

5.2 Optimum Terms

There are many tools that provide results based on searchers' behaviors, somehow associated with ads or based on search logs (e.g., the Google Keyword Tool, Overture's Term Suggestion Tool, Wordtracker, etc). However, these tools do not provide users with occurrence or co-occurrence data. For instance, the Google Keyword Tool does not tell users

  1. which individual terms are used the most with popular queries.
  2. which terms targeted by searchers are also targeted by top N ranked results.

Point 1 can be addressed by importing results obtained with the Keyword Tool into our tool and by sorting results according to Pi values. This is demonstrated in Table 16 with the queries madonna and paris hilton. The table shows results listed as More Specific Keywords by the Keyword Tool. According to Google, these "are popular queries" containing the terms madonna and paris hilton. Which terms are used the most with these popular queries? Our tool reveals that these terms are lyrics and nicky, respectively.

Let us now address Point 2. Terms that are both frequently searched and frequently found in the top N ranked titles are terms relevant to both information seekers and ranking algorithms. These terms tend to show near or at the top of popularity- and ranking-based Pi lists. While some terms could appear in both lists, few are found near or at the top of both lists. Many of these terms are also c-index qualified and tend to be good return-on-investment performers. We refer to these terms as optimum terms. Our procedure and software identify candidate optimum terms in a straightforward manner.

Table 16 indicates that the terms lyrics and nicky are used the most with the queries madonna and paris hilton. Are these terms co-occurring the most with these queries in the top 30 titles? Table 17 reveals that this is the case of lyrics but not of nicky. Thus, lyrics is a candidate optimum term for madonna. A comparison between tables suggests that the optimum term for paris hilton is not nicky but pictures, which appears near the top of both Pi lists.
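The comparison between the two Pi lists can be sketched as a simple intersection of their top entries. The text does not quantify "near or at the top", so the `top_k` cutoff is an assumption, as are the function name and the two ranked lists in the usage lines, which are hypothetical and do not reproduce Tables 16 and 17.

```python
def candidate_optimum_terms(popularity_ranked, title_ranked, top_k=5):
    """Return terms that appear in the top_k of both Pi-sorted lists.

    popularity_ranked: terms sorted by Pi from popular-query data.
    title_ranked:      terms sorted by Pi from the top N ranked titles.
    The result keeps the order of the popularity list.
    """
    top_titles = set(title_ranked[:top_k])
    return [term for term in popularity_ranked[:top_k] if term in top_titles]


# Usage with hypothetical ranked lists.
popular = ["nicky", "pictures", "video", "photos"]
titles = ["hilton", "paris", "pictures", "photos", "news"]
print(candidate_optimum_terms(popular, titles, top_k=3))
```

A term that tops only one list, like the first entry of the hypothetical popularity list, is filtered out, which matches the reasoning used above for nicky.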

5.3 Specific Applications

5.3.1 Patent Resources

In addition to Web marketing, our procedure can be used to discover terms from any IR system that returns titles. The United States Patent and Trademark Office (USPTO) uses one such system. However, on-topic analysis in the USPTO system is challenging since patent titles are often descriptive. Terms such as method, system, apparatus, data, process, and derivative terms are frequently found in these titles. Thus, these descriptive terms usually show at the top of the occurrence lists with many patent searches. For all practical purposes, these terms have no discriminating power and could be taken for stop words.

Table 18 shows partial results in which we queried the USPTO database for patent titles from 1976 to present. The query submitted was network management. Only the top 30 ranked titles were inspected. We conducted three types of searches. First, a quick search was conducted by entering the two terms (k1=network and k2=management) in separate search fields. In the USPTO system this is equivalent to conducting a FINDALL (AND) query. Second, a quick search was conducted by entering k1 and k2 in one field. In the USPTO system, this is equivalent to conducting an EXACT query (a search enclosed by quotes). Finally, we searched with the USPTO command ttl/(k1 AND (k2)). Descriptive terms are found at the top of the results. Our experience with these and similar queries is that queried terms and on-topic terms are found at the top of the occurrence list when the search is conducted with the ttl/(k1 AND (k2)) command. The reason is that we are restricting the search to patent titles containing k1 and k2.

5.3.2 Electronic Resources

On-topic analysis could be applied to other systems that return titles, such as financial, legal, and court systems, electronic libraries, abstract services (e.g., Citeseer, Chemical Abstracts, Dissertation Abstracts), news headlines, and similar services.

Our procedure could also be incorporated into any fast-paced homeland security environment in which quick cataloging, information profiling, and pattern mining are necessary. The prescribed on-topic analysis could be as follows.

  1. A system is queried for some profiling terms or a theme, and occurrence distributions are obtained.
  2. Next, the normalized co-occurrences of broader terms are examined.
  3. Terms that qualify for a given c-index threshold are then queried.
  4. New Pi values are obtained and the cycle is repeated.

The objective is to reach a strongly connected data structure. In the next section, we suggest a method aimed at visualizing such structures.
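The four-step cycle above could be sketched roughly as follows. This is only an illustration under stated assumptions: fetch_titles and c_index_ppt are hypothetical callables standing in for the queried system and for a c-index lookup, the 25 ppt default threshold and the max_rounds cap are arbitrary choices, and the sketch records raw occurrence counts rather than Pi values. It also does not separate broader from narrower terms; every term found in the titles is treated as a candidate.

```python
from collections import Counter


def term_counts(titles):
    """Count in how many titles each lowercase term occurs."""
    counts = Counter()
    for title in titles:
        counts.update(set(title.lower().split()))
    return counts


def expand_terms(seed, fetch_titles, c_index_ppt, threshold_ppt=25.0, max_rounds=3):
    """Repeat the query/examine/threshold cycle starting from a seed query.

    fetch_titles(query) -> list of titles        (hypothetical callable)
    c_index_ppt(query, term) -> c-index in ppt   (hypothetical callable)
    Returns a dict mapping each queried term to its occurrence counts.
    """
    queried = set()
    frontier = [seed]
    distributions = {}
    for _ in range(max_rounds):
        next_frontier = []
        for query in frontier:
            if query in queried:
                continue
            queried.add(query)
            # Steps 1 and 4: query the system, obtain an occurrence distribution.
            counts = term_counts(fetch_titles(query))
            distributions[query] = counts
            # Steps 2 and 3: keep candidate terms that meet the threshold.
            for term in counts:
                if term in queried or term in next_frontier:
                    continue
                if c_index_ppt(query, term) >= threshold_ppt:
                    next_frontier.append(term)
        if not next_frontier:
            break
        frontier = next_frontier
    return distributions


# Toy stand-ins for a real system; all values are invented.
fake_titles = {"tacos": ["tacos recipes", "tacos salsa"], "recipes": ["recipes food"]}
fake_c = {("tacos", "recipes"): 40.0, ("tacos", "salsa"): 10.0, ("recipes", "food"): 30.0}

result = expand_terms(
    "tacos",
    lambda q: fake_titles.get(q, []),
    lambda q, t: fake_c.get((q, t), 0.0),
)
print(sorted(result))
```

In this toy run, salsa falls below the threshold and is never queried, while recipes and food are pulled into the structure in successive rounds.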

5.4 Implications

5.4.1 Visualization of Term Distances

Normalized co-occurrences may have some implications in the area of data structure visualization. For example, c-indices can be used to produce a visual representation similar to the tree diagrams obtained with InPharmix's PDQ_MED (19). PDQ_MED provides a graphical representation in which the inverse of the frequency of co-occurrences is represented as the linkages between terms. As a modification of this concept, the following transformation could be employed for a pair of terms

d = log (1/c-index)

where d is the distance between terms and the c-index is not expressed in ppt but as a fraction. Thus,

d = log ((n1 + n2 - n12)/n12)

A sample diagram is illustrated in Figure 6.

Figure 6. Term distances for a theme site about mexican food. Distances are not to scale and are given in arbitrary units.


The diagram shows which terms are strongly and loosely connected. Strongly connected terms are separated by shorter distances. The extreme case of c12-index = 1 means that n1 = n2 = n12, in which case d = 0. Note that food and recipes are on-topic broader terms and tortillas and burritos are on-topic narrower terms. Both types of terms are strongly connected in this hierarchical structure.
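The distance transformation d = log ((n1 + n2 - n12)/n12) can be illustrated with a short Python sketch. The text does not state the base of the logarithm, so base 10 is assumed here; a different base only rescales the arbitrary units. The function name term_distance and the document counts in the example are hypothetical.

```python
import math


def term_distance(n1, n2, n12):
    """Distance d = log10((n1 + n2 - n12) / n12) between two terms.

    n1, n2: number of documents containing each term
    n12:    number of documents containing both terms
    Base 10 is an assumption; the source only writes "log".
    """
    if n12 <= 0:
        raise ValueError("terms never co-occur; the distance is undefined")
    return math.log10((n1 + n2 - n12) / n12)


# Hypothetical counts chosen only to show the trend.
identical = term_distance(50, 50, 50)    # n1 = n2 = n12, so c-index = 1
close = term_distance(1000, 800, 600)    # strongly connected terms
far = term_distance(1000, 800, 12)       # loosely connected terms
print(identical, close, far)
```

Terms that co-occur in most of their documents end up close together, and the limiting case n1 = n2 = n12 gives a distance of zero, as described above.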

5.4.2 Disambiguation and Suggestion

Our on-topic analysis has some implications for query disambiguation. Consider a simple query such as madonna. A search for this term does not tell an IR system if a user is searching for Madonna College, Madonna Hospital or Madonna the singer, unless the system is provided with on-topic terms.

An IBM research team has studied the problem of disambiguation (20). According to this group "...disambiguation can be achieved by relying on the presence or absence of additional terms that appear in the context of a subject. The basic premise is that the user is interested in a particular domain, which may be identified by a particular vocabulary of on-topic terms and off-topic terms." The IBM approach requires a good understanding of tf*idf theory from the user.

Our approach is quite different, is not tf*idf-based, and is end-user oriented. Our procedure identifies on-topic terms, which can then be appended to any initial query. Thus, a search for madonna lyrics is a search for Madonna the singer.

On-topic analysis can also be run in the background of an IR system as a suggestion tool. Consider a user searching for a query X whose results page contains the message "Users searching for X also search for Y1, Y2, Y3,...", where Y1, Y2, Y3,... are terms discovered with X and listed according to c-index or Pi values.
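A suggestion message of this kind could be assembled as in the sketch below, which ranks discovered terms by their c-index values. The function name suggestion_message, the top_k parameter, and the c-index values are assumptions made for illustration; the numbers are invented and are not measured results.

```python
def suggestion_message(query, c_index_by_term, top_k=3):
    """Build a suggestion line from terms ranked by descending c-index."""
    ranked = sorted(c_index_by_term.items(), key=lambda item: item[1], reverse=True)
    terms = [term for term, _ in ranked[:top_k]]
    return f"Users searching for {query} also search for {', '.join(terms)}"


# Invented c-index values (ppt) for terms discovered with the query.
discovered = {"lyrics": 80.0, "college": 5.0, "pictures": 40.0, "hospital": 2.0}
message = suggestion_message("madonna", discovered)
print(message)
```

The same function could rank by Pi values instead, since it only needs a score per term.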

5.5 Limitations

We now discuss the main limitations of our experiment and future work. We have limited our experiment to Google and to terms with co-occurrence values above the 25 ppt mark. However, there are no theoretical arguments for disqualifying terms based on any given co-occurrence threshold. Terms simply co-occur or do not.

We also limited our analysis to visible titles. In future work we plan to conduct experiments across several search engines and at different c-index thresholds. We also plan to conduct on-topic analysis using Google's allintitle command. While this should improve our results (searches are limited to titles containing the queried terms) this approach does not account for average users' behaviors since most searchers use default modes.

Last but not least, we plan to conduct on-topic analysis with the information displayed in search result pages. In this case, passages can be defined as the visible entries (titles and document content) displayed by the queried system.

6. CONCLUSION

We have presented a client-side procedure for the online discovery of on-topic terms. Given a query, it is possible to extract data structures of acceptable resolution from commercial documents. On-topic analysis reveals significant information about the occurrence and normalized co-occurrence of terms extracted from the top N ranked titles. In most cases, this information is extremely helpful for discriminating between on/off topic terms and for developing theme sites and Web documents.

Our experimental procedure allows users to identify top, broader, narrower and optimum terms. These terms can then be used to improve the semantics of themes, focused documents, and keyword-based marketing campaigns. On-topic analysis works because information is extracted from a subset of results whose relevancy has been pre-established by the queried system.

7. ACKNOWLEDGEMENTS

We would like to thank the following people and companies for their contributions to this experiment:

Derek Chew, from http://www.organic-rankings.com/
Ignacio "Nacho" Hernandez Jr., from http://www.mexgrocer.com/
Barry Schwartz, from http://www.rustybrick.com/
Frank Watson, from http://www.smart-keywords.com/
Joseph Morin, from http://www.boostranking.com/
Alan Perkins, from http://www.silverdisc.co.uk/
Chris Dimmock, from http://www.cogentis.com.au/

8. APPENDIX A. c-INDEX EXPRESSIONS

For three terms, k1, k2 and k3, c-index values can be calculated with

c123-index = n123 / (n1 + n2 + n3 - n12 - n13 - n23 + n123)

If only two terms are considered

c12-index = n12 / (n1 + n2 - n12)

where

n1 = number of documents containing k1
n2 = number of documents containing k2
n3 = number of documents containing k3
n12 = number of documents containing k1 and k2
n23 = number of documents containing k2 and k3
n13 = number of documents containing k1 and k3
n123 = number of documents containing k1, k2 and k3

We conveniently express c-index values in parts per thousand (ppt) since these are small values. To calculate c-indices for any number of terms or combination of terms in both FINDALL and EXACT modes, see reference 17.
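As a worked illustration, the sketch below computes two-term and three-term c-indices from a tiny in-memory collection and converts them to ppt. It assumes the two-term form c12-index = n12 / (n1 + n2 - n12), which is consistent with the distance formula in Section 5.4.1, and for three terms it assumes the analogous inclusion-exclusion form n123 / (n1 + n2 + n3 - n12 - n13 - n23 + n123). The helper names and the sample documents are hypothetical.

```python
def count_containing(documents, *terms):
    """Number of documents (sets of terms) that contain every given term."""
    return sum(1 for doc in documents if all(term in doc for term in terms))


def c12_index(documents, k1, k2):
    """Two-term c-index as a fraction: n12 / (n1 + n2 - n12)."""
    n1 = count_containing(documents, k1)
    n2 = count_containing(documents, k2)
    n12 = count_containing(documents, k1, k2)
    denominator = n1 + n2 - n12
    return n12 / denominator if denominator else 0.0


def c123_index(documents, k1, k2, k3):
    """Three-term c-index as a fraction (assumed inclusion-exclusion form)."""
    n1 = count_containing(documents, k1)
    n2 = count_containing(documents, k2)
    n3 = count_containing(documents, k3)
    n12 = count_containing(documents, k1, k2)
    n13 = count_containing(documents, k1, k3)
    n23 = count_containing(documents, k2, k3)
    n123 = count_containing(documents, k1, k2, k3)
    denominator = n1 + n2 + n3 - n12 - n13 - n23 + n123
    return n123 / denominator if denominator else 0.0


def to_ppt(fraction):
    """Express a fractional c-index in parts per thousand."""
    return 1000.0 * fraction


# Invented documents, each reduced to its set of terms.
documents = [
    {"mexican", "food", "recipes"},
    {"mexican", "food"},
    {"food", "recipes"},
    {"tortillas"},
]
c12 = c12_index(documents, "mexican", "food")
c123 = c123_index(documents, "mexican", "food", "recipes")
print(to_ppt(c12), to_ppt(c123))
```

In practice the n values would come from the document counts reported by the queried system rather than from a local collection.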

9. REFERENCES

  1. G. Salton and C. Buckley; Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41:288-297, 1990.
  2. K. Sparck Jones and D. M. Jackson; The use of automatically-obtained keyword classifications for information retrieval. Information Processing and Management, 5:175-201, 1970.
  3. R. Attar and A. S. Fraenkel; Local feedback in full-text retrieval systems. Journal of the ACM, 24(3):397-417, July 1977.
  4. J. Xu and W. B. Croft; Improving the Effectiveness of Informational Retrieval with Local Context Analysis
    http://citeseer.ist.psu.edu/cache/papers/cs/2875/http:zSzzSzwww.cs.umass.eduzSz~xuzSzlca.pdf/xu00improving.pdf
  5. R. Baeza-Yates and B. Ribeiro-Neto; Modern Information Retrieval, Chapter 5; ACM Press, 1999.
  6. Y. Jing and W. B. Croft; An association thesaurus for information retrieval. Proceedings of RIAO 94, pages 146-160, 1994.
  7. D. Cai, S. Yu, J. Wen and W. Ma; Block-based Web Search
    http://research.microsoft.com/asia/dload_files/group/ims/21.pdf
  8. J. M. Ponte and W. B. Croft; Text Segmentation by Topic. In Proceedings of the 1st European Conference on Research and Advanced Technology for Digital Libraries, 1997.
  9. J. C. French, L. M. Olsen, W. N. Martin; Thesaurus Support when Searching Earth Science Data
    http://esto.nasa.gov/conferences/estc-2002/Papers/B6P4(French2).pdf
  10. Google Adwords: Keyword Tool
    https://adwords.google.com/select/KeywordSandbox
  11. Google Press Center: Zeitgeist
    http://www.google.com/press/zeitgeist.html
  12. Overture - Search Term Suggestion Tool
    http://inventory.overture.com/d/searchinventory/suggestion/
  13. Yahoo! Buzz Index - Today's Top 20 Overall Searches
    http://buzz.yahoo.com/overall/
  14. Ask Jeeves About
    http://sp.ask.com/docs/about/jeevesiq.html?o=0
  15. Lycos 50
    http://50.lycos.com/
  16. WordTracker
    http://www.wordtracker.com/
  17. Keywords Co-Occurrence and Semantic Connectivity Strategies
    http://www.miislita.com/semantics/c-index-2.html
  18. The Head Element and Related Elements
    http://www.w3.org/MarkUp/html3/dochead.html
  19. PDQ_MED
    http://www.inpharmix.com/pdq_med_example.htm
  20. R. Nelken, E. Amitay, A. Soffer, W. Niblack, D.C. Smith;
    Disambiguation for Text Mining on the Web
    http://www2003.org/cdrom/papers/poster/p302/final_poster/final_html_version.htm

Copyright © 2006 Mi Islita.com