Home - Contacts - Terms -

Mi Islita

Document Indexing Tutorial

Document Indexing Tutorial for Information Retrieval Students and Search Engine Marketers

Dr. E. Garcia
Mi Islita.com
Email | Last Update: 09/12/05

Article 1 of the series Information Retrieval Tutorials

Topics

About these Tutorials and Resources

Avoiding Misconceptions

Indexing

Document Linearization

Filtration

Stemming

Weighting

Tutorial Review

Acknowledgements

References

About These Tutorials and Resources

The purpose of this series is to provide search engine marketing and optimization specialists with some basic knowledge on information retrieval. I believe this is necessary given the fact that

Avoiding Misconceptions

Some marketers from the SeoRoundTable.com (1) and SeoChat.com (2) sites have incorrectly claimed that

  1. 'Search engines remove stop words as part of their tokenization process, but the stop words are important elements of natural language and most likely analyzed as part of the raw data.'
  2. '...in tokenization, "computer", "computational" and "computing" all become "comput" '

For those not familiar with IR, these statements -perhaps given in good faith- give the impression that tokenization involves stop word removal or that tokenization reduces terms to stems (roots). Wrong! By definition tokenization is neither stop word removal nor stemming.

Regarding stopword removal, once they are removed they are not considered or included in the linearized representation of the document to be scored. Regarding term variations associated to "computer", my Porter's stemmer -double checked by Dr. Martin Porter himself via email- has no problem stemming these.

The purpose of this first tutorial is to explain to SEOs, SEMs and loyal visitors of this site the difference between these processes. The following information is based in part on Professor Ian Ruthven's (University of Strathclyde, Glasgow) and Mounia Lalmas's (Queen Mary, University of London) paper, A survey on the use of relevance feedback for information access systems (3).

Let's start the discussion by defining the following expressions

  1. Indexing
  2. Document Linearization
    • Markup and format Removal
    • Tokenization
  3. Filtration
  4. Stemming
  5. Weighting

Indexing

During indexing documents are prepared for use by an IR system. This means preparing the raw document collection into an easily accessible representation of documents. This transformation -from a document text into a representation of text- is known as indexing the documents. Transforming a document into an indexed form involves the use of

  1. a library or set of regular expressions
  2. parsers
  3. a library of stop words (a stop list)
  4. other miscellaneous filters

This is normally done in five steps: markup & format removal, tokenization, filtration, stemming and weighting. If no markup removal and weighting is required, then this transformation involves only tokenization, filtration and stemming (4). This type of indexing is frequently found in databases that merely sort text files and raw data. On the Web, however, the above five steps are used especially since documents are created in different formats and relevance scores are needed.

Document Linearization

Document Linearization is the process by which a document is reduced to a stream of terms. This is usually done in two steps and as follows.

  1. Markup and Format Removal During this phase, all markup tags and special formatting are removed from the document. Thus, for an HTML document all tags and text inside these are removed. This normally would include all element attributes, scripts, comment lines and text placed into these. Some commercial search engines may retain text placed inside the TITLE tag, image ALT attribute, table SUMMARY attribute and META DESCRIPTION tag. Other systems may not care for element attributes or meta data at all. Even others may retain and serve the end user with metadata information but may not consider this for ranking purposes.
  2. Tokenization During this phase, all remaining text is parsed, lowercased and all punctuation removed. Hyphenation rules must be invoked. For instance, some systems may elect to retain hyphens while others may be designed to either ignore hyphens or interpret these as spaces or as join tokens. From the query side, some search engines such as Google seem to interpret hyphens in queries as a kind of localized EXACT search (e.g., in a FINDALL query of the form k1-k2 + k3 the k1-k2 portion of the query is interpreted as a localized EXACT query within the FINDALL mode). Strange alphanumeric characters are normally removed during tokenization.

During linearization all Cascading Style Sheets (CSS) instructions are removed. This means that without a clear understanding of text linearization processes, the arbitrary repositioning of content using CSS, nested division tags or tables can indeed be detrimental in the sense that what users perceive as relevant content is not what the search engine is actually "reading" and scoring as relevant. In fact, after document linearization

  1. the effective flow of the text should describe a coherent stream of terms.
  2. this stream of text must capture the intended semantics, theme, topics, subtopics, etc of the document.
  3. the position of terms in the text stream is determined by how the markup lines (e.g., HTML tags) were declared in the source code.

All this underscores the fact that users's perception of relevancy (e.g., the information and its meaning displayed before users) and machine perception of relevancy (relevance scores) are two different things. More than often the way search engines perceive and interpret content and the way users and browsers read that content does not match. This is why it is so important to conduct document linearization -as part of a GAP analysis- before and after optimizing a document.

In many cases, document linearization reveals as a futility the dubious practice of trying to compute local or global keyword density values by visually scanning a document. More on this is given in The Keyword Density of Non-Sense article (4).

Filtration

Filtration refers to the process of deciding which terms should be used to represent the documents so that these can be used for

  1. describing the document's content.
  2. discriminating the document from the other documents in the collection.

Frequently used terms cannot be used for this purpose for two reasons. First, the number of documents that are relevant to a query is likely to be a small proportion of the collection. A term that will be effective in separating the relevant documents from the non-relevant documents, then, is likely to be a term that appears in a small number of documents. This means that high frequency terms are poor discriminators. The second reason is that terms appearing in many contexts do not define a topic or sub-topic of a document. As Ruthven and Lalmas state

"The more documents in which a term appears (the more contexts in which it is used) then the less likely it is to be a content-bearing term. Consequently it is less likely that the term is one of those terms that contribute to the user's relevance assessment. Hence, terms that appear in many documents are less likely to be the ones used by a searcher to discriminate between relevant and non-relevant documents."

For these reasons, frequently used terms or stopwords are removed from text streams. However, removing stopwords from one document at a time is time consuming. A cost-effective approach consists in removing all terms which appear commonly in the document collection, and which will not improve retrieval of relevant material.

This can be accomplished with a stopword library --a stop-list of terms to be removed. These lists can be either generic (applied to all collections) or specific (created for a given collection). Which occurrance threshold value defines which terms to be removed from the collection depends on individual implementations. For instance, in some IR systems terms that appear in more than 5% of a collection are removed. In others, terms that are not in the stop-list but appear in more than 50% of a collection are deemed as "negative terms" and are also removed to avoid weighting complications. This is the case of some MySQL-based systems that use probabilistic inverse document frequency (IDFP) weights. I explain this in the MySQL implementation of Term Weights article (5).

Stemming

Stemming refers to the process of reducing terms to their stems or root variant. Thus, "computer", "computing", "compute" is reduced to "comput" and "walks", "walking" and "walker" is reduced to "walk". Not all systems use the same type of stemmer. For English, the most popular stemmer is Martin Porter's Stemming Algorithm. As Professor C. J. (Keith) van Rijsbergen, a leading IR authority, mentioned to me by email this stemmer was developed as one of his research projects several decades ago.

Over the years, many have considered the advantages and disadvantages of using stemming. For instance, there is no doubt that stemming insures that documents containing variations of a given queried term are considered in the final answer set. Stemming also reduces the size of an inverted file. However, too much stemming is not practical and can annoy users. As IR expert Avi Rappoport has stated

"In many cases, the search engine can't show exact matches first, because the stemmed version is stored in the index. This may save disk space, but it annoys users no end. This is particularly a problem with verbs...It's easy enough to perform the stemming in the query and search for multiple versions of the word, rather than storing the stemmed version only." (6)

Weighting

Weighting is the final stage in most IR indexing applications. Terms are weighted according to a given weighting model which may include local weighting, global weighting or both. If local weights are used, then term weights are normally expressed as term frequencies, tf. If global weights are used, the weight of a term is given by IDF values. The most common (and basic) weighting scheme is one in which local and global weights are used (weight of a term = tf*IDF). This is commonly referred to as tf*IDF weighting. To learn more about weighting models, please read the Term Vector Theory and Keyword Weights series (7).

The series of steps involving indexing is illustrated in the following figure, which is a modification of Ruthven's and Lalmas's figure.

 Indexing Processes

Figure 1. Document linearization consists of markup removal (a) and tokenization (b).
Tokenization is followed by stopwords filtration (c), stemming (d) and weighting (e).
Document indexing involves (a) through (e).

Tutorial Review

Let's review what you have learned. Please answer the following questions.

  1. What is document linearization and why is so important?
  2. What is stemming? Provide examples.
  3. What is the difference between tokenization, filtration and stemming?
  4. Why documents must be transformed into representation of documents?
  5. My Challenge To You: Using Figure 1, plot term frequency (x-axis) vs weight (y-axis). Explain the overall shape of the resultant curve. What information can be extracted from this graph, if any? Show your work. Feel free to email me your response for evaluation/grading. No attachments, please.

    Hint: Assume that term weights are defined as percents and as w = (tf*idf)*100

I hope you have enjoyed this first tutorial.

Acknowledgements

The author thanks many IR colleagues for insightful discussions on several conventions used during indexing.

Next: Cosine Similarity and Term Weight Tutorial

References
  1. Search is HOT for U.S. Hispanics SeoRoundTable
  2. KW Density Not Used as FarBack as 1997 SeoChat
  3. A survey on the use of relevance feedback for information access systems; Ian Ruthven and Mounia Lalmas; Knowledge Engineering Review, 18(1):2003.
  4. The Keyword Density of Non-Sense E. Garcia
  5. Implementation and Application of Term Weights in a MySQL Environment E. Garcia
  6. KeyWord Stemming & Word Forms; Search Engine Watch Forums
  7. Term Vector Theory and Keyword Weights E. Garcia

Thank you for using this site.
Status of the Current Document 
W3C CSS Validation  W3C XHTML Validation
Copyright © 2006 Mi Islita.com -