Artificial Intelligence FAQs
Artificial Intelligence Applied to Information Retrieval: Frequently Asked Questions (FAQs) on Artificial Intelligence Applied to Natural Language Processing, Search Agents, Parsers and Crawlers
Dr. E. Garcia
Mi Islita.com
Email | Last Update: 01/21/06
Topics
About Artificial Intelligence and IR
Suggested Topics and Resources
Conventions and Assumptions
Submitted AI and IR Questions
References
About Artificial Intelligence and IR
Artificial intelligence is a broad scientific field. Convince yourself by searching for artificial intelligence in Google or other search engines. This document focuses on artificial intelligence techniques, methods and projects applied to information retrieval and information extraction.
Suggested Topics and Resources
Since relevance is everything, I'm especially interested on issues relevant to AI agents applied to information retrieval and search engines. Please submit research questions that target or address these subjects, only. Some suggested topics are:
- machine learning research
- natural language processing (NLP)
- latent semantic indexing (LSI)
- AI agents, crawlers, and parsers
- search engines ranking algorithms
- orthogonal & non orthogonal term vectors
- business intelligence and marketing
- integrated optimization and analysis
- keyword services optimization
- L-Systems grammar, fractals and spectral analysis
Talking about the science of fractals in AI and IR, I am currently compiling a resource page on the subject. If you come across articles, thesis, course material or call for papers on fractals applied to IR let me know. If the information is on-topic, I will place a link to the resource and to your site in the Fractals in Information Retrieval section. To discourage adversarial linkage (link spammers), I only accept resources from .edu or .gov levels. However, your domain does not need to be from these levels.
You might also want to research the following AI resouces
Conventions and Assumptions
- Answers have been broken down into steps for clarity.
- Readers understand pattern matching of regular expressions.
- Readers are familiar with AI, search engines and programming.
- Readers understand advanced concepts.
Submit a question.
Submitted AI and IR Questions
- On the Web, how do you emulate a parser that calls a file from another document by accessing a previous one?
This is fairly easy to do with regular expressions. A trivial example that emulates what you describe is given in Mark Newhouse's Daemon Skins article (http://www.alistapart.com/d/daemonskin/sidebar.html). I've changed a bit the naming convention and removed some white space. You don't need to do these changes. Just use the script as suggested by Newhouse. While the example merely parses and writes out a file extension to a document and no data from the file is actually parsed, the example illustrates the power of regular expressions and, to some extent, emulates what a parser can do. It also illustrates the passing of information between pages without using cookies or server-side muscle. Do this, FTP different CSS files (style-1.css, style-2.css, style-3.css, etc) to a directory, dir. These can be yours or, if allowed, from third-party sites. Create and FTP to same directory, dir, two html docs, A and B. Be sure you link to B from A using the following urls: http://www.yourdomain/dir/B.html?style-1.css http://www.yourdomain/dir/B.html?style-2.css http://www.yourdomain/dir/B.html?style-3.css Be sure you have placed in the head section of B the following script: var parentURL=""+document.location; var parSharpIndex=parentURL.indexOf("?"); if(parSharpIndex >=0) { anchorString=parentURL.substring(parSharpIndex,parentURL.length); var MyRe=/\?(\w*)\.css/g; var MyArray=MyRe.exec(anchorString); var anchor=RegExp.$1; } else{anchor='style-1';} document.write('<link rel="stylesheet" href="'+anchor+'.css" type="text\/css">'); You can also place this script in an external .js file and call it from the head section of B. When B is accessed from A by clicking one of the above links, the script parses the document location, "?", any preceeding string, and dynamically writes to B the link tag including the corresponding css file name. If nothing is parsed, the script writes out the default css file name as specified in the "else{anchor='_ _ _ _'}" line. In the example this is 'style-1'. Note: If you change the name or location of any of these files, be sure to change the url paths as well. Possible modifications: (a) Rewrite script using a browser fork or date objects, so css files can be appended according to a browser or a date/time stamp. This is a great way of checking the dress of a given page using different style instructions, browsers, operating systems or time zones. (b) Point to B from different documents or directories of your site and declaring different css styles. Associate a counter or a cookie to each css file. This is another way of tracking without using referer location. - How do you define and measure similarity?
Similarity is a measure of the association or relatedness between objects characterised by discrete-state attributes. Similarity scores quantify the likeness between objects assuming these can be grouped into clusters with similar attributes (C.J. van Rijsbergen). Coefficients of association described in the literature: Dice Jaccard (Tanimoto) Cosine Overlap Salton's Index is a particular case of the Cosine Coefficient. Jaccard's Coefficients are known as Tanimoto's Coefficients in computational chemistry and are used as molecular similarity estimates.
- What's the difference between Jaccard's Coefficient and
the C-index metric, if any?
This is a question I have answered so many times. Jaccard's Coefficients can be extracted from co-occurrence matrices, are applied to pairwise co-occurrence, and are not affected by permutation. C-indices can be explained in terms of Probability and Fuzzy Set Theory, can be applied to pairwise and multiple o-occurrences, and can be affected by permutation. For instance, for three terms, we can compute four c-indices; c12, c13, c23, c123 and their corresponding permutations; c21, c31, c32, c132, c213, c231, c312, c321 each one having their own EXACT-to-FINDALL ratios (EF-Ratios). For the particular case of pairwise co-occurrence in which permutation does not matter; i.e, c12 = c21 and the c-indices reduce to a unique Jaccard's Coefficient. For additional information see the series on
Keywords Co-Occurrence and Semantic Connectivity (1). - What are the main limitations of Term Vector Models?
The assumption that terms are unrelated (independent) from each other. This is not true since we know that terms are related by Polysemy = a given term can have different meanings in different contexts Synonymity = different terms can be related by a common concept. Computation = Upgrades to the index of terms (vocabulary) require recalculation of the term space and all term weights. This is computationally expensive. For additional information see the series on
Term Vector Theory and Keyword Weights (2). - How do you define a parser? Are parsers web crawlers?
A parser is an algorithm that determines the syntactic structure of a piece of information (e.g., sentences, words, characters). Regular expressions are used for this purpose. Unlike a parser, a web crawler is an algorithm that finds and extracts information from specific elements (links, anchors, titles, etc) while moving across the Web. In non technical terms, a crawler discover resources, while a parser determines strings from such resources. These algorithms are often combined. A crawler can be launched to starts at given URL and domain, download the document and parse out the strings and URLs to create a specific repository for a web search engine. Each crawled URL receives a DocID which is then mapped to entries in the repository. After finishing each document parsing, the next URL in the to-do list is taken out and crawled.
- What's the difference between search engines and "search tools"?
A true search engine, like Google, is a hardware/software architecture. The database indexing, ranking and purging algorithms of true engines are as automated as possible. The architecture part is designed to distribute huge repositories of information, collected while the crawlers traverse subgraphs of the Web. True search engines need the processing, storage, connectivity and redundancy power of computing centers. Such centers require diverse resources and manpower management. (It has been claimed that Google has more than 10 thousand linux servers). From the marketing perspective, it is comprehensible that some sites sell themselves as "search engines". Still, don't be fooled by fake imitations. A mere client- or server-side script is just that, a script. Many so-called search engines, are indeed "search engines in a box" or "portal in a box" programs. Some are mere oversized site search tools. Some are client-side "database" files (easy to see); others are server-side databases (easy to see and exhaust, as well), or are client/server-side combos that use the host c:\Inetpub directory folder. Even others are nothing but a mere link that connects to external servers where the service is hosted.
- What's the difference between proximity, phonic, stemming, numeric, fuzzy, concept, wildcard and macro searches?
Proximity searching finds a term or phrase within "n" terms. Often, the proximity operator is "w/n" with n = 10 by default. n acts as a text range delimiter. Usually queries are as in "apple pie w/27 green beans" Phonic searching finds term(s) that sound alike. A search for "Smith" may return "Smythe". A search for "ice cream" may return "I scream". Stemming finds terms with a common root. A search for "computer" may return "computers", "computerized". Numeric searching finds any number between two numbers. This is also called "numeric range searching". Fuzzy searching finds terms even if they are misspelled. With some search engines, the fuzziness level (FL) runs from 0 to 10 and can be user-defined at query time (not index built-in). A search for "abracadabra" with FL=1 may find "abracadadra" while with FL=2 may find "abracabadra" and "abracadadra". Concept searching uses a thesaurus. A search for "rapid" may return documents containing "quick", "fast" and the like. Wildcard searching uses letter(s) place holders. In some search engines a "?" holds a single letter place, while a "*" holds multiple letter places. Other search engines may define wildcard searching in dissimilar manners or use other wildcard operators. Macros include frequently used items in a search request.
- How do we know if a web spider (crawler) got trapped in a server?
This is unlikely to happen, unless the web crawler was poorly programmed or did not follow the Robots Exclusion Protocol. However, crawlers can endanger a server if they get trapped into a maze of dynamic urls. For instance, crawlers tend to avoid database-driven pages with dynamic URLs containing /cgi-bin, cgi or bin, as well as CGI escape characters like #, %, ^, &, ?, =, etc. Such databases usually lead to resources with recursive or cross-pollinated links that can easily trap crawlers in a web of urls. Thus, this poses a threat to the crawler which, buried under layers of urls, tries to find its way out by consuming server requests. The result: a server down in a couple of minutes. Because of this, web crawlers tend to avoid such environments. Therefore, if you have a database-driven site or an all-dynamic flash-based site, chances are that not all of your pages will be indexed by a crawler-based search engine like Google (i.e., unless you have a strategic business alliance with the search engine). From the architectural standpoint, this make sense. Search engines are databases, often limited by their own infrastructures. Thus, they are in the business of indexing web pages, not other databases. Still, if you need to trap a crawler without bringing down your server, some workarounds are available.
- How do you compute a Dale-Chall Readability Index score (RI)?
This is not that easy. Try this Select a text sample (T) of 100-150 words from an intermediate or advanced level text (grades 5 to college) and use the Dale-Chall Readability Index equation RI = 0.0496*L + 0.1579*%DW + 3.6365 where L = average sentence length and DW = percent difficult words. L = w/s = Number of words in T/Number of sentences in T. dw = number of words in T not found on a Dale Word Library. %DW = (dw/w)*100 Note. Here we face an interesting artificial intelligence problem due to language ambiguities. This treatment assumes that sentences are in English and ending with either a period, exclamation or question mark. One must consider when these characters are part of a term (as in Yahoo!), if sentences end in other characters, if sentences are not delimited by a space, or if the text sample is not in English. To assess RI values, you must know L and have access to (a) a Dale Word List library; i.e., to calculate DW. (b) a Dale-Chall Raw Score-to-Grade Conversion Table; i.e., to interpret results. A subroutine that includes a DW library and language ambiguity instances should minimize these problems. To be on the safe side, L should be taken for an estimated value. For exact results, count manually all quantities and input them in a form script that uses the Dale-Chall Readability Index equation.
- How could I discriminate between proximity, fuzzy logic and boolean searches?
Proximity searches deals with searching for words that appear within certain distance, d. From the search engine side, the default value usually is ten (d = 10). From the user side, d can be modified with so-called advanced, expanded or power search features. Some engines use the w/n operator where n is a user-defined distance. Fuzzy logic searches involve the finding of words with similar spellings to a query. Boolean searches requires the use of boolean operators such as AND, OR, NOT. Boolean searches can be replaced with so-called arithmetic searches by querying the search engine with the "+", "-" and quotation (", ') signs. - Could you please describe a Brute Force Search Engine?
These are search engines that use the so-called Brute Force (BF) pattern matching algorithm. The simplest way to explain BF is with an example. Consider a piece of text (T) consisting only of the word "abracabracadabra" and let say a user is searching for "abracad" (the query pattern, P). In the BF algorithm, a window (W) of length L is placed on T and shifted forward. Assume L = 1, so the window is moved forward by one string unit. Starting at position 1 or the first "a" of the text, we check that all strings in P and T match. If not, we shift W by one unit and check again that all strings in P and T match. We keep shifting W until there is a match between all strings in P and T. This is equivalent to moving P over T, one string unit at-a-time until a match is found. It's a quite primitive algorithm, from here its name. From the processing standpoint, this algorithm is not efficient for finding suffixes or words with similar endings. Another disadvantage is that it can return matches for blocks of substrings from keywords, not necessarily semantically connected. Because of this, alternatives have been proposed. Many algorithms use a modification of this scheme, mainly differing on the way they check for matches and shift the window. Some variations shift the window backward in search of a given suffix (Boyer-Moore family, BM). Others shift the window forward but searching for prefixes (Knuth-Morris-Pratt, KMP). A variation of KMP is the Aho-Corasick (AC) algorithm. Here a prefix is reflected on several portions of the text (i.e. document) to be searched. Such reflexions are also called transitions, (some algorithms use the term "reflexion" in different contexts) in which the patterns are arranged in a tree-like data structure. Challenge: Arrange the following words: "hello", "elbow", "eleven", "even" in a tree-like data structure and find all possible failure transitions. Measure the semantic connectivity between paired terms.
- How do you calculate recall (R) and precision (P) metrics of IR systems?
R is the fraction of relevant documents which has been retrieved. P is the fraction of retrieved documents which is relevant.
- How do you calculate Salton's term weights?
This is described in
Term Vector Theory and Keyword Weights (2)

