Home - Contacts - Terms -

Mi Islita

Artificial Intelligence FAQs

Artificial Intelligence Applied to Information Retrieval: Frequently Asked Questions (FAQs) on Artificial Intelligence Applied to Natural Language Processing, Search Agents, Parsers and Crawlers

Dr. E. Garcia
Mi Islita.com
Email | Last Update: 01/21/06

Topics

About Artificial Intelligence and IR

Suggested Topics and Resources

Conventions and Assumptions

Submitted AI and IR Questions

References

About Artificial Intelligence and IR

Artificial intelligence is a broad scientific field. Convince yourself by searching for artificial intelligence in Google or other search engines. This document focuses on artificial intelligence techniques, methods and projects applied to information retrieval and information extraction.

Suggested Topics and Resources

Since relevance is everything, I'm especially interested on issues relevant to AI agents applied to information retrieval and search engines. Please submit research questions that target or address these subjects, only. Some suggested topics are:

Talking about the science of fractals in AI and IR, I am currently compiling a resource page on the subject. If you come across articles, thesis, course material or call for papers on fractals applied to IR let me know. If the information is on-topic, I will place a link to the resource and to your site in the Fractals in Information Retrieval section. To discourage adversarial linkage (link spammers), I only accept resources from .edu or .gov levels. However, your domain does not need to be from these levels.

You might also want to research the following AI resouces

Conventions and Assumptions

  1. Answers have been broken down into steps for clarity.
  2. Readers understand pattern matching of regular expressions.
  3. Readers are familiar with AI, search engines and programming.
  4. Readers understand advanced concepts.

Submit a question.

Submitted AI and IR Questions

  1. On the Web, how do you emulate a parser that calls a file from another document by accessing a previous one?
    This is fairly easy to do with regular expressions. A trivial example that 
    emulates what you describe is given in Mark Newhouse's Daemon Skins article 
    (http://www.alistapart.com/d/daemonskin/sidebar.html). I've changed a bit the 
    naming convention and removed some white space. You don't need to do these 
    changes. Just use the script as suggested by Newhouse.
    
    While the example merely parses and writes out a file extension to a document 
    and no data from the file is actually parsed, the example illustrates the 
    power of regular expressions and, to some extent, emulates what a parser 
    can do. It also illustrates the passing of information between pages without 
    using cookies or server-side muscle. 
    
    Do this,
    
    FTP different CSS files (style-1.css, style-2.css, style-3.css, etc) to a 
    directory, dir. These can be yours or, if allowed, from third-party sites.
    
    Create and FTP to same directory, dir, two html docs, A and B. 
    Be sure you link to B from A using the following urls:
    
    http://www.yourdomain/dir/B.html?style-1.css
    http://www.yourdomain/dir/B.html?style-2.css
    http://www.yourdomain/dir/B.html?style-3.css
    
    Be sure you have placed in the head section of B the following script:
    
    var parentURL=""+document.location;
    var parSharpIndex=parentURL.indexOf("?");
    if(parSharpIndex >=0)
    {
    anchorString=parentURL.substring(parSharpIndex,parentURL.length);
    var MyRe=/\?(\w*)\.css/g;
    var MyArray=MyRe.exec(anchorString);
    var anchor=RegExp.$1;
    }
    else{anchor='style-1';}
    
    document.write('<link rel="stylesheet" href="'+anchor+'.css" type="text\/css">');
    
    You can also place this script in an external .js file 
    and call it from the head section of B. 
    
    When B is accessed from A by clicking one of the above links, the script 
    parses the document location, "?", any preceeding string, and dynamically 
    writes to B the link tag including the corresponding css file name. 
    If nothing is parsed, the script writes out the default css file name as 
    specified in the "else{anchor='_ _ _ _'}" line. In the example this is 'style-1'. 
    
    
    Note: If you change the name or location of any of these files, 
    be sure to change the url paths as well.
    
    Possible modifications: 
    
    (a) Rewrite script using a browser fork or date objects, so css files 
    can be appended according to a browser or a date/time stamp. This is 
    a great way of checking the dress of a given page using different 
    style instructions, browsers, operating systems or time zones.
    
    (b) Point to B from different documents or directories of your site and 
    declaring different css styles. Associate a counter or a cookie to each 
    css file. This is another way of tracking without using referer location.
    
    
  2. How do you define and measure similarity?
    Similarity is a measure of the association or relatedness between 
    objects characterised by discrete-state attributes. Similarity scores 
    quantify the likeness between objects assuming these can be grouped 
    into clusters with similar attributes (C.J. van Rijsbergen).
    
    Coefficients of association described in the literature:
    
    Dice
    Jaccard (Tanimoto)
    Cosine
    Overlap
    
    Salton's Index is a particular case of the Cosine Coefficient.
    Jaccard's Coefficients are known as Tanimoto's Coefficients in 
    computational chemistry and are used as molecular similarity estimates.
    
  3. What's the difference between Jaccard's Coefficient and the C-index metric, if any?
    This is a question I have answered so many times.
    
    Jaccard's Coefficients can be extracted from co-occurrence matrices, are 
    applied to pairwise co-occurrence, and are not affected by permutation.
    
    C-indices can be explained in terms of Probability and 
    Fuzzy Set Theory, can be applied to pairwise and 
    multiple o-occurrences, and can be affected by permutation. 
    
    For instance, for three terms, we can compute four c-indices;
    
    c12, c13, c23, c123
    
    and their corresponding permutations;
    
    c21, c31, c32, c132, c213, c231, c312, c321
    
    each one having their own EXACT-to-FINDALL ratios (EF-Ratios).
    
    For the particular case of pairwise co-occurrence in which 
    permutation does not matter; i.e, 
    
    c12 = c21
    
    and the c-indices reduce to a unique Jaccard's Coefficient.
    
    For additional information see the series on  
    
    Keywords Co-Occurrence and Semantic Connectivity (1).

  4. What are the main limitations of Term Vector Models?
    The assumption that terms are unrelated (independent) from each other.
    This is not true since we know that terms are related by
    
    Polysemy = a given term can have different meanings in different contexts
    
    Synonymity = different terms can be related by a common concept.
    
    Computation = Upgrades to the index of terms (vocabulary) require recalculation of 
    the term space and all term weights. This is computationally expensive.
    
    For additional information see the series on  
    
    Term Vector Theory and Keyword Weights (2).

  5. How do you define a parser? Are parsers web crawlers?
    A parser is an algorithm that determines the syntactic 
    structure of a piece of information (e.g., sentences, words, characters).  
    Regular expressions are used for this purpose.
    
    Unlike a parser, a web crawler is an algorithm that finds and extracts 
    information from specific elements (links, anchors, titles, etc) 
    while moving across the Web. In non technical terms, a crawler 
    discover resources, while a parser determines strings from such resources. 
    
    These algorithms are often combined. A crawler can be launched to starts at given URL 
    and domain, download the document and parse out the strings and URLs to create a 
    specific repository for a web search engine. Each crawled URL receives a DocID 
    which is then mapped to entries in the repository. After finishing each document 
    parsing, the next URL in the to-do list is taken out and crawled.
    
  6. What's the difference between search engines and "search tools"?
    A true search engine, like Google, is a hardware/software 
    architecture. The database indexing, ranking and purging 
    algorithms of true engines are as automated as possible. 
    The architecture part is designed to distribute huge 
    repositories of information, collected while the crawlers 
    traverse subgraphs of the Web.
    
    True search engines need the processing, storage, 
    connectivity and redundancy power of computing centers. 
    Such centers require diverse resources and manpower 
    management. (It has been claimed that Google has more 
    than 10 thousand linux servers). 
    
    From the marketing perspective, it is 
    comprehensible that some sites sell themselves as 
    "search engines". Still, don't be fooled by fake 
    imitations. A mere client- or server-side script 
    is just that, a script.  
    
    Many so-called search engines, are indeed "search engines 
    in a box" or "portal in a box" programs. 
    Some are mere oversized site search tools. Some are 
    client-side "database" files (easy to see); others are server-side 
    databases (easy to see and exhaust, as well), or are 
    client/server-side combos that use the host c:\Inetpub directory 
    folder. Even others are nothing but a mere link that connects to 
    external servers where the service is hosted.  
    
  7. What's the difference between proximity, phonic, stemming, numeric, fuzzy, concept, wildcard and macro searches?
    Proximity searching finds a term or phrase within "n" terms.
    Often, the proximity operator is "w/n" with n = 10 by default.
    n acts as a text range delimiter. Usually queries are as in
    "apple pie w/27 green beans"
    
    Phonic searching finds term(s) that sound alike. A search
    for "Smith" may return "Smythe". A search for "ice cream"
    may return "I scream".
    
    Stemming finds terms with a common root. A search for 
    "computer" may return "computers", "computerized".
    
    Numeric searching finds any number between two numbers.
    This is also called "numeric range searching".
    
    Fuzzy searching finds terms even if they are misspelled. With
    some search engines, the fuzziness level (FL) runs from 0 to 10
    and can be user-defined at query time (not index built-in).
    A search for "abracadabra" with FL=1 may find "abracadadra" 
    while with FL=2 may find "abracabadra" and "abracadadra".
    
    Concept searching uses a thesaurus. A search for "rapid"
    may return documents containing "quick", "fast" and the like.
    
    Wildcard searching uses letter(s) place holders. In some search
    engines a "?" holds a single letter place, while a "*" holds 
    multiple letter places. Other search engines may define wildcard
    searching in dissimilar manners or use other wildcard operators.
    
    Macros include frequently used items in a search request.
    
  8. How do we know if a web spider (crawler) got trapped in a server?
    This is unlikely to happen, unless the web crawler was poorly programmed 
    or did not follow the Robots Exclusion Protocol.  However, crawlers 
    can endanger a server if they get trapped into a maze of dynamic urls. 
    
    For instance, crawlers tend to avoid database-driven pages with dynamic URLs 
    containing /cgi-bin, cgi or bin, as well as CGI escape characters like 
    #, %, ^, &, ?, =, etc. Such databases usually lead to resources with recursive 
    or cross-pollinated links that can easily trap crawlers in a web of urls. Thus, 
    this poses a threat to the crawler which, buried under layers of urls, 
    tries to find its way out by consuming server requests. The result: a server 
    down in a couple of minutes. Because of this, web crawlers tend to avoid such 
    environments. Therefore, if you have a database-driven site or an all-dynamic 
    flash-based site, chances are that not all of your pages will be indexed by a 
    crawler-based search engine like Google (i.e., unless you have a strategic 
    business alliance with the search engine). 
    
    From the architectural standpoint, this make sense. Search engines 
    are databases, often limited by their own infrastructures. Thus, they are in the 
    business of indexing web pages, not other databases.  Still, if you need to trap 
    a crawler without bringing down your server, some workarounds are available.
  9. How do you compute a Dale-Chall Readability Index score (RI)?
    This is not that easy. Try this
    
    Select a text sample (T) of 100-150 words from an intermediate 
    or advanced level text (grades 5 to college) and use the
    Dale-Chall Readability Index equation
    
    RI = 0.0496*L + 0.1579*%DW + 3.6365
    
    where
    
    L = average sentence length and DW = percent difficult words.
    L = w/s = Number of words in T/Number of sentences in T.
    dw = number of words in T not found on a Dale Word Library.
    %DW = (dw/w)*100
    
    Note. Here we face an interesting artificial intelligence
    problem due to language ambiguities. This treatment 
    assumes that sentences are in English and ending 
    with either a period, exclamation or question mark. 
    One must consider when these characters are part of 
    a term (as in Yahoo!), if sentences end in other 
    characters, if sentences are not delimited by a space,
    or if the text sample is not in English.
    
    To assess RI values, you must know L and have access to
    (a) a Dale Word List library; i.e., to calculate DW. 
    (b) a Dale-Chall Raw Score-to-Grade Conversion Table; 
    i.e., to interpret results.
    
    A subroutine that includes a DW library and language
    ambiguity instances should minimize these problems. 
    To be on the safe side, L should be taken for an 
    estimated value. For exact results, count manually all 
    quantities and input them in a form script that uses the 
    Dale-Chall Readability Index equation.
    
  10. How could I discriminate between proximity, fuzzy logic and boolean searches?
    Proximity searches deals with searching for words that appear within
    certain distance, d. From the search engine side, the default 
    value usually is ten (d = 10). From the user side, d can be modified 
    with so-called advanced, expanded or power search features. Some
    engines use the w/n operator where n is a user-defined distance. 
    Fuzzy logic searches involve the finding of words with similar 
    spellings to a query. Boolean searches requires the use of boolean 
    operators such as AND, OR, NOT. Boolean searches can be 
    replaced with so-called arithmetic searches by querying the
    search engine with the  "+", "-" and quotation (", ') signs.
  11. Could you please describe a Brute Force Search Engine?
    These are search engines that use the so-called Brute Force (BF)
    pattern matching algorithm. The simplest way to explain BF is with
    an example. Consider a piece of text (T) consisting only of the word
    "abracabracadabra" and let say a user is searching for "abracad" 
    (the query pattern, P). In the BF algorithm, a window
    (W) of length L is placed on T and shifted forward. Assume L = 1, 
    so the window is moved forward by one string unit. Starting at 
    position 1 or the first "a" of the text, we check that all strings 
    in P and T match. If not, we shift W by one unit and check again 
    that all strings in P and T match. We keep shifting W until there is 
    a match between all strings in P and T. This is equivalent to 
    moving P over T, one string unit at-a-time until a match is found. 
    It's a quite primitive algorithm, from here its name. From the 
    processing standpoint, this algorithm is not efficient for finding 
    suffixes or words with similar endings. Another disadvantage is 
    that it can return matches for blocks of substrings from keywords, 
    not necessarily semantically connected. Because of this, 
    alternatives have been proposed.
    
    Many algorithms use a modification of this scheme, mainly differing 
    on the way they check for matches and shift the window. Some 
    variations shift the window backward in search of a given 
    suffix (Boyer-Moore family, BM). Others shift the window 
    forward but searching for prefixes (Knuth-Morris-Pratt, KMP). A 
    variation of KMP is the Aho-Corasick (AC) algorithm. Here a prefix 
    is reflected on several portions of the text (i.e. document) to be 
    searched. Such reflexions are also called transitions, (some 
    algorithms use the term "reflexion" in different contexts) in 
    which the patterns are arranged in a tree-like data structure. 
    
    Challenge: Arrange the following words: "hello", "elbow", "eleven", 
    "even" in a tree-like data structure and find all possible failure 
    transitions. Measure the semantic connectivity between paired terms.
    
  12. How do you calculate recall (R) and precision (P) metrics of IR systems?
    R is the fraction of relevant documents which has been retrieved.
    P is the fraction of retrieved documents which is relevant.
    
  13. How do you calculate Salton's term weights?
    This is described in  
    
    Term Vector Theory and Keyword Weights (2)

References
  1. Keywords Co-Occurrence and Semantic Connectivity
  2. Term Vector Theory and Keyword Weights

Status of the Current Document 
W3C CSS Validation  W3C XHTML Validation
Copyright © 2006 Mi Islita.com -