The Fractal Nature of Semantics
"The best known fractals are the Koch Curve, the Sierpinski Triangle, the Lorenz Attractor, and the Bifurcation Diagram from Chaos Theory. Others well-known fractals are the Mandelbrot, Cantor and Julia Sets. Less obvious fractals are found in relevancy scores, semantics, and link graphs."
Dr. E. Garcia
Mi Islita.com
Email | Last Update: 09/24/05
Article 1 of the series The Fractal Nature of Semantics
Copy Guidelines for students, universities, and public interested in reproducing Mi Islita.com material.
Topics
Fractals Revisited
Defining Fractals
Fractal Spam?
Afterthoughts - SERPs
Precision-Recall Experiments
Additional Experiments, E-Measures
References
Fractals Revisited
Those familiar with the history of fractal geometry know that the term "fractal" was derived by Benoit B. Mandelbrot from the Latin adjective "fractus". As noted by Mandelbrot, "fractus" indicates "irregular", and the corresponding verb "frangere" means "to break"; i.e., to create irregular fragments (1). Thus, fractals are irregular, rough, and fragmented shapes.
Relegated as geometric anomalies and "monster curves", fractals are scaled-down versions of shapes in which each part resembles the whole thing. Think of these as patterns within patterns. Figure 1 shows a gallery of common fractals, reproduced with a fractal generator software.

Figure 1. A Fractal Art Gallery.
From left to right, the first image corresponds to the well-known Helge von Koch Curve. Next we have a picture of a "wallpaper" pattern known as the Sierpinski Triangle. This is followed by a pattern that resembles a butterfly: the Lorenz Attractor. Discovered by Professor Edward Lorenz back in the early 60's, the attractor has been used to understand non linear dynamical systems and chaos. Next is the Bifurcation Diagram extracted from the Logistic Equation. Finally, we have two famous sets: the Mandelbrot and Cantor Sets. Not shown in the figure is the family of Julia Sets, which can be obtained from the Mandelbrot Set.
Less obvious fractals can be found in relevance scores and link graphs. After all, there is a short distance between Zipf's, Heap's, and Mandelbrot's Laws with Shannon Information Theory and Salton's Vector Space Model. Even the link structure of the Web describes a fractal architecture (1 - 5). That these patterns are everywhere is evident considering that expressions such as "fractal screensaver", "fractal music", and "fractal dimensions" are now part of mainstream. And how about art techniques from the past?
Recently, The Guardian published Maurizio Seracini's epic work on The Adoration of the Magi, painted by Leonardo Da Vinci. Seracini found that what lies below the surface of the Adoration of the Magi are images within images, each one with its own telling and story. The hidden images are well discernible through different length scales of observation using infrared refractography. So far, around 2,400 images have been discovered. It is expected the discovery to fuel new conspiracy theories and fans of Dan Brown's book The Da Vinci Code (6).
All this tells us that these patterns, man made or found in Nature, are not an isolated curiosity. Indeed, patterns having a scaling, power law component known as self-similarity are very common.
Defining Fractals
Mandelbrot tentatively defined a fractal as a set for which the Hausdorff Besicovitch dimension strictly exceeds the object topological dimension, Dh > Dt. Thus, 1-, 2- or 3-dimensional fractal objects posses fractal dimensions greater than 1, 2, or 3.
The term "fractal" can be used as a noun or adjective. A consensus on how to define the term is still lacking. After tentatively defining the term, Mandelbrot eventually pointed out that one could do better without definitions (1). Apparently he prefers the following description:
A fractal is shape made of parts similar to the whole in some way" (2).
In the absence of a better characterization, this description is used here as a practical working definition for fractal subjects.
An online introduction on fractals is given by Bourke. Fractal subjects (objects or time-dependent processes) are characterized by a power law of the general form
F(L) = k(1/L)D
Where F(L) is an arbitrary function describing a property of the object or time-based process, k is a proportionality constant, L is a relative length scale factor and D is its fractal dimension, which often is a fractional value. If a power law behavior describing self-similarity does exists then D can be evaluated from the slope of a double-logarithmic graph of F(L) versus (1/L).
Fractal Spam?
Fractals appear in all fields of Science, Math and Life (3). I like to refer grad students to three practical models that consumed the 1988-1995 window of my life
- diffussion-limited aggregation.
- electrochemical deposition of ramified patterns.
- predator-prey population models based on non linear dynamics (chaos theory) and fractals.
But what about the Web and search engine behaviors? What about the notion of relevancy and apparently disordered link structures? As someone interested in applied fractal geometry in very dissimilar fields I often find myself dealing with these and similar questions and trying to convince colleage scientists on the fractal nature of everything.
Then one day it occurred me that scaling behaviors could be studied in a controlled environment by just looking at search engine result pages (SERPs). Checking for the effect of relevancy at different length scale of observations in a given document or database is indeed an intriguing and appealing experiment.
Rather than getting into theoretical details, I prefer to illustrate one of many experiment one could perform:
- Query an experimental IR system, copy and paste a SERP from the top 10 results (L=10) into a new document. Now the document is a scaled-down version of the SERPs.
- Reindex the document into an IR system and check how it ranks. Check other relevancy scores, check a recall-precision curve, etc.
- Do this for different L values.
- Plot 1/L vs some suitable parameter (I'm not going to tell you which one).
- Repeat experiment but this time optimizing the documents to make them search engine-friendly.
- Do two more experiment, this time with commercial search engines. Do one experiment using the same engine and the other submitting the scaled-down version to a different search engine.
- Use a matrix and test other combination of conditions. Compare results.
Afterthoughts - SERPs
While making travel arrangements for the next weeks, I was thinking about this and related research I currently conduct on the fractal nature of relevancy, Web nodes and IR systems. Then it occurred to me there should be an iterative procedure I could study, strictly in connection with relevancy induction or, let say for spamming a search engine.
Then I decided to "command search" Google for the top 100 results and for several topics I have discussed with search engine optimization and marketing specialists (SEOs/SEMs). I came across some SERPs consisting of records (titles, descriptions, urls, etc.) apparently relevant to my query. Clicking a record sent me to a page (no redirection involved) about a commercial product. The content of the document wasn't relevant to my query or the SERPs. How does this could be possible?
Here is the thing. At the top of the document I saw Google's AdWords, banner ads and other marketing material followed by a carbon copy of Google's top 10 search results for my intended query. But I found no traces of the record relevant to my query and displayed in Google's SERPs. Confused? Stay with me, please.
Then I checked the cache version and realized that the old copy contained similar marketing material as before and a carbon copy of Google's SERPs, but this time with an entry containing a portion of one of my posts at SEWF, which was the text shown in the original query results page. WOW! I can spam Google with its own output as input.
I have many examples of this. Then I decided to start a thread on this at the Search Engine Watch Forums. Since SEWF has a "no spam report" policy, I'll not provide names and hope if you post at SEWF, please keep in mind that outing someone is out. I will not add anything else except this.
This is not a novelty and has been occurring for a while with static and dynamic pages and with single documents or small database-driven documents. It is not a mere case of affiliated search results embedded into documents. Are these probably cases falling through the cracks? I don't know. Still, nice recursive way of spamming a system.
Recursively spamming IR systems or search engines with their own top N search results, entries, links, titles, urls, and descriptive texts that are already relevant to the initial query.... Hum. Figure 2 shows some scenarios/tests I could examine under controlled conditions if I have access to a lab IR system.

Figure 2. SERP-based Spam Experiments
The goal of the experiments described above is not exactly to spam a system. Rather we are interested in finding out how an iterative approach with resized copies of SERPs could be used to test the scoring performance of the target IR system so we can
- make it robust and better.
- understand its recall/precision behavior.
- see how it scores SERPs from other IR systems or engines.
Such recursive approaches could be used for both exposing and improving retrieval.
Precision-Recall Experiments
According to Baeza-Yates and Ribeiro-Neto, (4)
- Recall = |Ra|/|R|, is the fraction of the relevant documents (the set R) which has been retrieved.
- Precision = |Ra|/|A|, is the fraction of retrieved documents (the set A) which is relevant.

Figure 3. Visualization of Recall and Precision
Thus, precision-recall figures (curves) for several queries can be constructed and averaged. The authors write
"Such averaged figures are normally used to compare the retrieval performance of distinct retrieval algorithms...One additional approach is to compute average precision at given document cutoff values. For instance, we can compute the average precision when 5, 10, 15, 20, 30, or 100 relevant documents have been seen...Average precision versus recall figures are now a standard evaluation strategy for information retrieval systems and are used extensively in the information retrieval literature. They are useful because they allow us to evaluate quantitatively both the quality of the overall answer set and the breadth of the retrieval algorithm. Further, they are simple, intuitive, and can be combined in a single curve. However, precision versus recall figures also have their disadvantages and their widespread usage has been criticized in the literature."
This passage inspired me to conduct several iterative experiments in which the SERPs from an IR system or search engine are taken for answer sets and tested as follows.
- Query a system and construct several precision-recall curves using a variable size consisting of the top 10, 20,...100,... N documents. Examine the effect of N on the precision-recall curves.
- Convert each of the SERPs into separate HTML documents so now each of these are resized copies of the SERPs.
- Treat the resultant group of documents as a test collection, reindex and query this.
- Construct an overall precision-recall curve for the test collection. Compare this curve with the "scaled-down" curves.
Note. To start, design experiments in such way that you end with a test collection of 10 documents. Then try with larger test collections.

Figure 4. Iterative SERP Experiment
Additional Experiments - E-Measures
- Check the retrieval performance of the ranking algorithm and how this ranks the documents.
- Pay attention to document lengths, term frequencies, average distances between terms, etc.
- Compare results with other retrieval algorithms.
- Conduct experiments using the Evaluation Measure, E.

Figure 5. E-Measure
where E can be computed for each document and b is a measure of the relative importance of precision (P) or recall (R) as specified by users. This term allows for personalization.
- For b > 1 the user is more interested in P than in R.
- For b < 1 the user is more interested in R than in P.
- For b = 1 E works as the complement of the harmonic measure, F.
Next: Motifs and Iterated Function Systems
References
- B. B. Mandelbrot, The Fractal Geometry of Nature, W. H. Freeman, New York (1983).
- J. Feder Fractals, Reference 2, page 11; Plenum Press, New York, (1988).
- H.-O. Peitgen, H. Jurgens, and D. Saupe, Fractals for the Classroom Part One and Two, Springer-Verlag, New York (1992).
- R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval Chapter 3, page 75; ACM Press (1999).
- C.J. van Rijsbergen Information Retrieval, Butterworths (1979).
- John Hopper Rome Art detective exposes hidden images to fuel Da Vinci Code conspiracies; The Guardian, 09-20-05 (2005).

