Home - Contacts - Terms -

Mi Islita

WPSS: Web Page Scoring Systems

WPSS: A General Framework for Web Page Scoring Systems. PageRank, Probability Theory, Connectivity Metrics and Web Traffic: Where is the Connection?

Dr. E. Garcia
Mi Islita.com
Email | December 7, 2002 | Last Update: December 11, 2002

Topics

Web Page Scoring Systems (WPSS)

The WPSS Framework

Web Surfer Behavior

Web Surfer Dynamics

Web Surfing Interactions

Conclusion

Acknowledgements

Feedback from Readers

References

Web Page Scoring Systems (WPSS)

At the 2002 Conference of the W3C (Conference WWW2002, May 7-11, 2002, Honolulu, Hawaii, USA.), Diligenti, Gori and Maggini presented a general framework for the definition of Web Page Scoring Systems (WPSS) for horizontal and vertical search engines.

According to Diligenti, et al., (1).

"The page rank of hyper-textual documents can be thought of as a function of the document content and the hyperlinks. In this paper, we propose a general framework for Web Page Scoring Systems (WPSS) which incorporates and extends many of the relevant models proposed in the literature. The general web page scoring model proposed in this paper extends both PageRank (2) and the HITS scheme (3). In addition, the proposed model exhibits a number of novel features, which turn out to be very useful especially for focused (vertical) search. The content of the pages is combined with the web graphical structure giving rise to scoring mechanisms which are focused on a specific topic. Moreover, in the proposed model, vertical search schemes can take into account the mutual relationship amongst different topics. In doing so, the discovery of pages with high score for a given topic affects the score of pages with related topics."

The WPSS Framework

The WPSS paper is a masterpiece of research work. The authors have been able to derive explicitly the PageRank and HITS scoring systems as special cases of a general framework. Conditional probability, page connectivity and web surfing arguments are elegantly combined and presented by the authors (1 - 6). Readers interested in this framework and its implications for horizontal and vertical searches are invited to read the Web Page Scoring Systems for Horizontal and Vertical Search (1).

The general framework for WPSSs can be thought of as a general framework for modeling

  1. web surfer behavior
  2. web surfer dynamics
  3. web surfing interactions

Web Surfer Behavior

According to Diligenti, et al., (1)

"Random walk theory has been widely used to compute the absolute relevance of a page in the Web (2, 6). The Web is represented as a graph G, where each Web page is a node and a link between two nodes represents a hyperlink between the associated pages.

A common assumption is that the relevance xp of page p is represented by the probability of ending up in that page during a walk on this graph.

In the general framework we propose, we consider a complete model of the behavior of a user surfing
the Web. We assume that a Web surfer can perform one out of four atomic actions at each step of his/her traversal of the Web graph:

The authors explain,

"We can model the user behavior by a set of probabilities which depend on the current page:

The proposed framework is more complete than the PageRank framework.

In the PageRank framework, a Web surfer can only perform a forward action by clicking on links from the current page or by jumping to another page,

"PageRank can be thought of as a model of user behavior. We assume there is a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back" but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank".

explain the creators of PageRank (4).

In the proposed general framework, this assumption is intuitively incorrect and unnecessary.

Web Surfer Dynamics

Dynamical Modeling (DM) has been defined as the art of modeling phenomena that change over time (5). Since a Web surfer can perform different actions while surfing the Web, horizontally or vertically, his/her behavior can be studied using DM. However, few has been written with regard to Web surfing dynamics in horizontal and vertical environments. Most of the page ranking models are static schemes for modeling horizontal searches.

The proposed framework recognizes the dynamical nature of a Web surfer.

Diligenti and coworkers explain,

"The model considers a temporal sequence of actions performed by the surfer and it can be used to compute the probability that the surfer is located in page p at time t, xp(t). The probability distribution on the pages of the Web is updated by taking into account the possible actions at time t + 1" (1).

This observation is very important. The authors are not describing a static scenario but a discrete dynamical system. Not only the location of a Web surfer at a time t is important, but which possible actions he/she performs at time t + 1.

Web Surfing Interactions

A Window to Deterministic Traffic. Many of the relevant models seem to ignore that the behavior of a Web surfer can be influenced by

  1. Surfer-surfer interactions;
    e.g., a web surfer is influenced by the actions or recommendations of another web surfer.
  2. Surfer-structure interactions;
    i.e., a web surfer may behave differently while searching horizontally (e.g., through a one-dimensional web ring), vertically (e.g., through a vertical portal) or while surfing through a prepatterned link structure.
  3. Structure-structure interactions;
    e.g., web surfing actions due to the presence of coupled or nested web-rings, directories within directories, patterned or self-similar link structures and the like.

Surfer-surfer interactions are well described in the proposed framework.

Diligenti et al. explain (1),

"In the probabilistic framework described so far, we can define a multivariable scheme by considering a pool of Web surfers each described by a single variable. Each surfer is characterized by his/her own way of browsing the Web modeled by using different parameter values in each state transition equation. By choosing proper values for these parameters we can choose different policies in evaluating the absolute importance of the pages. Moreover, the surfers may interact by accepting suggestions from each other."

"In order to model the activity of M different surfers, we use a set of state variables xq(i)(t) which represent the probability of each surfer i to be visiting page q at time t. The interaction among the surfers is modeled by a set of parameters which define the probability of the surfer k to accept the suggestion of the surfer i, thus jumping from the page he/she is visiting to the one visited by the surfer i. This interaction happens before the choice of the actions described previously. If we hypothesize that the interaction does not depend on the page the surfer k is currently visiting, the degree of interaction with the surfer i is modeled by the value b(i|k) which represents the probability for the surfer k of jumping to the page visited by the surfer i."

The authors explain (1),

"As an example, suppose that there are two surfers, "the novice" and "the expert". Surfer s1, the novice, blindly trusts the suggestions of surfer s2 as he/she believes s2 is an expert in discovering authoritative pages,
whereas s1 does not trust at all his/her own capabilities. In this case the complete dependence of the novice on the expert is modeled by choosing b(1|1) = 0 and b(2|1) = 1, i.e. the novice chooses the page visited by the expert with probability equal to 1.

Before taking any action, the surfer i repositions himself/herself in page p with probability vp(i)(t) looking at the suggestions of the other surfers."

The authors have elegantly captured the essence of deterministic Web traffic of the first kind: surfer-surfer interactions. This does not mean that the behavior of the Web surfer is one of complete determinism. It only means his/her behavior is not one of complete randomness (as assumed in the PageRank framework).

Armed with their general framework, the authors propose the following horizontal and vertical WPSSs

Horizontal WPSSs

  1. PageRank (with and without "sinks")
  2. HITS
  3. PageRank-HITS

Vertical WPSSs

  1. Focused PageRank
  2. Double Focused PageRank

In the vertical WPSSs, the behavior of the Web surfer is not one of complete randomness. For example, in the Focused PageRank system (FPS) the surfer is very much aware of what he/she is searching, but follows links according to the suggestions provided by a page classifier. As the authors explain,

"...he/she will trust the classifier suggestions following the links with a probability proportional to
the score of the page the links leads to."

FPS uses a topic-specific distribution for selecting a link to be followed. However, the decision on the action to take does not depend on the content of the current page. The Double Focused PageRank system (DFPS) takes into account that the decision about the action is usually dependent on the content of the current page.

The researchers explain,

"For example, let us suppose that two surfers are searching for a "Perl Language tutorial", and that the first one is located at the page "http://www.perl.com", while the second is located at the page "http://www.cnn.com". Clearly it is more likely that the first surfer will decide to follow a link from the current page while the second one will prefer to jump to another page which is related to the topic he is interested in."

After deriving the mathematical arguments for their horizontal and vertical scoring systems, the authors present their experimental results. Among others, the research group found that

  1. the two focused systems (FPS and DFPS) provide a smooth distribution of scores among pages
  2. FPS can filter many off-topic authoritative pages from the top positions
  3. DFPS can filter off-topic authorities, while pushing all the authorities on the relevant topic to the top positions
  4. the two focused systems provide a higher accuracy when searching focused authorities

The authors conclude their report with several interesting observations

Conclusion

Diligenti, Gori and Maggini have presented a general framework for Web Page Scoring Systems (WPSS) which incorporates, extends and improves many of the relevant models proposed in the literature. The authors were able to derive explicitly the PageRank and HITS scoring systems as special cases of a general framework. Many of the limitations found in these two systems were addressed by the research group.

The proposed framework incorporates a complete model of the behavior of a user surfing the Web. The model recognizes the dynamical nature of the Web surfer.

With two scoring systems (Focused PageRank and Double Focused PageRank), the authors demonstrated that the assumption of complete randomness of the underlying Web surfer is not accurate and cannot be held, especially when the Web surfer is performing topic-specific vertical searches.

The two vertical WPSSs were able to discriminate authorities on a given topic (off-topic filtering) without the use of external penalty functions.

The study demonstrates that a valid web page scoring system must take into account

The proposed general framework is a step in the right direction and an important advance in the study of scoring systems for horizontal and vertical searches.

Acknowledgements

The author would like to thank Michelangelo Diligenti from Dipartimento di Ingegneria dell'Informazione who kindly provided invaluable feedback on WPSS. PageRank is a Trademark of Google, Inc.

Feedback from Readers

This section, Last Edited: December 11, 02

One reader asked why PageRank produces a smooth scoring distribution among pages. Excellent question. In the original PageRank framework

  1. the probability of performing a random jump is x(j|p) = 1 - d
  2. the probability of following a link from the current page is x(l|p) = d

These probabilities are assumed to be constant and independent of the content of the pages. This assumption induces a constant smoothness in the scoring distribution function of PageRank which may not be necessarily constant when the content of the pages and the actual link structure of the Web graph are taken into consideration.

The above two probabilities, the 1 - d and d terms, although necessary are not enough to properly describe the probabilistic behavior of the Web surfer.

A third probability, i.e., the probability of landing into a page after a jump, x(p|j), must be taken into consideration. Surprisingly, this probability was not included in the PageRank framework.

Consider a user surfing a Web Graph G consisting of N pages. As correctly pointed out by Diligenti et al, the final destination or "target must be selected using a uniform probability distribution over all the N Web pages."

For N possible pages this uniform probability is x(p|j) = 1/N. Therefore, the probabilistic behavior of the Web surfer is best described by a scoring system which incorporates a term of the form

x(j|p)*x(p|j) = (1 - d)/N

The (1 - d)/N term is present in the improved versions of PageRank proposed by the authors and seems to be responsible for the smoothness of the corresponding scoring distribution functions.

A reader asked:

"If the 1 - d term is responsible for the smoothness observed in the scoring distribution of PageRank, what then accounts for a similar effect in the Double Focused PageRank?" This is a good question, since the (1 - d)/N term is not in this WPSS. We forwarded the question to Diligenti.

"There is still a smoothing factor in Double Focused PageRank but it is not constant. In particular, it depends on the content of each page: (1-d) * PS / PM where PS is the page score as computed by the classifier, PM is the maximum page score (as computed by the classifier).", Diligenti explained to me.

Note. The authors use a different notation in their original paper.

In the Double Focused PageRank the x(j|p) and x(l|p) probabilities depend on the relevance of the page score (as computed by the classifier) and on the maximum page score (as computed by the classifier). In addition, the probability of landing into a page after a jump no longer is x(p|j) = 1/N but proportional to its relevance.

These redefined probabilities make the Double Focused PageRank a crawler/human-interfaced scoring system and a "smart" probe for assessing the content or relevance of Web pages.

A reader asked what could be the main differences between Taher Haveliwala's Topic-Sensitive PageRank and the proposed Double Focused PageRank scoring system. (7, 8). We forwarded this question to Diligenti.

"Haveliwala's Topic-Sensitive PageRank is a combination of a ranking function, which is computed at query time (using a classification of the keywords). In particular, Haveliwala's Topic-Sensitive PageRank uses a simple surfer model to compute elementary ranking functions. Our Double Focused PageRank is a "smarter" surfer model. Our model could be used inside Haveliwala's Topic-Sensitive as elementary ranking function.", he stated.

The inherent roles of Web surfing dynamics and Web surfing interactions (or combined interactions) are still open questions. At the time of writing, Topic-Sensitive PageRank and non of the current web page scoring systems address these questions.

References
  1. http://www2002.org/CDROM/refereed/629/
  2. The PageRank Citation ranking: Bringing order to the web, L. Page, S. Brin, R. Motwani, and T. Winograd, tech. rep., Computer Science Department, Stanford University (1998).
  3. Authoritative sources in a hyperlinked environment, Jon Kleinberg, May Report RJ 10076, IBM (1997).
  4. The Anatomy of a Large-Scale Hypertextual Web Search Engine, Sergey Brin and Lawrence Page.
  5. Discrete Dynamical Systems, J. T. Sandefur, Oxford (1990).
  6. The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect, R. Lempel and S. Moran, ACM Transactions on Information Systems, vol. 19, pp. 131-160 (2001).
  7. Topic-Sensitive PageRank, Taher H. Haveliwala, Stanford University
  8. Efficient Computation of PageRank, Taher H. Haveliwala, Stanford University Technical Report (1999).
Thank you for using this site.
Status of the Current Document 
W3C CSS Validation  W3C XHTML Validation
Copyright © 2006 Mi Islita.com -