Information Retrieval: Algorithms and Heuristics
Review of Grossman and Frieder's book on information retrieval, algorithms and heuristics
Dr. E. Garcia
Mi Islita.com
Last Update: 09/03/05
Information Retrieval: Algorithms and Heuristics
By David A. Grossman and Ophir Frieder. Springer, 2nd Edition, 2004, 332p., illus., biblio., index. (The Kluwer International Series on Information Retrieval). ISBN 1-4020-3004-5 (PB) $38.00.
Audience
Information Retrieval is a textbook for computer science students and a reference book for IR practitioners.
Structure
The book consists of Contents, List of Figures, Forewords, Acknowledgments, nine chapters, References and Index sections. The forewords are by Bruce Croft, and each chapter ends with Summary and Exercises sections. The first part of the book addresses relevancy and algorithms, while the second part focuses on architecture.
Subjects: Computer Science, Information Retrieval, Internet, Search Engines
Table of Contents
Dedication v
List of Figures xi
Foreword xiii
Preface xv
Acknowledgments xix
1. Introduction 1
2. Retrieval Strategies 9
3. Retrieval Utilities 93
4. Cross-Language Information Retrieval 149
5. Efficiency 181
6. Integrating Structured Data and Text 211
7. Parallel Information Retrieval 257
8. Distributed Information Retrieval 275
9. Summary and Future Directions 291
References 299
Index 331
Highlights
Chapter 1 briefly discusses precision and recall curves and states that the primary focus of the book is the ad hoc aspect of IR; i.e., the retrieval of relevant information in response to user queries.
Chapter 2 provides a comprehensive explanation of term weighting techniques relevant to the Vector Space Model. Poisson, Latent Semantic Indexing and Inference Network models are also covered. One section of the chapter is dedicated to genetic algorithms.
Chapter 3 discusses the following utilities: relevance feedback, clustering, passage-based retrieval, parsing, N-grams, thesauri, semantic networks and regression analysis. The authors provide specific examples with how-to calculations. The section on semantic networks covers R-distance and K-distance measures.
Chapter 4 presents a rarely seen and refreshing discussion of bilingual corpus strategies, cross-language relevance feedback and relevance scores for English/Spanish terms. This differentiates the book from traditional IR textbooks.
Chapter 5 covers standard efficiency methods such as inverted index building, compression, pruning, and signature files. Detection of duplicated content is limited to query-independent methods.
Chapter 6 covers relational databases and Boolean Retrieval. In this chapter, the authors provide a hands-on approach to information retrieval through 23 SQL snippets.
Chapters 7 and 8 include some information on query log analysis and PageRank. The main focus of these chapters is architecture, in particular parallel indexing, distributed retrieval and peer-to-peer.
Chapter 9 provides a summary of activities and future directions within the IR and TREC community.
Comments
The References section consists of 32 pages and is an excellent starting point for conducting literature research. However, the Index section is limited to fewer than two pages, which affects its usability.
The best features of the book are its readability (it was reviewed by students and IR scientists) and the authors' willingness to provide step-by-step hands-on examples and plenty of graphics. The authors' transparency with how-to calculations encourages readers to learn by doing. However, the book's usability could be improved by expanding its Index section and by adding Glossary of Terms and Answers to Exercises sections.
While reviewing the material relevant to the Extended Boolean Model (1, 2), we found on page 69 what appears to be a misplacement error. This is found in the equation for similarity scores using the AND operator. The p-norm exponent was placed inside the parentheses, i.e., (1 - w^p), when it should be placed outside of these, i.e., (1 - w)^p. (E. G. Note: A few days ago, I informed the authors about this finding and they were forthcoming in acknowledging the error.)
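To see why the placement of the exponent matters, here is a minimal sketch. It assumes the commonly cited unweighted-query form of the p-norm AND similarity from the extended Boolean literature, sim = 1 - ((sum of (1 - w_i)^p) / n)^(1/p), which may not match the book's notation exactly. The function names are hypothetical and used only for illustration.

```python
def pnorm_and(weights, p):
    """p-norm AND similarity with the exponent outside the parentheses: (1 - w)^p.

    Assumes document term weights in [0, 1], equal query term weights, p >= 1.
    """
    if not weights or p < 1:
        raise ValueError("need at least one weight and p >= 1")
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)


def pnorm_and_misplaced(weights, p):
    """Same formula with the exponent misplaced inside the parentheses: (1 - w^p).

    Shown only to illustrate the effect of the misplacement.
    """
    if not weights or p < 1:
        raise ValueError("need at least one weight and p >= 1")
    n = len(weights)
    return 1.0 - (sum(1.0 - w ** p for w in weights) / n) ** (1.0 / p)


if __name__ == "__main__":
    doc_weights = [0.5, 1.0]  # illustrative term weights, not taken from the book
    for p in (1, 2):
        print(p, pnorm_and(doc_weights, p), pnorm_and_misplaced(doc_weights, p))
```

For p = 1 the two forms agree, so the misplacement is invisible there. For the illustrative weights 0.5 and 1.0 with p = 2, the form with the exponent outside gives about 0.646, while the misplaced form gives about 0.388.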
The book does not cover detection of duplicated content based on query-dependent methods (3), search results filtering based on re-ranking methods, personalization or collaborative filtering. Future editions of the book should include these subjects since they affect the retrieval of relevant information in response to users' queries.
Recommendations
This book is recommended for IR practitioners, computer science departments and technical libraries. It is also recommended for search engine optimization specialists and web developers who need to incorporate search technology, ranking algorithms and heuristics into their projects.
References
1. Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types; E. Fox, Cornell University dissertation, Aug. 1983. Available from: University Microfilms International, Ann Arbor, Michigan.
2. The Extended Boolean Model; E. Garcia (2005).
3. Patents on Duplicated Content and Re-Ranking Methods; E. Garcia, Search Engine Strategies 2005 Conference, San Jose, CA; August 8 - 11 (2005).

