Cluster-based retrieval from a language modeling perspective

by Liu, Xiaoyong

Abstract (Summary)
The standard approach to document retrieval assumes that the relevance of each document can be assessed independently: the fact that one document is relevant does not help predict the relevance of a closely related document. Cluster-based retrieval, on the other hand, assumes that the probability of relevance of a document should depend on the relevance of other similar documents to the same query. The goal is to find the best group of documents. The most common approach to cluster-based retrieval, which was proposed in the 1970s, is to retrieve one or more clusters in their entirety in response to a query. Research in this area has suggested that "optimal" clusters exist that, if retrieved, would yield very large improvements in effectiveness relative to document retrieval. However, no real retrieval strategy has achieved this result. Except for precision-oriented searches on very small data sets, document retrieval is found to be generally more effective. There has been a resurgence of research in cluster-based retrieval in the past few years, including our own efforts in this area. The general approach is to use clusters as a form of document smoothing. Studies have shown that clusters can indeed improve retrieval performance automatically on modern test collections, and that the language modeling framework is an effective probabilistic retrieval framework for studying this type of problem. This thesis revisits the problem of retrieving the best group of documents from the language-modeling perspective. We study both cluster smoothing and cluster retrieval. We analyze the advantages and disadvantages of a range of representation techniques, derive features that characterize good document clusters, and develop new probabilistic representations that capture the identified features. An extensive empirical evaluation is provided for the various techniques proposed in this work.
We find that whether good document clusters can be successfully identified or utilized by an IR system depends largely on how they are represented. Both the CBDM model for cluster smoothing and the geometric mean representation for cluster retrieval are shown to be effective approaches to cluster-based retrieval.
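The abstract describes using clusters as a form of document smoothing but does not give a formula. As an illustration only, the Python sketch below scores documents by query likelihood after interpolating each document's language model with the model of its cluster and the model of the whole collection. The two-stage linear interpolation, the weights `lam` and `beta`, the function names, and the toy documents are all assumptions of this sketch, not details taken from the thesis.

```python
import math
from collections import Counter


def ml_model(tokens):
    """Maximum-likelihood unigram model: relative frequency of each token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def smoothed_prob(word, doc_lm, cluster_lm, coll_lm, lam=0.7, beta=0.5):
    """Interpolate document, cluster, and collection estimates (assumed form)."""
    background = (beta * cluster_lm.get(word, 0.0)
                  + (1 - beta) * coll_lm.get(word, 0.0))
    return lam * doc_lm.get(word, 0.0) + (1 - lam) * background


def query_log_likelihood(query, doc_lm, cluster_lm, coll_lm):
    """Sum of log smoothed probabilities; query words must occur in the collection."""
    return sum(math.log(smoothed_prob(w, doc_lm, cluster_lm, coll_lm))
               for w in query)


# Hypothetical toy data: d1 and d2 share a cluster, d3 sits in its own cluster.
d1 = "cluster retrieval language model".split()
d2 = "cluster smoothing language model".split()
d3 = "football match result".split()

coll_lm = ml_model(d1 + d2 + d3)
cluster_a = ml_model(d1 + d2)
cluster_b = ml_model(d3)

query = ["smoothing"]
score_d1 = query_log_likelihood(query, ml_model(d1), cluster_a, coll_lm)
score_d2 = query_log_likelihood(query, ml_model(d2), cluster_a, coll_lm)
score_d3 = query_log_likelihood(query, ml_model(d3), cluster_b, coll_lm)
```

In this toy setup, `d1` does not contain the query term, yet it scores above `d3` because its cluster does. This is the intuition behind cluster smoothing: the relevance evidence of similar documents carries over to a document through the cluster model.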
Bibliographical Information:


School: University of Massachusetts Amherst

School Location: USA - Massachusetts

Source Type: Master's Thesis

Date of Publication: 01/01/2008
