Representation and learning in information retrieval
Abstract (Summary)This dissertation introduces a new theoretical model for text classification systems, including systems for document retrieval, automated indexing, electronic mail filtering, and similar tasks. The Concept Learning model emphasizes the role of manual and automated feature selection and classifier formation in text classification. It enables drawing on results from statistics and machine learning in explaining the effectiveness of alternate representations of text, and specifies desirable characteristics of text representations. The use of syntactic parsing to produce indexing phrases has been widely investigated as a possible route to better text representations. Experiments with syntactic phrase indexing, however, have never yielded significant improvements in text retrieval performance. The Concept Learning model suggests that the poor statistical characteristics of a syntactic indexing phrase representation negate its desirable semantic characteristics. The application of term clustering to this representation to improve its statistical properties while retaining its desirable meaning properties is proposed. Standard term clustering strategies from information retrieval (IR), based on cooccurrence of indexing terms in documents or groups of documents, were tested on a syntactic indexing phrase representation. In experiments using a standard text retrieval test collection, small effectiveness improvements were obtained. As a means of evaluating representation quality, a text retrieval test collection introduces a number of confounding factors. In contrast, the text categorization task allows much cleaner determination of text representation properties. In preparation for the use of text categorization to study text representation, a more effective and theoretically well-founded probabilistic text categorization algorithm was developed, building on work by Maron, Fuhr, and others. Text categorization experiments supported a number of predictions of the Concept Learning model about properties of phrasal representations, including dimensionality properties not previously measured for text representations. However, in carefully controlled experiments using syntactic phrases produced by Church's stochastic bracketer, in conjunction with reciprocal nearest neighbor clustering, term clustering was found to produce essentially no improvement in the properties of the phrasal representation. New cluster analysis approaches are proposed to remedy the problems found in traditional term clustering methods.
School Location:USA - Massachusetts
Source Type:Master's Thesis
Date of Publication:01/01/1992