A probabilistic description-oriented approach for categorising Web documents
Abstract (Summary)
The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available in the Internet. We are facing a new challenge due to the fact that web documents have a rich structure and are highly heterogeneous. Two ways to respond to this challenge are (1) to use a representation of the content of web documents that captures these two characteristics and (2) to use more effective classifiers. Our categorisation approach is based on a probabilistic description-oriented representation of web documents, and a probabilistic interpretation of the k-nearest neighbour classifier. With the former, we provide an enhanced document representation that incorporates the structural and heterogeneous nature of web documents. With the latter, we provide a theoretical sound justification for the various parameters of k-nearest neighbour classifier. Experimental results show that (1) using an enhanced representation of web documents is crucial for an effective categorisation of web documents, and (2) a theoretical interpretation of the k-nearest neighbour classifier gives us improvement over the standard k-nearest neighbour classifier.
Bibliographical Information:
Advisor:none
School:Universität Duisburg-Essen, Standort Essen
School Location:Germany
Source Type:Master's Thesis
Keywords:informatik datenverarbeitung universitaet duisburg essen
ISBN:
Date of Publication:04/23/2004