Evaluation of Text Classification Accuracy

by Howard, Bryan E

Abstract (Summary)
Libraries such as the National Library of Medicine frequently assign terms from a controlled vocabulary to improve document retrieval. The high cost of such manual efforts has motivated work in automated document classification. Framed using the knowledge discovery process, this paper compares classification performance based on various preprocessing, transformation, and data mining methods. Specifically, we explore the degree to which stemming, vocabulary selection using term weighting, and windowing increase the classification accuracy of the Naïve Bayes and J48 algorithms. We find that a process using the Naïve Bayes algorithm with a stop list, removal of data anomalies, TF*IDF weights in the range of 15 to 20, and a three-word window provides the highest classification accuracy.
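
The vocabulary selection step named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the stop list, the toy documents, the function names, and the weighting formula (raw term count times the natural log of N/df, taking the maximum over documents) are all assumptions made for illustration, and the cutoff range here is scaled to the toy data rather than the 15 to 20 range reported above.

```python
import math
from collections import Counter

# Hypothetical stop list; the thesis's actual list is not given in the abstract.
STOP_WORDS = {"the", "of", "a", "in", "and", "is"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_weights(docs):
    """Return {term: highest TF*IDF across documents}.

    Assumed formula: tf = raw count in a document, idf = ln(N / df).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = {}
    for tokens in tokenized:
        for term, tf in Counter(tokens).items():
            w = tf * math.log(n_docs / df[term])
            weights[term] = max(weights.get(term, 0.0), w)
    return weights

def select_vocabulary(weights, low, high):
    """Keep terms whose weight lies in the closed interval [low, high]."""
    return sorted(t for t, w in weights.items() if low <= w <= high)

# Toy corpus, invented for this sketch.
docs = [
    "the heart and the heart valve",
    "a study of the lung",
    "heart disease in the lung",
]
weights = tfidf_weights(docs)
vocab = select_vocabulary(weights, 0.5, 1.0)
```

With this toy corpus only "heart" falls inside the chosen range: terms that occur once in a single document score above it, and "lung", which occurs once in each of two documents, scores below it. The selected vocabulary would then serve as the feature set passed to a classifier such as Naïve Bayes.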

Bibliographical Information:

Advisor: Catherine Blake

School: University of North Carolina at Chapel Hill

School Location: USA - North Carolina

Source Type: Master's Thesis

Keywords: text mining data knowledge discovery in databases naive bayes j48


Date of Publication: 11/19/2007