Evaluation of Text Classification Accuracy
controlled vocabulary to improve document retrieval. The high cost of such manual
efforts has motivated work in automated document classification. Framed using the
knowledge discovery process, this paper compares classification performance based on
various preprocessing, transformation, and data mining methods. Specifically, we explore
the degree to which stemming, vocabulary selection using term weighting, and
windowing increase the classification accuracy of the Naïve Bayes and J48 algorithms. We
find that a process using the Naïve Bayes algorithm with a stop list, removal of data
anomalies, TF*IDF weights in the range of 15 to 20, and a three-word window
provides the highest classification accuracy.
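
The preprocessing and transformation steps named in the abstract (stop list, TF*IDF-based vocabulary selection, windowing) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the thesis's actual procedure: the stop list contents, the weighting formula (corpus term frequency times log(N / document frequency)), the reading of "windowing" as sliding runs of consecutive tokens, and all function names. The classification step itself (Naïve Bayes or J48) is omitted.

```python
import math
from collections import Counter

# Hypothetical stop list; the thesis's actual list is not given in the abstract.
STOP_WORDS = {"the", "a", "of", "to", "and", "in", "is", "on"}

def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation, drop stop words."""
    tokens = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in tokens if w and w not in STOP_WORDS]

def tfidf_weights(docs):
    """Return {term: tf * log(N / df)}.

    Assumed formula: tf is the term's total count over the corpus,
    df is the number of documents containing the term, N is the corpus size.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    tf = Counter(t for doc in tokenized for t in doc)
    df = Counter(t for doc in tokenized for t in set(doc))
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

def select_vocabulary(weights, low, high):
    """Keep only terms whose weight falls within [low, high]."""
    return {t for t, w in weights.items() if low <= w <= high}

def windows(tokens, size=3):
    """Sliding windows of `size` consecutive tokens (one possible reading of 'windowing')."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]
```

Under these assumptions, a call such as select_vocabulary(tfidf_weights(docs), 15, 20) would mirror the weight range reported above, although the numeric scale of the weights depends on the formula and corpus, so the thesis's 15 to 20 range need not carry over to this sketch.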
Advisor: Catherine Blake
School: University of North Carolina at Chapel Hill
School Location: USA - North Carolina
Source Type: Master's Thesis
Keywords: text mining data knowledge discovery in databases naive bayes j48
Date of Publication: 11/19/2007