Abstract (Summary)
Organizations increasingly need, and benefit from, integrated access to multiple data sources. Database integration has two aspects: schema integration and data integration. Schema integration arrives at a common schema representing the elements of the source schemas. Data integration involves detecting and merging multiple instances of the same real-world entities from different databases. Entity identification is necessary when there is no common means of identification such as primary keys, and it is usually solved manually. This thesis focuses on solving the entity identification problem in an automated way using data mining techniques. We use automated learning techniques to identify characteristics or patterns found in entities and apply this knowledge to detect multiple instances of the same entity. The data mining techniques that we use are decision trees and k-nearest neighbors (k-NN). Our approach preprocesses the data before employing the data mining techniques. The preprocessing partitions the data into clusters, and entity identification is performed on each cluster. To study the performance of the proposed algorithms, we use a small database of 2500 records and vary parameters such as training set size and number of unique entities in our experiments. Our experiments study the impact of our preprocessing algorithm on both a decision tree implementation and a k-NN implementation as the classification techniques. We examine whether accuracy and processing speed are improved, unaffected or adversely affected. For our testbed, decision trees run significantly faster on the clustered data sets than on the unclustered data sets, for both small and large training set sizes.
On the other hand, accuracy with clustering is always lower than accuracy without clustering, but the gap narrows as the number of unique entities increases. Clustering errors do not significantly affect the accuracy of any of the classification techniques for any data set (clustered or unclustered). On clustered data sets, the processing time is always lower with the decision tree technique than with k-NN; however, the difference in processing time between the two techniques shrinks as the training set size decreases. The decision tree technique gives better accuracy than the k-NN technique in all cases except on data sets with a small number of unique entities and a small training set size.
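The cluster-then-classify idea described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the clustering rule (`block_key`, grouping by the first letter of the surname), the field names, the string-similarity features, the framing of the task as classifying record pairs, and the tiny hand-labeled training set are all assumptions made for illustration. A small k-NN classifier labels each record pair inside a cluster as the same entity or not; a decision tree could be substituted for `knn_predict` without changing the rest.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(record):
    # Hypothetical clustering rule: first letter of the last name token.
    return record["name"].split()[-1][0].lower()


def pair_features(a, b):
    # One similarity score in [0, 1] per field (field names are assumed).
    return tuple(
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
        for f in ("name", "city")
    )


def knn_predict(train, x, k=3):
    # train is a list of (feature_vector, label); majority vote of the
    # k training points closest to x in squared Euclidean distance.
    nearest = sorted(
        train, key=lambda t: sum((p - q) ** 2 for p, q in zip(t[0], x))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def find_matches(records, train, k=3):
    # Step 1: preprocessing groups record indices into clusters.
    clusters = defaultdict(list)
    for idx, rec in enumerate(records):
        clusters[block_key(rec)].append(idx)
    # Step 2: only pairs inside the same cluster are classified.
    matches = []
    for members in clusters.values():
        for i, j in combinations(members, 2):
            if knn_predict(train, pair_features(records[i], records[j]), k):
                matches.append((i, j))
    return matches


# Illustrative, hand-made training pairs: (name_sim, city_sim) -> same entity?
train = [
    ((1.0, 1.0), True), ((0.9, 1.0), True), ((0.85, 0.9), True),
    ((0.3, 0.2), False), ((0.4, 1.0), False), ((0.2, 0.5), False),
]

records = [
    {"name": "John Smith", "city": "Cincinnati"},
    {"name": "Jon Smith", "city": "Cincinnati"},
    {"name": "Mary Stone", "city": "Dayton"},
    {"name": "Alan Brown", "city": "Columbus"},
]

print(find_matches(records, train))
```

The sketch also shows where the reported trade-off comes from: restricting comparisons to pairs within a cluster reduces the number of classifications, but two records of the same entity that land in different clusters are never compared, so a clustering error becomes a missed match.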
Bibliographical Information:


School: University of Cincinnati

School Location: USA - Ohio

Source Type: Master's Thesis

Keywords: entity identification, data mining

Date of Publication: 01/01/2001
