Data cleaning techniques by means of entity resolution
Abstract (Summary)
Real data are “dirty.” Despite active research on integrity constraint enforcement
and data cleaning, real data in real database applications are still dirty. To make
matters worse, both the diverse formats and usages of modern data and the demand
for large-scale data handling make this problem even harder. To address the
challenges for which conventional solutions no longer work, we focus on one type
of problem known as Entity Resolution (ER): the process of identifying and merging
duplicate entities determined to represent the same real-world object. Although
the problem has been studied extensively, it is still not trivial to de-duplicate
complex entities among a large number of candidates.
In this thesis, we have studied three specialized types of ER problems: (1) the
Split Entity Resolution (SER) problem, in which instances of the same entity type
mistakenly appear under different name variants; (2) the Mixed Entity Resolution
(MER) problem, in which instances of different entities appear together because of
their homonymous names; and (3) the Grouped Entity Resolution (GER) problem, in
which instances of entities do not carry any name or description to which ER
techniques can be applied, and thus the contents of entities are exploited as a
group of elements. For each type of problem, we have developed a novel scalable
solution. In particular, for the GER problem, we have developed two graph-theoretic
algorithms, one based on Quasi-Clique and the other based on Bipartite Matching,
and have experimentally validated the superiority of the proposed solutions.
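To make the GER setting concrete, the following is a minimal sketch of how two
entities, each represented only as a group of elements, might be compared through
bipartite matching. It is an illustration under stated assumptions, not the
algorithm developed in the thesis: the element-level test (difflib string
similarity), the 0.8 threshold, the normalization by the smaller group, and all
function names are hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """Hypothetical element-level test: case-insensitive string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def max_bipartite_matching(left, right, threshold=0.8):
    """Return the size of a maximum matching between two groups of elements.

    An edge joins left[i] and right[j] when the two elements are judged
    similar. The matching is grown with simple augmenting paths.
    """
    adj = [[j for j, r in enumerate(right) if similar(l, r, threshold)]
           for l in left]
    match_right = [-1] * len(right)  # match_right[j] = matched left index

    def try_augment(i, seen):
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            # Use j if it is free, or if its current partner can move elsewhere.
            if match_right[j] == -1 or try_augment(match_right[j], seen):
                match_right[j] = i
                return True
        return False

    return sum(try_augment(i, set()) for i in range(len(left)))


def group_similarity(left, right, threshold=0.8):
    """Assumed group score: matched pairs divided by the smaller group size."""
    if not left or not right:
        return 0.0
    matched = max_bipartite_matching(left, right, threshold)
    return matched / min(len(left), len(right))


# Two groups with no usable name, compared only by their contents.
group_a = ["Data Cleaning", "Entity Resolution", "Query Processing"]
group_b = ["data cleaning", "entity resolution", "Graph Mining"]
print(group_similarity(group_a, group_b))
```

Two groups whose score exceeds some chosen cutoff would then be treated as
candidate duplicates. Using a one-to-one matching rather than counting all similar
pairs keeps a single popular element from inflating the score.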
Bibliographical Information:
School: Pennsylvania State University
School Location: USA - Pennsylvania
Source Type: Master's Thesis