Techniques for managing and analyzing unconventional data
Abstract of thesis entitled
Techniques for Managing and Analyzing Unconventional Data
Ho Wai Shing
for the degree of Doctor of Philosophy
at the University of Hong Kong
in August 2004
With the explosive growth of information technology, many new applications have been developed and a huge volume of data is being collected or generated. Some data in these new applications differ significantly from traditional relational data. They may fit a different data model or show different data characteristics. Such data can be described as unconventional data. Traditional database management and analysis techniques, which mainly focus on relational data, may not be applicable to unconventional data. Therefore, new techniques are required for this kind of data. This study considers techniques for managing and analyzing some commonly available unconventional data through four examples.
XML documents, which have a graph-like data model, are very different from traditional relational data. Evaluating queries on the structure of XML documents is an important task in managing this kind of data. The selectivity of simple path expressions (SPEs) is an important statistic that helps a query optimizer find the best evaluation plan for such queries. A new data structure called SF-Tree is proposed to store such statistics in a flexible, efficient and accurate manner.
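To make the notion of SPE selectivity concrete, the following minimal sketch counts, by brute force, how many elements of a small XML document match a simple path expression. It is not the SF-Tree. It assumes SPEs of the form `//t1/t2/.../tn` and takes selectivity to mean the number of elements whose root-to-element label path ends with that tag sequence. The names `path_counts`, `selectivity` and `DOC` are hypothetical and chosen for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document, used only for illustration.
DOC = (
    "<lib>"
    "<book><title/><author/></book>"
    "<book><title/></book>"
    "<journal><title/></journal>"
    "</lib>"
)

def path_counts(xml_text):
    """Count how many elements share each root-to-element label path."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        counts[path] += 1
        for child in node:
            stack.append((child, path + (child.tag,)))
    return counts

def selectivity(counts, spe):
    """Number of elements matched by an SPE written as //t1/t2/.../tn
    (assumed form: child steps only, no wildcards or predicates)."""
    tags = tuple(spe.lstrip("/").split("/"))
    # An element matches when its label path ends with the SPE's tags.
    return sum(c for path, c in counts.items() if path[-len(tags):] == tags)

counts = path_counts(DOC)
print(selectivity(counts, "//book/title"))  # 2
print(selectivity(counts, "//title"))       # 3
```

A table of exact counts like this grows with the number of distinct label paths in the document, which is one reason a more compact structure for storing such statistics is of interest.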
Another kind of unconventional data is the product pages of e-commerce web sites. Organizing the product pages so that visitors can easily access the information they want is crucial to the success of those sites. A new metric, based on the extra information in the product pages including their popularity and the importance of their attributes, is proposed to quantify the quality of a web site. GENCAT, an efficient greedy algorithm, is proposed for automatically building a good catalog organization from scratch from a set of product pages.
Unconventional data may show different data characteristics from traditional relational data. In a set of labeled data where the class of interest is a minority class, traditional analysis techniques, which focus on significant classes, are not applicable. Three efficient algorithms are therefore proposed to find maximal significant probable summarizations (MSPSs), which can be regarded as a concise summary of the tuples of the class of interest, in such data.
In other applications, data may be transmitted in continuous, and sometimes bursty, streams. Traditional data analysis tools, which focus on analyzing persistent sets of data, are not applicable to data streams. An architecture for finding frequent itemsets over a bursty data stream is therefore discussed. The architecture provides a feedback mechanism enabling the system to adapt to a bursty environment.
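As a rough illustration of why stream analysis differs from analysis of persistent data, the sketch below applies Lossy Counting, a well-known one-pass approximate counting scheme, to single items. It is not the architecture discussed in the thesis: it handles items rather than itemsets and has no feedback mechanism for bursts. The class name `LossyCounter` and its parameters are hypothetical.

```python
import math

class LossyCounter:
    """One-pass approximate frequency counts with error at most epsilon * n."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # bucket width
        self.n = 0                           # items seen so far
        self.entries = {}                    # item -> [count, max undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # The item may have been seen and pruned in earlier buckets.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # At each bucket boundary, drop items that cannot be frequent.
            self.entries = {k: v for k, v in self.entries.items()
                            if v[0] + v[1] > bucket}

    def frequent(self, support):
        """Items whose recorded count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {k for k, v in self.entries.items() if v[0] >= threshold}

counter = LossyCounter(epsilon=0.25)
for item in ["a", "b", "a", "c", "a", "d", "a", "e"]:
    counter.add(item)
print(counter.frequent(0.5))  # {'a'}
```

The counter keeps only a small summary rather than the stream itself, so memory stays bounded regardless of how many items arrive.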
These new techniques are evaluated by experiments and the results show that they can manage and analyze unconventional data efficiently and effectively, and can outperform traditional methods in such applications.
School: The University of Hong Kong
School Location: China - Hong Kong SAR
Source Type: Master's Thesis
Keywords: xml document markup language data mining database management
Date of Publication: 01/01/2005