Sequence mining algorithms
Abstract of thesis entitled "Sequence Mining Algorithms"
Submitted by Minghua ZHANG
for the degree of Doctor of Philosophy
at The University of Hong Kong
in August 2004
This thesis studies three related problems on mining patterns from sequences in the area of data mining. The three problems are: (1) mining frequent sequences from a transactional database; (2) incremental update of frequent sequences when the underlying database changes over time; and (3) mining frequently occurring periodic patterns with a gap requirement from sequences.
The first problem is to mine frequent sequences from a transactional database. A transactional database is composed of a set of sequences. Each sequence is a chronologically ordered list of transactions. The goal of this mining problem is to extract frequently occurring sequences in the database. Representative memory-efficient algorithms for this problem include GSP. We discuss the high I/O cost of GSP, particularly when the database contains long frequent sequences. To reduce the I/O requirement of GSP, we propose a new algorithm called MFS. The general strategy of MFS is to first find an approximate solution to the set of frequent sequences and then perform successive refinement until the exact set of frequent sequences is obtained. Experiments show that this successive refinement approach results in a significant improvement in I/O cost.
The second problem we study is the incremental update of frequent sequences. We discuss how MFS can be applied to the incremental update problem. In particular, the result of a previous mining exercise can be used (by MFS) as a good initial approximate solution for the mining of an updated database. This
results in an I/O efficient algorithm. To improve processing efficiency, we devise pruning techniques that, when coupled with GSP or MFS, result in two algorithms, GSP+ and MFS+, that are both CPU and I/O efficient. We also compare our incremental algorithms GSP+ and MFS+ with another incremental algorithm ISM. The comparison allows us to choose a suitable algorithm for solving the incremental update problem given a particular application with a specific dataset characteristic and a particular computing environment.
The third problem is to mine periodic patterns with a gap requirement from sequences. Given a character sequence S of length L and a pattern P of length I, a pattern P is regarded as a frequently occurring pattern in S if the probability of observing P given a randomly picked length-Z subsequence of S exceeds a certain threshold. In many applications, particularly those related to bioinformatics, interesting patterns are periodic with a gap requirement. That is to say, the characters in P should match subsequences of S in such a way that the matching characters in S are separated by gaps of more or less the same size. We show the complexity of this mining problem and discuss why traditional mining algorithms are computationally infeasible. We propose practical algorithms for solving the problem and study their characteristics. We also present a case study in which we apply our algorithm to real DNA sequences.
School:The University of Hong Kong
School Location:China - Hong Kong SAR
Source Type:Master's Thesis
Keywords:data mining computer algorithms
Date of Publication:01/01/2005