An efficient and incremental system to mine contiguous frequent sequences
Abstract (Summary)
Mining frequent patterns is an important component of many prediction systems. One
common usage in web applications is the mining of users’ access behavior for the purpose
of predicting and hence pre-fetching the web pages that the user is likely to visit.
Frequent sequence mining approaches in the literature are often based on the use of
an Apriori-like candidate generation strategy, which typically requires numerous scans of
a potentially huge sequence database. In this paper we instead introduce a more efficient
strategy for discovering frequent patterns in sequence databases that requires only two
scans of the database. The first scan obtains support counts for subsequences of length
two. The second scan extracts potentially frequent sequences of any length and represents
them as a compressed frequent sequences tree structure (FS-tree). Frequent sequence patterns
are then mined from the FS-tree. Incremental and interactive mining functionalities
are also facilitated by the FS-tree. As part of this work, we developed the FS-Miner, an
system that discovers frequent sequences from web log files. The FS-Miner has the ability
to adapt to changes in users’ behavior over time, in the form of new input sequences, and
to respond incrementally without the need to perform full re-computation. Our system
also allows the user to change the input parameters (e.g., minimum support and desired
pattern size) interactively without requiring full re-computation in most cases.
We have tested our system using two different data sets, comparing it against two other
algorithms from the literature. Our experimental results show that our system scales up
linearly with the size of the input database. Furthermore, it exhibits excellent adaptability
to support threshold decreases. We also show that the incremental update capability of
the system provides significant performance advantages over full re-computation even for
relatively large update sizes.
Bibliographical Information:
Advisor:
School:Worcester Polytechnic Institute
School Location:USA - Massachusetts
Source Type:Master's Thesis
Keywords:data mining world wide web internet searching
ISBN:
Date of Publication: