by Hsieh, Yu-Ming

Abstract (Summary)
Discovery of {it association rules} is an important problem in the area of data mining. Given a database of sales transactions, it is desirable to discover the important associations among items such that the presence of some items in a transaction will imply the presence of other items in the same transaction. Since mining association rules may require to repeatedly scan through a large transaction database to find different association patterns, the amount of processing could be huge, and performance improvement is an essential concern. Among this problem, how to efficiently {it count large itemsets} is the major work, where a large itemset is a set of items appearing in a sufficient number of transactions. In this thesis, we propose efficient algorithms for mining association rules based on a high-level set-based approach. A set-based approach allows a clear expression of what needs to be done as opposed to specifying exactly how the operations are carried out in a low-level approach, where a low-level approach means to retrieve one tuple from the database at a time. The advantage of the set-based approach, like the SETM algorithm, is simple and stable over the range of parameter values. However, the SETM algorithm proposed by Houtsma and Swami may generate too many invalid candidate itemsets. Therefore, in this thesis, we propose a set-based algorithm called SETM*, which provides the same advantages of the SETM algorithm, while it avoids the disadvantages of the SETM algorithm. In the SETM* algorithm, we reduce the size of the candidate database by modifying the way of constructing it, where a candidate database is a transaction database formed with candidate $k$-itemsets. Then, based on the new way to construct the candidate database in the SETM* algorithm, we propose SETM*-2K, mbox{SETM*-MaxK} and SETM*-Lmax algorithms. In the SETM*-2K algorithm, given a $k$, we efficiently construct $L_{k}$ based on $L_{w}$, where $w=2^{lceil log_{2}k ceil - 1}$, instead of step by step. In the SETM*-MaxK algorithm, we efficiently to find the $L_{k}$ based on $L_{w}$, where $L_{k} ot= emptyset, L_{k+1}=emptyset$ and $w=2^{lceil log_{2}k ceil - 1}$, instead of step by step. In the SETM*-Lmax algorithm, we use a forward approach to find all maximal large itemsets from $L_{k}$, and the $k$-itemset is not included in the $k$-subsets of the $j$-itemset, except $k=MaxK$, where $1 leq k < j leq MaxK$, $L_{MaxK} ot= emptyset$ and $L_{MaxK+1}=emptyset$. We conduct several experiments using different synthetic relational databases. The simulation results show that the SETM* algorithm outperforms the SETM algorithm in terms of storage space or the execution time for all relational database settings. Moreover, we show that the proposed SETM*-2K and SETM*-MaxK algorithms also require shorter time to achieve their goals than the SETM or SETM* algorithms. Furthermore, we also show that the proposed forward approach (SETM*-Lmax) to find all maximal large itemsets requires shorter time than the backward approach proposed by Agrawal.
Bibliographical Information:

Advisor:Ye-In Chang; San-Yih Hwang; Tei-Wei Kuo; Chien-I Lee

School:National Sun Yat-Sen University

School Location:China - Taiwan

Source Type:Master's Thesis

Keywords:association rule data mining


Date of Publication:07/28/2000

© 2009 All Rights Reserved.