Maintenance of partial-sum-based histograms
submitted by Kan Kin Fai
for the degree of Master of Philosophy at the University of Hong Kong
in December 2002
Wavelet-based histograms are an effective alternative to traditional partition-based histograms. They offer higher accuracy for selectivity estimation and extend naturally to the multi-dimensional case. They can be computed from either raw data distributions or partial sums; both methods give good results, but the partial-sum-based method tends to work better for low dimensions and range queries. While wavelet-based histograms have several advantages, they are harder to maintain because of the nontrivial mathematical transformation involved. Previous work on the maintenance of wavelet-based histograms has concentrated only on the raw-data-based method.
This thesis studies the maintenance problem of wavelet-based histograms built on partial sums. Two approaches are proposed to compute the effects of data updates on the wavelet decomposition of partial sums. The naive approach first determines the affected partial sums and then computes how the changes to those partial sums affect the wavelet decomposition; its run-time therefore depends on the number of affected partial sums, and thus on the domain size of the attribute. The fast approach exploits structural properties of the wavelet decomposition of partial sums, and its run-time depends only on the logarithm of the domain size. Building on the fast approach, the semi-probabilistic-counting (semi-PC) method is proposed to maintain the set of significant wavelet coefficients stored in the histogram. Both the fast approach and the semi-PC method are extended to the multi-dimensional case.
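The intuition behind the two approaches can be illustrated with a small sketch (not the thesis's actual algorithms; a hypothetical Python illustration using an unnormalized Haar decomposition). Inserting one tuple with attribute value i increments every partial sum at positions ≥ i, so the naive view touches O(N) sums; yet because the wavelet transform is linear and the difference is a step function, only O(log N) Haar coefficients actually change:

```python
def haar(v):
    """Unnormalized Haar decomposition (input length must be a power of two).

    Returns the overall average followed by the detail coefficients,
    coarsest scale first.
    """
    v = list(v)
    details = []
    while len(v) > 1:
        avgs = [(v[2 * k] + v[2 * k + 1]) / 2 for k in range(len(v) // 2)]
        dets = [(v[2 * k] - v[2 * k + 1]) / 2 for k in range(len(v) // 2)]
        details = dets + details  # prepend: coarser scales end up first
        v = avgs
    return v + details

N = 16
data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]

# Partial (prefix) sums of the data distribution.
prefix, s = [], 0
for x in data:
    s += x
    prefix.append(s)
before = haar(prefix)

# Insert one tuple with attribute value 5: every partial sum at
# positions >= 5 increases by 1 (the naive approach inspects all of them).
prefix2 = [p + (1 if j >= 5 else 0) for j, p in enumerate(prefix)]
after = haar(prefix2)

# Only O(log N) coefficients differ, which is what a fast approach exploits.
changed = sum(1 for a, b in zip(before, after) if a != b)
print(changed)  # 5 == log2(16) + 1
```

The step function added to the prefix sums is resolved by exactly one detail coefficient per scale (the one whose support straddles position i), plus the overall average, which is why the update cost can be made logarithmic in the domain size.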
The performance of the algorithms was verified experimentally. The efficiency of the naive and fast approaches was compared: the experimental results show that the fast approach is over 600 times faster than the naive approach when the domain size is 4096 and the insertion sequence contains 100K entries. When the domain size doubles, the run-time of the fast approach remains about the same, whereas that of the naive approach almost doubles. The accuracy of the semi-PC method was also evaluated on a wide range of datasets, comparing it with the static method and the ideal method: the static method just updates the wavelet coefficients in the histogram, while the ideal method rebuilds the histogram from scratch upon every data update. The experimental results show that the semi-PC method outperforms the static method and is, in general, competitive with the ideal method.
School: The University of Hong Kong
School Location: China - Hong Kong SAR
Source Type: Master's Thesis
Keywords: database management, distribution, probability theory
Date of Publication: 01/01/2003