Maintenance of partial-sum-based histograms

by Kan, Kin-fai

Abstract (Summary)
Abstract of thesis entitled "Maintenance of Partial-Sum-Based Histograms"

submitted by Kan Kin Fai

for the degree of Master of Philosophy at the University of Hong Kong

in December 2002

Wavelet-based histograms are an effective alternative to traditional partition-based histograms. They offer higher accuracy for selectivity estimation and extend naturally to the multi-dimensional case. They can be computed either from raw data distributions or from partial sums. Both methods give good results, but the partial-sum-based method tends to work better for low dimensions and range queries. Despite these advantages, wavelet-based histograms are harder to maintain because of the nontrivial mathematical transformation involved. Previous work on the maintenance of wavelet-based histograms has concentrated only on the raw-data-based method.

The focus of this thesis is the maintenance problem for wavelet-based histograms built on partial sums. Two approaches are proposed to compute the effect of data updates on the wavelet decomposition of the partial sums. The naive approach first identifies the affected partial sums and then computes how the changes to those partial sums propagate to the wavelet decomposition; its run-time therefore depends on the number of affected partial sums, and hence on the domain size of the attribute. The fast approach exploits structural properties of the wavelet decomposition of partial sums, and its run-time depends only on the logarithm of the domain size. Building on the fast approach, the semi-probabilistic-counting (semi-PC) method is proposed to maintain the set of significant wavelet coefficients stored in the histogram. Both the fast approach and the semi-PC method are also extended to the multi-dimensional case.
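The intuition behind a logarithmic-time update can be illustrated with a small sketch (an assumption-laden illustration, not the thesis's actual algorithm). Inserting a count c at position k increases every partial sum S[i] with i >= k by c, i.e. the partial-sum vector changes by a step function. Under the Haar decomposition, at each resolution level at most one detail coefficient, the one whose support straddles position k, is affected, so only O(log N) coefficients change in total. The function names below are illustrative, and an unnormalized Haar transform (pairwise averaging and differencing) is assumed.

```python
def haar_decompose(values):
    """Unnormalized Haar decomposition by pairwise averaging/differencing.

    Returns [overall average, coarsest detail, ..., finest details];
    len(values) must be a power of two.
    """
    data = list(values)
    details = []
    while len(data) > 1:
        avgs = [(data[2 * i] + data[2 * i + 1]) / 2 for i in range(len(data) // 2)]
        diffs = [(data[2 * i] - data[2 * i + 1]) / 2 for i in range(len(data) // 2)]
        details = diffs + details  # coarser levels end up in front
        data = avgs
    return data + details


def fast_step_coeffs(N, k, c):
    """O(log N) Haar coefficients of the step function f[i] = c for i >= k.

    Inserting count c at position k changes every partial sum S[i], i >= k,
    by exactly c, so by linearity of the Haar transform these are exactly
    the coefficient deltas.  Returns a sparse {coefficient index: change} map.
    """
    deltas = {}
    b, v = k, c          # positions < b are 0, position b holds v, positions > b hold c
    offset = N // 2      # where the current level's details start in the vector
    length = N
    while length > 1:
        j = b // 2                     # index of the pair straddling the boundary
        if b % 2 == 0:                 # pair is (v, c)
            diff, v = (v - c) / 2, (v + c) / 2
        else:                          # pair is (0, v)
            diff, v = -v / 2, v / 2
        if diff != 0:
            deltas[offset + j] = diff  # at most one nonzero detail per level
        b, length, offset = j, length // 2, offset // 2
    deltas[0] = deltas.get(0, 0.0) + v  # overall average of the step
    return deltas
```

Because the Haar transform is linear, adding these deltas to the stored coefficients is equivalent to re-decomposing the updated partial sums while touching only about log N + 1 coefficients, which is the source of the logarithmic run-time.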

The performance of the algorithms was verified experimentally. Comparing the efficiency of the naive and fast approaches, the results show that the fast approach is over 600 times faster than the naive approach when the domain size is 4096 and the insertion sequence contains 100K entries. When the domain size doubles, the run-time of the fast approach stays about the same, whereas the run-time of the naive approach almost doubles. The accuracy of the semi-PC method was also evaluated on a wide range of datasets, against both the static method and the ideal method: the static method updates only the wavelet coefficients already stored in the histogram, while the ideal method rebuilds the histogram from scratch upon every data update. The experimental results show that the semi-PC method outperforms the static method and is, in general, competitive with the ideal method.

Bibliographical Information:


School: The University of Hong Kong

School Location: China - Hong Kong SAR

Source Type: Master's Thesis

Keywords: database management, distribution, probability theory

Date of Publication: 01/01/2003

© 2009 All Rights Reserved.