Abstract (Summary)
Organizational decision-making involves accessing and integrating data that resides in various autonomous, localized databases managed by different parts of the organization. The data warehousing approach extracts data from different sources in advance, integrates it, and stores it at a centralized location to answer queries posed to support decision-making. These sets of data stored at the centralized location are known as materialized views, and auxiliary data structures such as indexes can be built on them to speed up data retrieval. Since the amount of data available in the source databases can be much larger than the storage space available in the warehouse, not all source data can be materialized. Materialized views are built for frequently posed queries, and any other data is fetched from the sources when required. Identifying the data to be materialized as views and the indexes to be built on these views is a leading research issue in data warehousing. Because the search space for this problem is large, we explore the use of genetic algorithms (GAs) to select materialized views and indexes in a data warehouse. We seek to minimize query response time for a given workload subject to a limit on the storage area. We employ the technique of multiple query processing (MQP) to build a multiple view processing plan (MVPP) of the frequently posed queries. The MVPPs are used to encode the view and index solution space of the GA. To study the performance of our GA, we compare it to another simple GA (VSnoISGA) that selects only materialized views, not indexes, for the same amount of storage area. The input for the algorithms is obtained from a testbed whose schema is the same as that of the industry-standard TPC-H benchmark. The query set of our testbed is an extension of the TPC-H-SPJ query set, modified to contain shared data sets among the queries.
The inputs for the algorithms are MVPPs built from subsets of queries of the testbed query set. We compare two important metrics to study the performance of our algorithm relative to VSnoISGA. The first metric measures the computational resources consumed by the algorithms in terms of CPU time. The second metric compares solution quality in terms of the sum of the query processing costs of the queries in the input MVPP. Based on our experimental results, we conclude that the selection of indexes is not worthwhile when the size of most of the indexes and corresponding views is greater than or equal to the available storage area.
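To make the idea of a GA that selects views and indexes under a storage limit concrete, the following is a minimal sketch and not the implementation used in the thesis. It assumes a flat bit-string encoding (one bit per candidate view or index) rather than the MVPP-based encoding, and a simplified cost model in which each selected structure subtracts a fixed saving from the workload cost. The candidate names, sizes, savings, `BASE_COST`, and `STORAGE_LIMIT` are all made-up values for illustration.

```python
import random

# Hypothetical candidates: (name, size, cost saving, view the index depends on).
# All values are illustrative assumptions, not figures from the thesis.
CANDIDATES = [
    ("v1", 40, 100, None),
    ("v2", 30, 60, None),
    ("v3", 50, 90, None),
    ("i1_on_v1", 10, 30, "v1"),
    ("i2_on_v2", 15, 25, "v2"),
]
BASE_COST = 500      # assumed workload cost with nothing materialized
STORAGE_LIMIT = 90   # assumed storage budget


def repair(bits):
    """Deselect any index whose underlying view is not selected."""
    chosen = {c[0] for c, b in zip(CANDIDATES, bits) if b}
    return [b if (c[3] is None or c[3] in chosen) else 0
            for c, b in zip(CANDIDATES, bits)]


def total_size(bits):
    """Storage used by the selected views and indexes."""
    return sum(c[1] for c, b in zip(CANDIDATES, bits) if b)


def cost(bits):
    """Workload cost to minimize; overflowing the storage limit is penalized."""
    overflow = total_size(bits) - STORAGE_LIMIT
    if overflow > 0:
        return BASE_COST + overflow  # always worse than materializing nothing
    return BASE_COST - sum(c[2] for c, b in zip(CANDIDATES, bits) if b)


def evolve(generations=60, pop_size=20, mutation_rate=0.1, seed=0):
    """Simple GA: tournament selection, one-point crossover, bit-flip mutation."""
    rng = random.Random(seed)
    n = len(CANDIDATES)
    pop = [repair([rng.randint(0, 1) for _ in range(n)]) for _ in range(pop_size)]
    pop[0] = [0] * n  # start with one known-feasible solution (nothing selected)
    best = min(pop, key=cost)
    for _ in range(generations):
        nxt = [best]  # elitism: the best solution always survives
        while len(nxt) < pop_size:
            p1 = min(rng.sample(pop, 2), key=cost)
            p2 = min(rng.sample(pop, 2), key=cost)
            cut = rng.randrange(1, n)
            child = p1[:cut] + p2[cut:]
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            nxt.append(repair(child))
        pop = nxt
        best = min(pop, key=cost)
    return best


if __name__ == "__main__":
    solution = evolve()
    picked = [c[0] for c, b in zip(CANDIDATES, solution) if b]
    print(picked, total_size(solution), cost(solution))
```

A view-only variant in the spirit of VSnoISGA could be approximated in this sketch by dropping the index entries from `CANDIDATES`. Comparing the two on the same `STORAGE_LIMIT` mirrors the comparison described above, though only under the simplified cost model assumed here.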
Bibliographical Information:


School: University of Cincinnati

School Location: USA - Ohio

Source Type: Master's Thesis

Keywords: data warehousing, materialized views, genetic algorithms, index selection, space allocation


Date of Publication: 01/01/2002
