D-CAPE -- A self-tuning continuous query plan distribution architecture
Abstract (Summary)
The study of systems for querying data streams, coined Data Stream Management
Systems (DSMS), has gained in popularity over the last several years. This new area of
research for the database community includes studies in areas such as Sensor Networks,
Network Intrusion, and monitoring data such as Medicine, Stock, or Weather feeds. With
this new popularity comes increased performance expectations, with increased data sizes
and speed and larger more complex query plans as well as high volumes of possibly small
queries. Due to the finite resources on a single query processor, future Data Stream Management
Systems must distribute their workload to multiple query processors, working
together in a synchronized manner.
This thesis discusses a new Distributed Continuous Query System (D-CAPE) developed
here at WPI that has the ability to distribute query plans over a large cluster of
machines. We describe the architecture of the new system, policies for query plan distribution
to improve overall performance, as well as techniques for self-tuning query plan
re-distribution. D-CAPE is designed to be as flexible as possible for future research. We
include a multi-tiered architecture that scales to a large number of query processors. D-
CAPE has also been designed to minimize the cost of the communications network by
bundling synchronization messages, thus minimizing packets sent between query processors.
These messages are also incremental at run-time to aid in minimizing the communication
cost of D-CAPE. The architecture allows for the flexible incorporation of different
distribution algorithms and operator reallocation policies.. D-CAPE provides an operator
reallocation algorithm that is able to seamlessly move an operator(s) across any query
processors in our computing cluster. We do so by creating “pipes” between query processors
to allow the data streams to flow, and then filling these pipes with data streams once
execution begins. Operator redistribution is accomplished by systematically reconnecting
these pipes as to not interrupt the data flow.
Experimental evaluation using our real prototype system (not just simulation) shows
that executing a query plan distributed over multiple machines causes no more overhead
than processing it on a single centralized query processor; even for rather lightly loaded
machines. Further, we find that distributing a query plan among a cluster of query processors
can boost performance up to twice that of a centralized DSMS. We conclude that
the limitation of each query processor within the distributed network of cooperating processors
is not primarily in the volume of the data nor the number of query operators, but
rather the number of data connections per processor and the allocation of the stateful and
thus most costly operators. We also find that the overhead of distributing query operators
is very low, allowing for a potentially frequent dynamic redistribution of query plans
during execution.
ii
Bibliographical Information:
Advisor:
School:Worcester Polytechnic Institute
School Location:USA - Massachusetts
Source Type:Master's Thesis
Keywords:query languages computer science streaming technology telecommunications data transmission systems
ISBN:
Date of Publication: