The dream of computing power as readily available as the electricity in a wall socket is coming closer to reality with the arrival of grid and cloud computing. At the sametime, databases grow to sizes beyond what can be efficiently managed by single server systems. There is a need for efficient distributed database management systems (DBMSs). Current distributed DBMSs are not built to scale to more than tensor hundreds of sites (i.e., nodes or computers). Users of grid and cloud computingexpect not only almost innite scalability, i.e., at least to thousands of sites, but alsothat the scale is adapted automatically to meet the demand, whether it increases or decreases. This is a challenge to current distributed DBMSs.
In this thesis, the focus is on how to improve performance of query processingin large distributed DBMSs where coordination between sites has been reduced inorder to increase scalability. The challenge is for the sites to make decisions thatare globally benecial when their view of the complete system is limited. The main contributions of this thesis are methods to increase failure resilience of aggregation queries, adaptively place data on dierent sites and locate these sites afterwards,and cache intermediate results of query processing.
The study of failure resilience in aggregation queries presented in this thesisshows that dierent aggregation functions react dierently to failures and that countermeasures must be adapted to each function. A low-cost method to increase accuracyis proposed.
The dynamic data placement method presented in this thesis allows data to befragmented, allocated, and replicated to adapt to the current system conguration and workload. Fragments are split, coalesced, reallocated, and replicated during query processing to improve query processing performance by allowing more data to be accessed locally. The proposed look up method uses range indexing to make it possible to efficiently identify the sites that store relevant data for a query with low overhead when data is updated.
During query execution, a number of intermediate results are produced, and this thesis proposes a method to cache these results and use them to answer other,similar queries. In particular, a caching method to improve execution times of top-kqueries is presented.
Results of experiments in simulators and on an implementation in the DASCOSADB distributed DBMS prototype show that these methods lead to signicant savings in query execution time.
Tapir Uttrykk , 2011.