Change search
ReferencesLink to record
Permanent link

Direct link
Scalable Parallelization of Expensive Continuous Queries over Massive Data Streams
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Computing Science. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computing Science. (UDBL)
2011 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Numerous applications in for example science, engineering, and financial analysis increasingly require online analysis over streaming data. These data streams are often of such a high rate that saving them to disk is not desirable or feasible. Therefore, search and analysis must be performed directly over the data in motion. Such on-line search and analysis can be expressed as continuous queries (CQs) that are defined over the streams. The result of a CQ is a stream itself, which is continuously updated as new data appears in the queried stream(s). In many cases, the applications require non-trivial analysis, leading to CQs involving expensive processing. To provide scalability of such expensive CQs over high-volume streams, the execution of the CQs must be parallelized.

In order to investigate different approaches to parallel execution of CQs, a parallel data stream management system called SCSQ was implemented for this Thesis. Data and queries from space physics and traffic management applications are used in the evaluations, as well as synthetic data and the standard data stream benchmark; the Linear Road Benchmark. Declarative parallelization functions are introduced into the query language of SCSQ, allowing the user to specify customized parallelization. In particular, declarative stream splitting functions are introduced, which split a stream into parallel sub-streams, over which expensive CQ operators are continuously executed in parallel.

Naïvely implemented, stream splitting becomes a bottleneck if the input streams are of high volume, if the CQ operators are massively parallelized, or if the stream splitting conditions are expensive. To eliminate this bottleneck, different approaches are investigated to automatically generate parallel execution plans for stream splitting functions. This Thesis shows that by parallelizing the stream splitting itself, expensive CQs can be processed at stream rates close to network speed. Furthermore, it is demonstrated how parallelized stream splitting allows orders of magnitude higher stream rates than any previously published results for the Linear Road Benchmark.

 

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2011. , 35 p.
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 836
National Category
Computer Science
Research subject
Computer Science with specialization in Database Technology
Identifiers
URN: urn:nbn:se:uu:diva-152255ISBN: 978-91-554-8095-0OAI: oai:DiVA.org:uu-152255DiVA: diva2:413226
Public defence
2011-09-20, Auditorium Minus, Museum Gustavianum, Akademigatan 3, Uppsala, 13:15 (English)
Opponent
Supervisors
Available from: 2011-06-10 Created: 2011-04-27 Last updated: 2014-07-21
List of papers
1. Processing High-Volume Stream Queries on a Supercomputer
Open this publication in new window or tab >>Processing High-Volume Stream Queries on a Supercomputer
2006 (English)In: Processing High-Volume Stream Queries on a Supercomputer, 2006, 147- p.Conference paper (Refereed)
Abstract [en]

Scientific instruments, such as radio telescopes, colliders, sensor networks, and simulators generate very high volumes of data streams that scientists analyze to detect and understand physical phenomena. The high data volume and the need for advanced computations on the streams require substantial hardware resources and scalable stream processing. We address these challenges by developing data stream management technology to support high-volume stream queries utilizing massively parallel computer hardware. We have developed a data stream management system prototype for state-of-the-art parallel hardware. The performance evaluation uses real measurement data from LOFAR, a radio telescope antenna array being developed in the Netherlands.

National Category
Computer Science
Identifiers
urn:nbn:se:uu:diva-20450 (URN)0-7695-2571-7 (ISBN)
Available from: 2006-12-22 Created: 2006-12-22 Last updated: 2011-07-04
2. Using stream queries to measure communication performance of a parallel computing environment
Open this publication in new window or tab >>Using stream queries to measure communication performance of a parallel computing environment
2007 (English)In: Proc. 1st International Workshop on Distributed Event Processing, Systems and Applications (DEPSA’07), 2007Conference paper (Refereed)
National Category
Computer Science
Identifiers
urn:nbn:se:uu:diva-10187 (URN)
Available from: 2008-02-04 Created: 2008-02-04 Last updated: 2011-07-04Bibliographically approved
3. Highly Scalable Trip Grouping for Large Scale Collective Transportation Systems
Open this publication in new window or tab >>Highly Scalable Trip Grouping for Large Scale Collective Transportation Systems
2008 (English)In: Proc. 11th International Conference on Extending Database Technology, EDBT 2008, 2008Conference paper (Refereed)
Abstract [en]

Transportation–related problems, like road congestion, park-ing, and pollution are increasing in most cities. In order toreduce traffic, recent work has proposed methods for vehiclesharing, for example for sharing cabs by grouping “closeby”cab requests and thus minimizing transportation cost andutilizing cab space. However, the methods proposed so fardo not scale to large data volumes, which is necessary tofacilitate large scale collective transportation systems, e.g.,ride–sharing systems for large cities.This paper presents highly scalable “trip grouping” algo-rithms, that generalize previous techniques and support in-put rates that can be orders of magnitude larger. The follow-ing three contributions make the grouping algorithms scal-able. First, the basic grouping algorithm is expressed as acontinuous stream query in a data stream management sys-tem to allow for very large flows of requests. Second, follow-ing the divide–and–conquer paradigm, four space–partition-ing policies for dividing the input data stream into sub–streams are developed and implemented using continuousstream queries. Third, using the partitioning policies, par-allel implementations of the grouping algorithm in a paral-lel computing environment are described. Extensive experi-mental results show that the parallel implementation usingsimple adaptive partitioning methods can achieve speed–upsof several orders of magnitudes without significantly effect-ing the quality of the grouping.

National Category
Computer Science Computer Science
Identifiers
urn:nbn:se:uu:diva-103534 (URN)
Projects
iStreams
Available from: 2009-05-19 Created: 2009-05-19 Last updated: 2011-07-04Bibliographically approved
4. Scalable Splitting of Massive Data Streams
Open this publication in new window or tab >>Scalable Splitting of Massive Data Streams
2010 (English)In: Database Systems for Advanced Applications: Part II / [ed] Kitagawa H., Ishikawa Y., Li Q., Watanabe C., Berlin: Springer-Verlag , 2010, 184-198 p.Conference paper (Refereed)
Abstract [en]

Scalable execution of continuous queries over massive data streams often requires splitting input streams into parallel sub-streams over which query operators are executed in parallel. Automatic stream splitting is in general very difficult, as the optimal parallelization may depend on application semantics. To enable application specific stream splitting, we introduce splitstream functions where the user specifies non-procedural stream partitioning and replication. For high-volume streams, the stream splitting itself becomes a performance bottleneck. A cost model is introduced that estimates the performance of splitstream functions with respect to throughput and CPU usage. We implement parallel splitstream functions, and relate experimental results to cost model estimates. Based on the results, a splitstream function called autosplit is proposed, which scales well for high degrees of parallelism, and is robust for varying proportions of stream partitioning and replication. We show how user defined parallelization using autosplit provides substantially improved scalability (L = 64) over previously published results for the Linear Road Benchmark.

Place, publisher, year, edition, pages
Berlin: Springer-Verlag, 2010
Series
, Lecture Notes in Computer Science, ISSN 0302-9743 ; 5982
Keyword
distributed stream systems, parallelization, query optimization
National Category
Computer Science
Identifiers
urn:nbn:se:uu:diva-136403 (URN)10.1007/978-3-642-12098-5_15 (DOI)000278934800015 ()978-3-642-12097-8 (ISBN)
Conference
15th International Conference, DASFAA 2010, Tsukuba, Japan, April 1-4, 2010
Projects
iStreamseSSENCE
Available from: 2010-12-14 Created: 2010-12-13 Last updated: 2011-07-04Bibliographically approved
5. Massive scale-out of expensive continuous queries
Open this publication in new window or tab >>Massive scale-out of expensive continuous queries
2011 (English)In: 36th International Conference on Very Large Data Bases: VLDB 2010, 2011Conference paper (Refereed)
Abstract [en]

Scalable execution of expensive continuous queries over massive data streams requires input streams to be split into parallel sub-streams. The query operators are continuously executed in parallel over these sub-streams. Stream splitting involves both partitioning and replication of incoming tuples, depending on how the continuous query is parallelized. We provide a stream splitting operator that enables such customized stream splitting. However, it is critical that the stream splitting itself keeps up with input streams of high volume. This is a problem when the stream splitting predicates have some costs. Therefore, to enable customized splitting of high-volume streams, we introduce a parallelized stream splitting operator, called parasplit. We investigate the performance of parasplit using a cost model and experimentally. Based on these results, a heuristic is devised to automatically parallelize the execution of parasplit. We show that the maximum stream rate of parasplit approaches network speed, and that the parallelization is resource efficient. Finally, the scalability of our approach is experimentally demonstrated on the Linear Road Benchmark, showing an order of magnitude higher stream processing rate over previously published results, allowing at least 512 expressways.

National Category
Computer Science Computer Science
Research subject
Computer Science with specialization in Database Technology
Identifiers
urn:nbn:se:uu:diva-152251 (URN)
Projects
iStreams
Available from: 2011-04-27 Created: 2011-04-27 Last updated: 2011-07-04Bibliographically approved

Open Access in DiVA

fulltext(1686 kB)637 downloads
File information
File name FULLTEXT01.pdfFile size 1686 kBChecksum SHA-512
a97fd41d58ad6125b2bf3c544b4ab09f76cfe6f5d2c49ae99927d47defda96431b78534ffa7d21c6a1028de1b944d60b8ab9ae4be5a50438d938c43a791477a2
Type fulltextMimetype application/pdf
Buy this publication >>

Search in DiVA

By author/editor
Zeitler, Erik
By organisation
Division of Computing ScienceComputing Science
Computer Science

Search outside of DiVA

GoogleGoogle Scholar
Total: 637 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

Total: 1361 hits
ReferencesLink to record
Permanent link

Direct link