Performance Optimization Techniques and Tools for Data-Intensive Computation Platforms: An Overview of Performance Limitations in Big Data Systems and Proposed Optimizations
KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS. ORCID iD: 0000-0001-8219-4862
2014 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

Big data processing has recently gained a lot of attention from both academia and industry. The term refers to tools, methods, techniques and frameworks built to collect, store, process and analyze massive amounts of data. Big data can be structured, unstructured or semi-structured. Data is generated from various sources and can arrive in the system at various rates. In order to process these large amounts of heterogeneous data in an inexpensive and efficient way, massive parallelism is often used. The common architecture of a big data processing system consists of a shared-nothing cluster of commodity machines. However, even in such a highly parallel setting, processing is often very time-consuming. Applications may take hours or even days to produce useful results, making interactive analysis and debugging cumbersome.

One of the main problems is that good performance requires both good data locality and good resource utilization. A characteristic of big data analytics is that the amount of data that is processed is typically large in comparison with the amount of computation done on it. In this case, processing can benefit from data locality, which can be achieved by moving the computation close to the data, rather than vice versa. Good utilization of resources means that the data processing is done with maximal parallelization. Both locality and resource utilization are aspects of the programming framework’s runtime system. Requiring the programmer to work explicitly with parallel process creation and process placement is not desirable. Thus, specifying optimizations that relieve the programmer from low-level, error-prone instrumentation is essential for achieving good performance.

The main goal of this thesis is to study, design and implement performance optimizations for big data frameworks. This work contributes methods and techniques to build tools for easy and efficient processing of very large data sets. It describes ways to make systems faster, by inventing ways to shorten job completion times. Another major goal is to facilitate the application development in distributed data-intensive computation platforms and make big-data analytics accessible to non-experts, so that users with limited programming experience can benefit from analyzing enormous datasets.

The thesis provides results from a study of existing optimizations in MapReduce and Hadoop related systems. The study presents a comparison and classification of existing systems, based on their main contribution. It then summarizes the current state of the research field and identifies trends and open issues, while also providing our vision on future directions.

Next, this thesis presents a set of performance optimization techniques and corresponding tools for data-intensive computing platforms:

PonIC, a project that ports the high-level dataflow framework Pig on top of the data-parallel computing framework Stratosphere. The results of this work show that Pig can benefit greatly from using Stratosphere as the backend system and gain performance, without any loss of expressiveness. The work also identifies the features of Pig that negatively impact execution time and presents a way of integrating Pig with different backends.

HOP-S, a system that uses in-memory random sampling to return approximate, yet accurate query answers. It uses a simple yet efficient implementation of a random sampling technique, which significantly improves the accuracy of online aggregation.
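The record gives no implementation details for the sampling step; as an illustration of the general idea only, uniform in-memory random sampling can be used to answer an aggregate query approximately before the full input is processed. The names below are hypothetical, not HOP-S's API; this is a minimal sketch of one standard technique (reservoir sampling):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Maintain a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Each arriving item replaces a sampled one with probability k/(i+1),
            # which keeps every item equally likely to end up in the sample.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

def approximate_mean(stream, k=1000):
    """Early, approximate answer to an AVG-style query from a uniform sample."""
    sample = reservoir_sample(stream, k)
    return sum(sample) / len(sample)
```

Because the sample is uniform rather than a sequential prefix of the input, the early estimate is far less sensitive to the order in which data was written.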

An optimization that exploits computation redundancy in analysis programs, and m2r2, a system that stores intermediate results and uses plan matching and rewriting in order to reuse results in future queries. Our prototype on top of the Pig framework demonstrates significantly reduced query response times.
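The m2r2 prototype itself is not reproduced in this record; a minimal sketch of the underlying idea, materializing query results keyed by a canonicalized plan so that repeated sub-queries are answered from the store, could look like the following. The class and helper names are illustrative, not m2r2's API, and real plan matching is far more involved than this normalization:

```python
import hashlib

class ResultCache:
    """Sketch of results materialization and reuse: cache query outputs keyed
    by a canonical form of the plan, so identical sub-queries are not re-run."""

    def __init__(self):
        self._store = {}

    def _key(self, plan, dataset_version):
        # Normalize operator strings before hashing so trivially different but
        # equivalent plans match; tie the key to the dataset version so stale
        # results are never reused after the input changes.
        canonical = "|".join(op.strip().lower() for op in plan)
        return hashlib.sha256(f"{canonical}@{dataset_version}".encode()).hexdigest()

    def run(self, plan, dataset_version, execute):
        key = self._key(plan, dataset_version)
        if key not in self._store:
            self._store[key] = execute()  # cache miss: execute and materialize
        return self._store[key]
```

A garbage-collection policy (which m2r2 also handles) would evict entries whose storage cost outweighs their expected reuse benefit.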

Finally, an optimization framework for iterative fixed points, which exploits asymmetry in large-scale graph analysis. The framework uses a mathematical model to explain several optimizations and to formally specify the conditions under which optimized iterative algorithms are equivalent to the general solution.
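The formal model is given in the corresponding paper; the core optimization, re-evaluating only the "active" part of the parameter set in each step until a fixed point is reached, can be sketched generically as follows. This is an assumed, simplified active-set iteration, not the framework's actual API:

```python
from collections import defaultdict

def incremental_fixed_point(vertices, edges, initial, update):
    """Iterate to a fixed point, re-processing only vertices whose value
    changed in the previous step ('active' vertices) instead of the full set."""
    value = {v: initial(v) for v in vertices}
    out = defaultdict(list)
    for src, dst, w in edges:
        out[src].append((dst, w))
    active = set(vertices)
    while active:
        next_active = set()
        for v in active:
            for dst, w in out[v]:
                candidate = update(value[v], w)
                if candidate < value[dst]:   # only changes propagate further
                    value[dst] = candidate
                    next_active.add(dst)
        active = next_active
    return value
```

For algorithms such as single-source shortest paths, most vertices become inactive early, so skipping them avoids the redundant work that a full re-evaluation per iteration would incur.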

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2014. 37 p.
Series
TRITA-ICT-ECS AVH, ISSN 1653-6363 ; 14:11
Keyword [en]
performance optimization, data-intensive computing, big data
National Category
Engineering and Technology
Research subject
Information and Communication Technology
Identifiers
URN: urn:nbn:se:kth:diva-145329
ISBN: 978-91-7595-143-0 (print)
OAI: oai:DiVA.org:kth-145329
DiVA: diva2:717735
Presentation
2014-06-11, Sal D, KTH - ICT, Isafjordsgatan 39, Kista, 10:00 (English)
Opponent
Supervisors
Note

QC 20140605

Available from: 2014-06-05 Created: 2014-05-16 Last updated: 2014-06-05. Bibliographically approved
List of papers
1. PonIC: Using Stratosphere to Speed Up Pig Analytics
2013 (English). In: Euro-Par 2013 Parallel Processing: 19th International Conference, Aachen, Germany, August 26-30, 2013. Proceedings, Springer Berlin/Heidelberg, 2013, pp. 279-290. Conference paper, Published paper (Refereed)
Abstract [en]

Pig, a high-level dataflow system built on top of Hadoop MapReduce, has greatly facilitated the implementation of data-intensive applications. Pig successfully manages to conceal the limitations of Hadoop's inflexible one-input, two-stage pipeline, by translating scripts into MapReduce jobs. However, these limitations are still present in the backend, often resulting in inefficient execution. Stratosphere, a data-parallel computing framework consisting of PACT, an extension to the MapReduce programming model, and the Nephele execution engine, overcomes several limitations of Hadoop MapReduce. In this paper, we argue that Pig can benefit greatly from using Stratosphere as the backend system and gain performance, without any loss of expressiveness. We have ported Pig on top of Stratosphere and we present a process for translating Pig Latin scripts into PACT programs. Our evaluation shows that Pig Latin scripts can execute on our prototype up to 8 times faster for a certain class of applications.

Place, publisher, year, edition, pages
Springer Berlin/Heidelberg, 2013
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 8097
Keyword
big data, data analytics, programming systems
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-129072 (URN)
10.1007/978-3-642-40047-6_30 (DOI)
000341243100030 ()
2-s2.0-84883160941 (Scopus ID)
978-3-642-40046-9 (ISBN)
Conference
19th International Conference on Parallel Processing, Euro-Par 2013; Aachen, Germany, 26 August - 30 August 2013
Projects
SSF project End-to-End Clouds (E2E Clouds)
Erasmus Mundus Joint Doctorate in Distributed Computing (EMJD-DC)
Funder
Swedish Foundation for Strategic Research, RIT10-0043
Note

QC 20131017

Available from: 2013-09-19 Created: 2013-09-19 Last updated: 2014-10-03. Bibliographically approved
2. MapReduce: Limitations, optimizations and open issues
2013 (English). In: Proceedings - 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom 2013, IEEE, 2013, pp. 1031-1038. Conference paper, Published paper (Refereed)
Abstract [en]

MapReduce has recently gained great popularity as a programming model for processing and analyzing massive data sets and is extensively used by academia and industry. Several implementations of the MapReduce model have emerged, the Apache Hadoop framework being the most widely adopted. Hadoop offers various utilities, such as a distributed file system, job scheduling and resource management capabilities and a Java API for writing applications. Hadoop's success has sparked research interest and has led to various modifications and extensions to the framework. Implemented optimizations include performance improvements, programming model extensions, tuning automation and usability enhancements. In this paper, we discuss the current state of the Hadoop framework and its identified limitations. We present, compare and classify Hadoop/MapReduce variations, and identify trends, open issues and possible future directions.

Place, publisher, year, edition, pages
IEEE, 2013
Series
IEEE International Conference on Trust Security and Privacy in Computing and Communications, ISSN 2324-898X
Keyword
Big Data, MapReduce, Survey
National Category
Information Systems
Identifiers
urn:nbn:se:kth:diva-143846 (URN)
10.1109/TrustCom.2013.126 (DOI)
000332856700131 ()
2-s2.0-84893439928 (Scopus ID)
978-076955022-0 (ISBN)
Conference
12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom 2013; Melbourne, VIC; Australia; 16 July 2013 through 18 July 2013
Note

QC 20140415

Available from: 2014-04-15 Created: 2014-03-31 Last updated: 2014-06-05. Bibliographically approved
3. m2r2: A Framework for Results Materialization and Reuse in High-Level Dataflow Systems for Big Data
2013 (English). Conference paper, Published paper (Refereed)
Abstract [en]

High-level parallel dataflow systems, such as Pig and Hive, have lately gained great popularity in the area of big data processing. These systems often consist of a declarative query language and a set of compilers, which transform queries into execution plans and submit them to a distributed engine for execution. Apart from the useful abstraction and support for common analysis operations, high-level processing systems also offer great opportunities for automatic optimizations. Existing studies on execution traces from big data centers and industrial clusters show that there is significant computation redundancy in analysis programs, i.e., there exist similar or even identical queries on the same datasets in different jobs. Furthermore, workload characterization of MapReduce traces from large organizations suggests that there is a significant need for caching job results, which would enable their reuse and improve execution time. In this paper, we propose m2r2, an extensible and language-independent framework for results materialization and reuse in high-level dataflow systems for big data analytics. Our prototype implementation is built on top of the Pig dataflow system and handles automatic results caching, common sub-query matching and rewriting, as well as garbage collection. We have evaluated m2r2 using the TPC-H benchmark for Pig and report reduced query execution time by 65% on average.

Keyword
computation redundancies, materialization, results reuse
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-145334 (URN)
10.1109/CSE.2013.134 (DOI)
000351950300126 ()
2-s2.0-84900371928 (Scopus ID)
Conference
2nd International Conference on Big Data Science and Engineering (BDSE 2013)
Note

QC 20140624

Available from: 2014-05-16 Created: 2014-05-16 Last updated: 2014-06-24. Bibliographically approved
4. Asymmetry in Large-Scale Graph Analysis, Explained
2014 (English). In: Proceedings of the Second International Workshop on Graph Data Management Experiences and Systems (GRADES 2014), June 22, 2014, Snowbird, Utah, USA, 2014. Conference paper, Published paper (Refereed)
Abstract [en]

Iterative computations are at the core of large-scale graph processing. In these applications, a set of parameters is continuously refined, until a fixed point is reached. Such fixed point iterations often exhibit non-uniform computational behavior, where changes propagate with different speeds throughout the parameter set, making parts of it active or inactive during iterations. This asymmetrical behavior can lead to many redundant computations, if not exploited. Many specialized graph processing systems and APIs exist that run iterative algorithms efficiently by exploiting this asymmetry. However, their functionality is sometimes vaguely defined and, due to their different programming models and terminology, it is often challenging to establish equivalence between them. We describe an optimization framework for iterative graph processing, which utilizes dataset dependencies. We explain several optimization techniques that exploit the asymmetrical behavior of graph algorithms. We formally specify the conditions under which an algorithm can use a certain technique. We also design template execution plans, using a canonical set of dataflow operators, and we evaluate them using real-world datasets and applications. Our experiments show that optimized plans can significantly reduce execution time, often by an order of magnitude. Based on our experiments, we identify a trade-off that can be easily captured and could serve as the basis for automatic optimization of large-scale graph-processing applications.

National Category
Engineering and Technology; Computer Systems
Identifiers
urn:nbn:se:kth:diva-145335 (URN)
2-s2.0-84905640971 (Scopus ID)
Conference
2nd International Workshop on Graph Data Management Experiences and Systems, GRADES 2014 - Co-located with SIGMOD/PODS 2014; Snowbird, UT, United States, 22 June 2014 - 27 June 2014
Note

QC 20150309

Available from: 2014-05-16 Created: 2014-05-16 Last updated: 2017-04-28. Bibliographically approved
5. Block Sampling: Efficient Accurate Online Aggregation in MapReduce
2013 (English). In: Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on, 2013, pp. 250-257. Conference paper, Published paper (Refereed)
Abstract [en]

Large-scale data processing frameworks, such as Hadoop MapReduce, are widely used to analyze enormous amounts of data. However, processing is often time-consuming, preventing interactive analysis. One way to decrease response time is partial job execution, where an approximate, early result becomes available to the user, prior to job completion. The Hadoop Online Prototype (HOP) uses online aggregation to provide early results, by partially executing jobs on subsets of the input, using a simplistic progress metric. Due to its sequential nature, values are not objectively represented in the input subset, often resulting in poor approximations or "data bias". In this paper, we propose a block sampling technique for large-scale data processing, which can be used for fast and accurate partial job execution. Our implementation of the technique on top of HOP uniformly samples HDFS blocks and uses in-memory shuffling to reduce data bias. Our prototype significantly improves the accuracy of HOP's early results, while only introducing minimal overhead. We evaluate our technique using real-world datasets and applications and demonstrate that our system outperforms HOP in terms of accuracy. In particular, when estimating the average temperature of the studied dataset, our system provides high accuracy (less than 20% absolute error) after processing only 10% of the input, while HOP needs to process 70% of the input to yield comparable results.
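As an illustration of the technique described above, uniform block selection followed by an in-memory shuffle of the selected records, the following sketch may help. The function names are hypothetical and the blocks here are plain Python lists; the real system operates on HDFS blocks inside the Hadoop Online Prototype:

```python
import random

def block_sample(blocks, fraction, rng=None):
    """Sketch of block sampling: pick a uniform subset of input blocks, then
    shuffle the selected records in memory so that early partial results are
    not biased by the order in which the data was written."""
    rng = rng or random.Random()
    k = max(1, int(len(blocks) * fraction))
    chosen = rng.sample(blocks, k)          # uniform choice of blocks
    records = [rec for block in chosen for rec in block]
    rng.shuffle(records)                    # in-memory shuffle reduces data bias
    return records

def early_estimate(blocks, fraction, rng=None):
    """Approximate mean computed over a uniform sample of blocks."""
    sample = block_sample(blocks, fraction, rng)
    return sum(sample) / len(sample)
```

Sampling whole blocks rather than individual records keeps I/O sequential, while the shuffle restores the randomness needed for an unbiased early aggregate.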

Keyword
MapReduce, online aggregation, sampling, approximate results
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-145332 (URN)10.1109/CloudCom.2013.40 (DOI)000352075800035 ()2-s2.0-84899738915 (Scopus ID)
Conference
5th IEEE International Conference on Cloud Computing Technology and Science (CloudCom) 2013
Note

QC 20140624

Available from: 2014-05-16 Created: 2014-05-16 Last updated: 2014-06-24. Bibliographically approved

Open Access in DiVA

VKalavri_LicThesis.pdf (4943 kB)
File information
File name: FULLTEXT01.pdf
File size: 4943 kB
Checksum (SHA-512): 4e752f912f3ba5ff031b17978ab025d9f90449b794164a7f57763c70eaa642273b6ed7b4ba055010dd2a3a30bd38c89b4ec86efc50875b50e14e2a917744f765
Type: fulltext
Mimetype: application/pdf

By author/editor: Kalavri, Vasiliki
By organisation: Software and Computer systems, SCS
