Techniques and applications of early approximate results for big-data analytics
KTH, School of Information and Communication Technology (ICT).
2013 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

The amount of data processed by large-scale data processing frameworks is overwhelming. To improve efficiency, such frameworks employ data parallelization and similar techniques. Expectations are growing, however: near real-time data analysis is now desired. MapReduce is one of the most common large-scale data processing models. Because of its batch-processing nature, results are returned only after job execution has finished. As data volumes grow, a batch operating environment is not always preferred: a large number of applications can take advantage of early approximate results. This need was first addressed by the online aggregation technique, applied to relational databases. Recently, online aggregation has been adapted for the MapReduce programming model, but with a focus on technical rather than data-processing details. In this thesis project we give an overview of the techniques that can enable early estimation of results. We propose several modifications to the MapReduce Online framework. We show that our proposed system design changes possess the properties required for accurate estimation of results. We present an algorithm for data bias reduction and block-level sampling. We then describe the implementation of our proposed system design and evaluate it with a number of selected applications and datasets. With our system, a user can calculate the average temperature of a 100 GB weather dataset six times faster than with complete job execution, with an error as low as 2%.
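The idea of an early approximate result from block-level sampling can be illustrated with a minimal Python sketch. This is not the algorithm from the thesis and does not use MapReduce Online. The function `early_average`, its parameters, and the sample data are all hypothetical. It assumes the dataset is already split into blocks, reads a random subset of them, and returns the mean of the values seen so far.

```python
import math
import random


def early_average(blocks, sample_fraction, seed=0):
    """Estimate the mean of all values from a random subset of blocks.

    Hypothetical helper (not from the thesis): each block is one sampling
    unit, and only a fraction of the blocks is read before answering.
    """
    if not blocks:
        raise ValueError("blocks must not be empty")
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")

    rng = random.Random(seed)
    order = list(range(len(blocks)))
    # Visit blocks in random order so that the on-disk layout of the data
    # (for example, records sorted by date) does not bias the estimate.
    rng.shuffle(order)
    n_blocks = max(1, math.ceil(sample_fraction * len(blocks)))

    total, count = 0.0, 0
    for idx in order[:n_blocks]:
        total += sum(blocks[idx])
        count += len(blocks[idx])
    return total / count


if __name__ == "__main__":
    # Made-up temperature readings, grouped into four blocks.
    readings = [[10.0, 12.0], [14.0, 16.0], [18.0, 20.0], [11.0, 13.0]]
    exact = early_average(readings, 1.0)
    early = early_average(readings, 0.5)
    print("exact:", exact, "early:", early,
          "relative error:", abs(early - exact) / abs(exact))
```

Reading half of the blocks does half of the work, at the cost of an estimation error that depends on how much the blocks differ from one another. A real system would also report a confidence bound alongside the estimate.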

Place, publisher, year, edition, pages
2013. 74 p.
Series
TRITA-ICT-EX, 2013:185
National Category
Computer and Information Science
Identifiers
URN: urn:nbn:se:kth:diva-141655
OAI: oai:DiVA.org:kth-141655
DiVA: diva2:697407
Educational program
Master of Science - Distributed Computing
Examiners
Available from: 2014-02-20. Created: 2014-02-18. Last updated: 2014-02-20. Bibliographically approved.

Open Access in DiVA

fulltext (9427 kB), 138 downloads
File information
File name: FULLTEXT01.pdf
File size: 9427 kB
Checksum (SHA-512): 3aaad2053b3a911c720cbce8232617c2c45abd3d5ecb9a24a71511b7b857f7ed4676139994aa782abe2fe5dffdc037f8be511540b26e9591dda9e55d774aad34
Type: fulltext
Mimetype: application/pdf
