Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
How Data Volume Affects Spark Based Data Analytics on a Scale-up Server
KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.ORCID iD: 0000-0002-7510-6286
KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.ORCID iD: 0000-0002-9637-2065
KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
Technical University of Catalunya, Barcelona Super Computing Center.
2015 (English)In: Big Data Benchmarks, Performance Optimization, and Emerging Hardware: 6th Workshop, BPOE 2015, Kohala, HI, USA, August 31 - September 4, 2015. Revised Selected Papers, Springer, 2015, Vol. 9495, 81-92 p.Conference paper, Published paper (Refereed)
Abstract [en]

Sheer increase in volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for exhibiting superior scale-out performance on the commodity machines, the impact of data volume on the performance of Spark based data analytics in scale-up configuration is not well understood. We present a deep-dive analysis of Spark based applications on a large scale-up server machine. Our analysis reveals that Spark based data analytics are DRAM bound and do not benefit by using more than 12 cores for an executor. By enlarging input data size, application performance degrades significantly due to substantial increase in wait time during I/O operations and garbage collection, despite 10 % better instruction retirement rate (due to lower L1 cache misses and higher core utilization). We match memory behaviour with the garbage collector to improve performance of applications between 1.6x to 3x.

Place, publisher, year, edition, pages
Springer, 2015. Vol. 9495, 81-92 p.
Series
Lecture Notes in Computer Science
National Category
Computer Systems
Identifiers
URN: urn:nbn:se:kth:diva-181325DOI: 10.1007/978-3-319-29006-5_7Scopus ID: 2-s2.0-84958073801ISBN: 978-3-319-29005-8 (print)OAI: oai:DiVA.org:kth-181325DiVA: diva2:899225
Conference
6th International Workshop on Bigdata Benchmarks, Performance Optimization and Emerging Hardware (BpoE), held in conjunction with 41st International Conference on Very Large Data Bases (VLDB),Kohala, HI, USA, August 31 - September 4, 2015
Note

QC 20160224

Available from: 2016-02-01 Created: 2016-02-01 Last updated: 2016-04-25Bibliographically approved
In thesis
1. Performance Characterization of In-Memory Data Analytics on a Scale-up Server
Open this publication in new window or tab >>Performance Characterization of In-Memory Data Analytics on a Scale-up Server
2016 (English)Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

The sheer increase in volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark defines the state of the art in big data analytics platforms for (i) exploiting data-flow and in-memory computing and (ii) for exhibiting superior scale-out performance on the commodity machines, little effort has been devoted at understanding the performance of in-memory data analytics with Spark on modern scale-up servers. This thesis characterizes the performance of in-memory data analytics with Spark on scale-up servers.

Through empirical evaluation of representative benchmark workloads on a dual socket server, we have found that in-memory data analytics with Spark exhibit poor multi-core scalability beyond 12 cores due to thread level load imbalance and work-time inflation. We have also found that workloads are bound by the latency of frequent data accesses to DRAM. By enlarging input data size, application performance degrades significantly due to substantial increase in wait time during I/O operations and garbage collection, despite 10% better instruction retirement rate (due to lower L1 cache misses and higher core utilization).

For data accesses we have found that simultaneous multi-threading is effective in hiding the data latencies. We have also observed that (i) data locality on NUMA nodes can improve the performance by 10% on average, (ii) disabling next-line L1-D prefetchers can reduce the execution time by up-to 14%. For GC impact, we match memory behaviour with the garbage collector to improve performance of applications between 1.6x to 3x. and recommend to use multiple small executors that can provide up-to 36% speedup over single large executor.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2016. 111 p.
Series
TRITA-ICT, 2016:07
National Category
Computer Systems
Research subject
Information and Communication Technology
Identifiers
urn:nbn:se:kth:diva-185581 (URN)978-91-7595-926-9 (ISBN)
Presentation
2016-05-23, Ka-210, Electrum 229, Kista, Stockholm, 09:15 (English)
Opponent
Supervisors
Note

QC 20160425

Available from: 2016-04-25 Created: 2016-04-22 Last updated: 2017-03-02Bibliographically approved

Open Access in DiVA

fulltext(2219 kB)73 downloads
File information
File name FULLTEXT01.pdfFile size 2219 kBChecksum SHA-512
da01e3316474d87a9f4e3142cfd05c2956047f07810b27c3c03d16a07e9e4122f4bb818c293d9a32c78459a4710da3a76d5723dbd193fff53d01605e950e10cf
Type fulltextMimetype application/pdf

Other links

Publisher's full textScopusWorkshop website

Search in DiVA

By author/editor
Awan, Ahsan JavedBrorsson, MatsVlassov, Vladimir
By organisation
Software and Computer systems, SCS
Computer Systems

Search outside of DiVA

GoogleGoogle Scholar
Total: 73 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

doi
isbn
urn-nbn

Altmetric score

doi
isbn
urn-nbn
Total: 1173 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf