Performance Characterization and Optimization of In-Memory Data Analytics on a Scale-up Server
KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
2017 (English) Doctoral thesis, monograph (Other academic)
Abstract [en]

The sheer increase in the volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark defines the state of the art in big data analytics platforms for (i) exploiting data-flow and in-memory computing and (ii) exhibiting superior scale-out performance on commodity machines, little effort has been devoted to understanding the performance of in-memory data analytics with Spark on modern scale-up servers. This thesis characterizes the performance of in-memory data analytics with Spark on scale-up servers.

Through empirical evaluation of representative benchmark workloads on a dual-socket server, we have found that in-memory data analytics with Spark exhibits poor multi-core scalability beyond 12 cores due to thread-level load imbalance and work-time inflation (the additional CPU time spent by threads in a multi-threaded computation beyond the CPU time required to perform the same work in a sequential computation). We have also found that workloads are bound by the latency of frequent data accesses to memory. When the input data size is enlarged, application performance degrades significantly due to the substantial increase in wait time during I/O operations and garbage collection, despite a 10% better instruction retirement rate (due to fewer L1 cache misses and higher core utilization).

For data accesses, we have found that simultaneous multi-threading is effective in hiding data latencies. We have also observed that (i) data locality on NUMA nodes can improve performance by 10% on average, and (ii) disabling next-line L1-D prefetchers can reduce the execution time by up to 14%. For garbage collection impact, we match memory behavior with the garbage collector to improve the performance of applications by between 1.6x and 3x, and recommend using multiple small Spark executors, which can provide up to a 36% reduction in execution time over a single large executor.
Based on the characteristics of the workloads, the thesis envisions near-memory and near-storage hardware acceleration to improve the single-node performance of scale-out frameworks like Apache Spark. Using modeling techniques, it estimates a speed-up of 4x for Apache Spark on scale-up servers augmented with near-data accelerators.
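
The definition of work-time inflation given in the abstract can be made concrete with a small calculation. The following is only an illustrative sketch and is not taken from the thesis: the function names and the sample timings are hypothetical, and the ratio form (`inflation_factor`) is an added convenience rather than a metric the abstract defines.

```python
def work_time_inflation(thread_cpu_seconds, sequential_cpu_seconds):
    """Extra CPU time spent by all threads of a multi-threaded run,
    beyond the CPU time the same work needs when run sequentially."""
    if sequential_cpu_seconds <= 0:
        raise ValueError("sequential CPU time must be positive")
    return sum(thread_cpu_seconds) - sequential_cpu_seconds


def inflation_factor(thread_cpu_seconds, sequential_cpu_seconds):
    """Hypothetical helper: total threaded CPU time as a multiple of
    the sequential CPU time (1.0 means no inflation)."""
    if sequential_cpu_seconds <= 0:
        raise ValueError("sequential CPU time must be positive")
    return sum(thread_cpu_seconds) / sequential_cpu_seconds


# Made-up timings: four threads versus one sequential run of the same work.
threads = [30.0, 30.0, 25.0, 35.0]   # CPU seconds per thread
sequential = 100.0                   # CPU seconds, single thread

extra = work_time_inflation(threads, sequential)
factor = inflation_factor(threads, sequential)
print(f"inflation: {extra} CPU-seconds, factor: {factor:.2f}")
```

In this made-up case the four threads together consume 120 CPU-seconds for work that takes 100 CPU-seconds sequentially, so 20 CPU-seconds are inflation. The uneven per-thread times also hint at the thread-level load imbalance the abstract mentions.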

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2017. p. 176
Series
TRITA-ICT ; 2017:23
Keyword [en]
Workload Characterization, Big Data Analytics, Multicore Performance, Apache Spark, Near Data Processing, NUMA, Hyperthreading, Prefetchers, Coherently attached accelerators
National Category
Computer Systems
Research subject
Information and Communication Technology
Identifiers
URN: urn:nbn:se:kth:diva-217910
ISBN: 978-91-7729-584-6 (print)
OAI: oai:DiVA.org:kth-217910
DiVA, id: diva2:1158303
Public defence
2017-12-15, Sal-C, Electrum, Kungl Tekniska högskolan, Kistagången 16, Kista, 09:00 (English)
Opponent
Supervisors
Note

QC 20171121

Available from: 2017-11-21 Created: 2017-11-20 Last updated: 2017-11-21 Bibliographically approved

Open Access in DiVA

PhD_thesis_AJA (4865 kB), 228 downloads
File information
File name: FULLTEXT01.pdf
File size: 4865 kB
Checksum SHA-512: 094c113689f5eb2d2b34cdfb3b544b775da82d9ad760f9519f6e6a784bed4b8af15c1f2e363a4b380132c519b42012ee4ef7ae13f75ce307af47a18e47018a31
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Awan, Ahsan Javed
By organisation
Software and Computer systems, SCS
Computer Systems
