Improving OpenMP Productivity with Data Locality Optimizations and High-resolution Performance Analysis
KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS. ORCID iD: 0000-0003-3958-4659
2016 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

The combination of high-performance parallel programming and multi-core processors is the dominant approach to meeting the ever-increasing demand for computing performance today. The thesis centers on OpenMP, a popular parallel programming API standard that enables programmers to get started quickly with writing parallel programs. In contrast to this quick start, however, writing high-performance OpenMP programs requires high effort and saps productivity.

Part of the reason for impeded productivity is OpenMP's lack of abstractions and guidance for exploiting the strong architectural locality exhibited by NUMA systems and manycore processors. The thesis contributes data distribution abstractions that enable programmers to distribute data portably on NUMA systems and manycore processors without being aware of low-level system topology details. The data distribution abstractions are supported by the runtime system and leveraged by the second contribution of the thesis – an architecture-specific, locality-aware scheduling policy that reduces the data access latencies incurred by tasks, allowing programmers to obtain, with minimal effort, up to 69% better performance for scientific programs compared to state-of-the-art work-stealing scheduling.

Another reason for reduced programmer productivity is the poor support OpenMP performance analysis tools offer for visualizing, understanding, and resolving problems at the level of grains – task and parallel for-loop chunk instances. The thesis contributes a cost-effective and automatic method to extensively profile and visualize grains. Grain properties and hardware performance are profiled at event notifications from the runtime system with less than 2.5% overhead and visualized using a new method called the Grain Graph. The grain graph shows the program structure that unfolded during execution and highlights problems such as low parallelism, work inflation, and poor parallelization benefit directly at the grain level, with precise links to problem areas in source code. The thesis demonstrates that grain graphs can quickly reveal performance problems that are difficult to detect and characterize in fine detail using existing tools, in standard programs from SPEC OMP 2012, Parsec 3.0, and the Barcelona OpenMP Tasks Suite (BOTS). Grain profiles are also applied to study the input sensitivity and similarity of BOTS programs.

All thesis contributions are assembled into an iterative performance analysis and optimization workflow that enables programmers to achieve the desired performance systematically and more quickly than is possible with existing tools. This reduces pressure on experts and removes the need for tedious trial-and-error tuning, simplifying OpenMP performance analysis.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2016. 208 p.
Series
TRITA-ICT, 2016:1
Keyword [en]
OpenMP, Performance Analysis, Scheduling, Locality Optimizations
National Category
Computer Systems
Identifiers
URN: urn:nbn:se:kth:diva-179670
ISBN: 978-91-7595-818-7 (print)
OAI: oai:DiVA.org:kth-179670
DiVA: diva2:885520
Public defence
2016-01-29, Sal C, Electrum, Isafjordsgatan 26, Kista, 09:00 (English)
Note

QC 20151221

Available from: 2015-12-21. Created: 2015-12-18. Last updated: 2016-01-15. Bibliographically approved.
List of papers
1. Locality-aware Task Scheduling and Data Distribution for OpenMP Programs on NUMA Systems and Manycore Processors
2015 (English). In: Scientific Programming, ISSN 1058-9244, E-ISSN 1875-919X, article 981759. Article in journal (Refereed). Published.
Abstract [en]

Performance degradation due to nonuniform data access latencies has worsened on NUMA systems and can now be felt on-chip in manycore processors. Distributing data across NUMA nodes and on manycore processors is necessary to reduce the impact of nonuniform latencies. However, techniques for distributing data are error-prone and fragile and require low-level architectural knowledge. Existing task scheduling policies favor quick load balancing at the expense of locality and ignore NUMA node access latencies while scheduling. Locality-aware scheduling, in conjunction with or as a replacement for existing scheduling, is necessary to minimize NUMA effects and sustain performance. We present a data distribution and locality-aware scheduling technique for task-based OpenMP programs executing on NUMA systems and manycore processors. Our technique relieves the programmer from thinking about NUMA architecture details by delegating data distribution to the runtime system, and it uses task data dependence information to guide the scheduling of OpenMP tasks to reduce data stall times. We demonstrate our technique on a four-socket AMD Opteron machine with eight NUMA nodes and on the TILEPro64 processor, and we find that data distribution and locality-aware task scheduling improve performance by up to 69% for scientific benchmarks compared to default policies, while remaining architecture-oblivious for programmers.

Place, publisher, year, edition, pages
Hindawi Publishing Corporation, 2015
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-166580 (URN), 10.1155/2015/981759 (DOI), 000364899300001, 2-s2.0-84947272497 (Scopus ID)
Note

QC 20150615

Available from: 2015-05-11. Created: 2015-05-11. Last updated: 2018-01-11. Bibliographically approved.
2. Characterizing task-based OpenMP programs
2015 (English). In: PLoS ONE, ISSN 1932-6203, E-ISSN 1932-6203, Vol. 10, no. 4, article e0123545. Article in journal (Refereed). Published.
Abstract [en]

Programmers struggle to understand the performance of task-based OpenMP programs since profiling tools report only thread-based performance. Performance tuning also requires task-based performance information in order to balance per-task memory hierarchy utilization against exposed task parallelism. We provide a cost-effective method to extract detailed task-based performance information from OpenMP programs. We demonstrate the utility of our method by quickly diagnosing performance problems and characterizing the exposed task parallelism and per-task instruction profiles of benchmarks in the widely-used Barcelona OpenMP Tasks Suite. Using our method to characterize task-based performance, programmers can tune performance faster and understand performance tradeoffs more effectively than with existing tools.

Keyword
Scheduling Strategies, Performance Analysis, Benchmark
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-141201 (URN), 10.1371/journal.pone.0123545 (DOI), 000352590300104, 25860023 (PubMedID), 2-s2.0-84929498034 (Scopus ID)
Note

QC 20150623. Updated from "Manuscript" to "Article".

Available from: 2014-02-12. Created: 2014-02-12. Last updated: 2017-12-06. Bibliographically approved.
3. Grain Graphs: OpenMP Performance Analysis Made Easy
2016 (English). Conference paper, Published paper (Refereed).
Abstract [en]

Average programmers struggle to solve performance problems in OpenMP programs with tasks and parallel for-loops. Existing performance analysis tools visualize OpenMP task performance from the runtime system's perspective, where task execution is interleaved with other tasks in an unpredictable order. Problems with OpenMP parallel for-loops are similarly difficult to resolve since tools only visualize aggregate thread-level statistics such as load imbalance without zooming in to per-chunk granularity. This runtime-system- and thread-oriented visualization provides poor support for understanding problems with task and chunk execution time, parallelism, and memory hierarchy utilization, forcing average programmers to rely on experts or use tedious trial-and-error tuning methods for performance. We present grain graphs, a new OpenMP performance analysis method that visualizes grains – computation performed by a task or a parallel for-loop chunk instance – and highlights problems such as low parallelism, work inflation, and poor parallelization benefit at the grain level. We demonstrate that grain graphs can quickly reveal performance problems in standard OpenMP programs that are difficult to detect and characterize in fine detail using existing visualizations, simplifying OpenMP performance analysis. This enables average programmers to make portable optimizations for poorly performing OpenMP programs, reducing pressure on experts and removing the need for tedious trial-and-error tuning.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2016
Keyword
OpenMP, Performance Analysis, Parallel Programming
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-179668 (URN), 10.1145/2851141.2851156 (DOI), 000393580200029, 2-s2.0-84963732767 (Scopus ID)
Conference
21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '16)
Note

QC 20170313

Available from: 2015-12-18. Created: 2015-12-18. Last updated: 2017-03-13. Bibliographically approved.

Open Access in DiVA

phd-thesis-AM (26752 kB), 405 downloads
File information
File name: FULLTEXT01.pdf
File size: 26752 kB
Checksum (SHA-512):
7f3f2a86f0a1d83fad64ff657d3038c77b0ab0581bd69af6ae575a4cbfdbb83b56637863f5b6e92ff9e519ffe31902f0533aae0a05865b21a528ba0c8d5ae889
Type: fulltext. Mimetype: application/pdf

Search in DiVA

By author/editor: Muddukrishna, Ananya
By organisation: Software and Computer systems, SCS
Computer Systems

