Improving Performance and Quality-of-Service through the Task-Parallel Model: Optimizations and Future Directions for OpenMP
Podobas, Artur
KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
2015 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

With the failure of Dennard scaling -- the observation that power density stays constant as transistors shrink -- computer hardware has become highly divergent. Initially the change concerned only the number of processors on a chip (multicores), but it has since escalated into complex heterogeneous systems with non-intuitive properties -- properties that can improve performance and power consumption but also strain the programmers expected to develop for them.

Answering these challenges is the OpenMP task-parallel model -- a programming model that simplifies the writing of parallel software. Our focus in the thesis has been to explore performance and quality-of-service directions of the OpenMP task-parallel model, particularly by taking architectural features into account.
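
As a concrete illustration of the model (standard OpenMP 3.0 tasking syntax, not code taken from the thesis), the sketch below parallelizes a recursive computation by annotating each call as a task and leaving all scheduling decisions to the runtime system:

    /* Minimal OpenMP tasking example: each recursive call becomes a task
     * that the runtime system schedules across the available processors. */
    #include <stdio.h>

    static long fib(int n)
    {
        long x, y;
        if (n < 2)
            return n;
        #pragma omp task shared(x) firstprivate(n)
        x = fib(n - 1);
        #pragma omp task shared(y) firstprivate(n)
        y = fib(n - 2);
        #pragma omp taskwait            /* wait for both child tasks */
        return x + y;
    }

    int main(void)
    {
        long r = 0;
        #pragma omp parallel
        #pragma omp single              /* one thread spawns the root tasks */
        r = fib(30);
        printf("fib(30) = %ld\n", r);
        return 0;
    }

Compiled with, e.g., gcc -fopenmp, the runtime system is free to decide which worker executes each task; that freedom is exactly what the scheduling work in this thesis exploits.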

The first question tackled is: what capabilities do existing state-of-the-art runtime systems have, and how do they perform? We empirically evaluated the performance of several modern task-parallel runtime systems. Performance and power consumption were measured using benchmarks, and we show that the two primary causes of bottlenecks in modern runtime systems lie either in the task-management overheads or in how tasks are distributed across processors.

Next, we consider quality-of-service improvements in task-parallel runtime systems. Striving to improve execution performance, current state-of-the-art runtime systems seldom take dynamic architectural features such as temperature into account when deciding how work should be distributed across the processors, which can lead to overheating. We developed and evaluated two strategies for thermal awareness in task-parallel runtime systems. The first improves performance when the computer system is constrained by temperature, while the second strives to reduce temperature while meeting soft real-time objectives.
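
As a rough sketch of the first strategy's core idea -- steering work by temperature -- a dispatcher might simply prefer the coolest core. The code below is our own simplification, not the thesis's algorithm, and read_core_temp() is a hypothetical platform hook:

    /* Hedged sketch: send the next ready task to the coolest core.
     * read_core_temp() is hypothetical (it could, e.g., be backed by
     * /sys/class/thermal on Linux); not the thesis's actual strategy. */
    #include <float.h>

    #define NUM_CORES 16

    extern double read_core_temp(int core);             /* hypothetical hook */
    extern void dispatch_to_core(void *task, int core);

    void thermal_dispatch(void *task)
    {
        int coolest = 0;
        double best = DBL_MAX;
        for (int c = 0; c < NUM_CORES; c++) {
            double t = read_core_temp(c);
            if (t < best) {                 /* remember the coolest core */
                best = t;
                coolest = c;
            }
        }
        dispatch_to_core(task, coolest);
    }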

We end the thesis by focusing on performance. Here we introduce our original contribution called BLYSK -- a prototype OpenMP framework created exclusively for performance research.

We found that overheads in current runtime systems can be expensive, often leading to performance degradation. We introduce a novel way of preserving task-graphs across application runs: task-graphs are recorded, identified, and optimized the first time an OpenMP application executes, and are re-used in subsequent executions, removing unnecessary overheads. Our proposed solution can nearly double performance compared with other state-of-the-art runtime systems.
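
The essence of the record-and-reuse idea can be sketched as follows; all names and types here are illustrative stand-ins, not BLYSK's actual interface:

    /* Hedged sketch of task-graph reuse: resolve the graph once, cache it
     * under a key derived from the dependence annotations, and replay it
     * on later runs instead of paying the resolution overhead again. */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct task_graph task_graph_t;               /* resolved DAG */

    extern uint64_t hash_dependences(const void *deps, size_t n);
    extern task_graph_t *cache_lookup(uint64_t key);
    extern task_graph_t *resolve_dependences(const void *deps, size_t n);
    extern void cache_store(uint64_t key, task_graph_t *g);
    extern void execute_graph(task_graph_t *g);

    void run_task_region(const void *deps, size_t n)
    {
        uint64_t key = hash_dependences(deps, n);
        task_graph_t *g = cache_lookup(key);
        if (!g) {                      /* first run: resolve and remember */
            g = resolve_dependences(deps, n);
            cache_store(key, g);
        }
        execute_graph(g);              /* later runs skip resolution */
    }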

Performance can also be improved through heterogeneity. Today, manufacturers are placing processors with different capabilities on the same chip. Because the processors differ, so do their power-consumption characteristics and performance. Heterogeneity adds another dimension to the multiprocessing problem: how should work be distributed across the heterogeneous processors? We evaluated the performance of existing, homogeneous scheduling algorithms and found them to be an ill match for heterogeneous systems. We proposed a novel scheduling algorithm that dynamically adjusts itself to the heterogeneous system in order to improve performance.
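
One simple way a scheduler can adjust itself to heterogeneous processors -- shown here as our own illustration, not the thesis's algorithm -- is to track each worker's observed throughput and hand out work in proportion to it:

    /* Hedged sketch: faster processors receive proportionally more work. */
    typedef struct {
        double tasks_done;         /* tasks completed so far */
        double busy_time;          /* seconds spent executing */
    } worker_stats_t;

    /* Fraction of the next batch of tasks that worker w should receive. */
    double work_share(const worker_stats_t *w,
                      const worker_stats_t *all, int n)
    {
        double rate_w = w->tasks_done / (w->busy_time + 1e-9);
        double total = 0.0;
        for (int i = 0; i < n; i++)
            total += all[i].tasks_done / (all[i].busy_time + 1e-9);
        return rate_w / total;
    }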

The thesis ends with a high-level synthesis approach to improving performance in task-parallel applications. Rather than limiting ourselves to off-the-shelf processors -- which often contain a large amount of unused logic -- our approach is to generate the processors ourselves, automatically. Our method allows us to generate application-specific hardware from the OpenMP task-parallel source code. Evaluated on FPGAs, our System-on-Chips outperformed other soft-cores such as the NiosII processor and were comparable in performance with modern state-of-the-art processors such as the Xeon Phi and the AMD Opteron.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2015. 64 p.
Series
TRITA-ICT, 2015:13
Keyword [en]
Task Parallel, OpenMP, Scheduling, OmpSs, multicore, manycore
National Category
Communication Systems
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-175539
ISBN: 978-91-7595-711-1 (print)
OAI: oai:DiVA.org:kth-175539
DiVA: diva2:861299
Public defence
2015-11-10, Sal A, KTH Kista, Electrum Kistagången 16, Kista, 10:00 (English)
Note

QC 20151016

Available from: 2015-10-16 Created: 2015-10-16 Last updated: 2015-10-16. Bibliographically approved
List of papers
1. A comparative performance study of common and popular task-centric programming frameworks
2013 (English) In: Concurrency and Computation, ISSN 1532-0626, E-ISSN 1532-0634. Article in journal, Editorial material (Refereed) Published
Abstract [en]

Programmers today face a bewildering array of parallel programming models and tools, making it difficult to choose an appropriate one for each application. An increasingly popular programming model supporting structured parallel programming patterns in a portable and composable manner is the task-centric programming model. In this study, we compare several popular task-centric programming frameworks, including Cilk Plus, Threading Building Blocks, and various implementations of OpenMP 3.0. We analyzed their performance on the Barcelona OpenMP Tasking Suite benchmarks, both on a 48-core AMD Opteron 6172 server and on a 64-core TILEPro64 embedded many-core processor. Our results show that OpenMP offers the highest flexibility for programmers, but this flexibility comes at a cost. Frameworks supporting only a specific and more restrictive model, such as Cilk Plus and Threading Building Blocks, are generally more efficient in terms of both performance and energy consumption. However, Intel's implementation of OpenMP tasks performs the best and comes closest to the specialized run-time systems.

Keyword
OpenMP, Cilk, TBB, Wool, scheduling, task performance
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-143625 (URN)
10.1002/cpe.3186 (DOI)
000349085100001 ()
2-s2.0-84919463068 (Scopus ID)
Note

QC 20140626

Available from: 2014-03-26 Created: 2014-03-26 Last updated: 2017-12-05. Bibliographically approved
2. Architecture-aware Task-scheduling: A thermal approach
2011 (English) In: http://faspp.ac.upc.edu/faspp11/, 2011. Conference paper, Published paper (Refereed)
Abstract [en]

Current task-centric many-core schedulers share a “naive” view of processor architecture: a view that ignores the processor's thermal, architectural, and power-consumption properties. Future processors will be more heterogeneous than those we see today, and, following Moore's law of transistor doubling, we foresee an increase in power consumption and thus temperature.

Thermal stress can induce errors in processors, so a common countermeasure is to slow the processor down; something task-centric schedulers should strive to avoid. The Thermal-Task-Interleaving scheduling algorithm proposed in this paper takes both the application's temperature behavior and the architecture into account when making decisions. We show that, for a mixed workload, our scheduler outperforms some of the standard, architecture-unaware scheduling solutions in use today.
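
A minimal sketch of the interleaving idea -- assuming tasks have been classified as "hot" or "cool" by profiling, which is our illustrative assumption rather than the paper's exact mechanism -- is to avoid running hot tasks back-to-back on the same core:

    /* Hedged sketch: alternate hot and cool tasks to give a core thermal
     * slack. The queue operations are assumed, not taken from the paper. */
    typedef enum { TASK_HOT, TASK_COOL } thermal_class_t;

    typedef struct task {
        thermal_class_t tclass;
        struct task *next;
    } task_t;

    extern task_t *pop(task_t **queue);   /* hypothetical; NULL if empty */

    task_t *next_task(task_t **hot_q, task_t **cool_q, thermal_class_t last)
    {
        if (last == TASK_HOT && *cool_q)
            return pop(cool_q);     /* cool the core down after a hot task */
        if (*hot_q)
            return pop(hot_q);
        return pop(cool_q);
    }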

Keyword
OpenMP, Tasks, Power, Thermal, Temperature, Scheduling, Many-core, Tilera
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-89634 (URN)
Conference
FASPP'11
Note
QC 20120215

Available from: 2012-02-15 Created: 2012-02-15 Last updated: 2015-10-16. Bibliographically approved
3. Considering Quality-of-Service for Resource Reduction using OpenMP
2014 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Not caring about resources means wasting them. Current task-based parallel models such as Cilk or OpenMP care only about execution performance, regardless of the application's actual resource needs; this can lead to over-consumption and resource waste. We present a technique to overcome this resource unawareness by extending the programming model and run-time system to dynamically adapt the allocated resources to reflect the expected quality-of-service of the application.

We show that by considering tasks' timing constraints and the expected quality-of-service in terms of real-time behavior, one can reduce the number of resources and the temperature compared to a greedy work-stealing scheduler. Our implementation uses a feedback controller that continuously samples the service experienced by the application and dynamically adjusts the number of resources to match the quality the application requires.
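
The feedback loop can be pictured with a simple proportional controller -- a stand-in of our own for whatever controller the paper actually implements:

    /* Hedged sketch: grow or shrink the worker pool until the observed
     * service rate matches the QoS target. */
    int adjust_workers(int workers, double observed_rate,
                       double target_rate, int min_w, int max_w)
    {
        double error = (target_rate - observed_rate) / target_rate;
        int delta = (int)(error * workers);    /* proportional step */
        workers += delta;
        if (workers < min_w) workers = min_w;  /* never starve the app */
        if (workers > max_w) workers = max_w;  /* never exceed the machine */
        return workers;
    }

Sampled periodically, such a controller releases processors when the application already meets its deadlines and reclaims them when service degrades, which is what reduces both resource usage and temperature relative to a greedy scheduler.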

National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-143626 (URN)
Conference
MULTIPROG 2014: Programmability Issues for Heterogeneous Multicores, Jan 22, 2014, Vienna, Austria
Note

QC 20140616

Available from: 2014-03-26 Created: 2014-03-26 Last updated: 2015-10-16. Bibliographically approved
4. Exploring heterogeneous scheduling using the task-centric programming model
2013 (English) In: Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349, Vol. 7640. Article in journal (Refereed) Published
Abstract [en]

Computer architecture technology is moving towards more heterogeneous solutions, which will contain a number of processing units with different capabilities that may increase the performance of the system as a whole. However, with increased performance comes increased complexity; complexity that is barely handled even in homogeneous multiprocessing systems. The present study tries to solve a small piece of the heterogeneous puzzle: how can we exploit all system resources in a performance-effective and user-friendly way? Our proposed solution includes a run-time system capable of using a variety of heterogeneous components while providing the user with the already familiar task-centric programming-model interface. Furthermore, when dealing with non-uniform workloads, we show that traditional approaches based on centralized or work-stealing queue algorithms do not work well, and we propose a scheduling algorithm based on trend analysis to distribute work across resources in a performance-effective way.
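
"Trend analysis" here can be illustrated by fitting a slope to a device's recent per-task completion times: a positive slope signals that the device is slowing down under the current workload. The least-squares sketch below is our reading of the idea, not the paper's implementation:

    /* Hedged sketch: least-squares slope of the last n completion times
     * against their index (assumes n >= 2); a slope > 0 means the device
     * is getting slower and should receive less work. */
    double completion_trend(const double *times, int n)
    {
        double sx = 0, sy = 0, sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sx  += i;
            sy  += times[i];
            sxy += (double)i * times[i];
            sxx += (double)i * i;
        }
        return (n * sxy - sx * sy) / (n * sxx - sx * sx);
    }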

Keyword
Task Scheduling, OpenMP, GPU, Tilera, Work-Stealing, Performance
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-120436 (URN)10.1007/978-3-642-36949-0_16 (DOI)000341240400016 ()2-s2.0-84874433328 (Scopus ID)
Conference
HeteroPAR'2012: Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms, August 27, 2012, Rhodes Island, Greece
Projects
ENCORE
Funder
Swedish e-Science Research Center
Note

QC 20130429

Available from: 2013-04-05 Created: 2013-04-05 Last updated: 2017-12-06. Bibliographically approved
5. TurboBŁYSK: Scheduling for improved data-driven task performance with fast dependency resolution
2014 (English) In: Using and Improving OpenMP for Devices, Tasks, and More: 10th International Workshop on OpenMP, IWOMP 2014, Salvador, Brazil, September 28-30, 2014. Proceedings, Springer, 2014, 45-57 p. Conference paper, Published paper (Refereed)
Abstract [en]

Data-driven task-parallelism is attracting growing interest and has now been added to OpenMP (4.0). This paradigm simplifies the writing of parallel applications, extracts parallelism, and facilitates the use of distributed-memory architectures. While the programming model itself is maturing, a problem with current run-time scheduler implementations is that they require a very large task granularity in order to scale. This limitation is at odds with the idea of task-parallel programming, where programmers should be able to concentrate on exposing parallelism with little regard to task granularity. To mitigate this limitation, we have designed and implemented TurboBŁYSK, a highly efficient run-time scheduler for tasks with explicit data-dependence annotations. We propose a novel mechanism based on pattern-saving that allows the scheduler to re-use previously resolved dependency patterns, based on programmer annotations, enabling programs to use even the smallest of tasks and still scale well. We show experimentally that our techniques in TurboBŁYSK achieve nearly twice the peak performance of other run-time schedulers. Our techniques are not OpenMP-specific and can be implemented in other task-parallel frameworks.
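
The dependence annotations in question are the standard OpenMP 4.0 depend clauses (plain OpenMP syntax below, not TurboBŁYSK-specific code); each run of such a region produces the same dependency pattern, which is what makes pattern-saving profitable:

    /* A three-stage pipeline expressed with OpenMP 4.0 data-driven tasks;
     * the depend clauses form the task-graph the scheduler must resolve. */
    #include <stdio.h>

    static double produce(void)       { return 42.0; }
    static double transform(double x) { return x * 2.0; }
    static void   consume(double x)   { printf("%f\n", x); }

    void pipeline(void)
    {
        double a = 0.0, b = 0.0;
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: a)
            a = produce();                     /* writes a */

            #pragma omp task depend(in: a) depend(out: b)
            b = transform(a);                  /* reads a, writes b */

            #pragma omp task depend(in: b)
            consume(b);                        /* reads b */
        }
    }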

Place, publisher, year, edition, pages
Springer, 2014
Series
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), ISSN 0302-9743 ; 8766
Keyword
Application programming interfaces (API), Computer software reusability, Distributed memory architecture, Parallel application, Peak performance, Programming models, Scheduler implementation, Task granularity, Task parallelism, Task performance
National Category
Computer Science
Identifiers
urn:nbn:se:kth:diva-161691 (URN)
10.1007/978-3-319-11454-5_4 (DOI)
000360155400004 ()
2-s2.0-84921513746 (Scopus ID)
978-3-319-11453-8 (ISBN)
978-3-319-11454-5 (ISBN)
Conference
10th International Workshop on OpenMP, IWOMP 2014, Salvador, Brazil, September 28-30, 2014
Note

QC 20150326

Available from: 2015-03-26 Created: 2015-03-13 Last updated: 2016-01-27. Bibliographically approved
6. Accelerating Parallel Computations with OpenMP-Driven System-on-Chip Generation for FPGAs
2014 (English) In: Embedded Multicore/Manycore SoCs (MCSoc), 2014 IEEE 8th International Symposium on, IEEE conference proceedings, 2014, 149-156 p. Conference paper, Published paper (Refereed)
Abstract [en]

The task-based programming paradigm offers a portable way of writing parallel applications. However, it requires tedious tuning of the application for performance. We present a novel design flow in which programmers can use application knowledge to easily generate a System-on-Chip (SoC) specialized in executing the application. Our design flow uses a compiler that automatically generates task-specific cores and packs them into a custom SoC. An SoC-specific run-time system schedules tasks on the cores to accelerate application execution. The generated SoC shows up to a 6000-fold performance improvement over the Altera NiosII/s processor and up to a 7-fold improvement over an AMD Opteron 6172 core. Our design flow helps programmers generate high-performance systems without requiring tuning or prior hardware-design knowledge.

Place, publisher, year, edition, pages
IEEE conference proceedings, 2014
Keyword
OpenMP, HLS, Task-based, Hardware Generation
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-161262 (URN)
10.1109/MCSoC.2014.30 (DOI)
000357812800021 ()
2-s2.0-84917706752 (Scopus ID)
Conference
IEEE 8th International Symposium on Embedded Multicore/Manycore SoCs (MCSoc), 23-25 Sept. 2014, Aizu-Wakamatsu
Note

QC 20150319

Available from: 2015-03-11 Created: 2015-03-11 Last updated: 2015-10-16. Bibliographically approved
7. From software to parallel hardware through the OpenMP programming model
(English) Manuscript (preprint) (Other academic)
Identifiers
urn:nbn:se:kth:diva-175540 (URN)
Note

QS 2015

Available from: 2015-10-16 Created: 2015-10-16 Last updated: 2015-10-16. Bibliographically approved
