Efficient Execution Paradigms for Parallel Heterogeneous Architectures
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Computer Systems. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication. (UART - Uppsala Architecture Research Team)
ORCID iD: 0000-0002-9460-1290
2016 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This thesis proposes novel, efficient execution paradigms for parallel heterogeneous architectures. The end of Dennard scaling threatens the effectiveness of dynamic voltage and frequency scaling (DVFS) in future technology nodes; new execution paradigms are therefore required to exploit the non-linear relationship between performance and energy efficiency in memory-bound application regions. To attack this problem, we propose the decoupled access-execute (DAE) paradigm. DAE transforms regions of interest (at program level) into two coarse-grained phases, the access phase and the execute phase, whose voltage and frequency we can scale independently. The access phase is intended to prefetch data into the cache and is therefore expected to be predominantly memory-bound, while the execute phase runs immediately after the access phase (which has warmed up the cache) and is therefore expected to be compute-bound.
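The two-phase structure described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: `run_dae`, `access_phase`, `execute_phase`, and the `set_frequency` callback are hypothetical names introduced here, and the frequency values are arbitrary. Python cannot issue prefetches or change core frequency, so the access phase merely touches the data and the frequency requests are only logged.

```python
from typing import Callable, List, Sequence


def run_dae(
    chunks: Sequence[Sequence[int]],
    access: Callable[[Sequence[int]], None],
    execute: Callable[[Sequence[int]], int],
    set_frequency: Callable[[float], None],
    f_low: float,
    f_high: float,
) -> List[int]:
    """Run each chunk as an access phase followed by an execute phase."""
    results = []
    for chunk in chunks:
        set_frequency(f_low)    # access phase: assumed memory-bound
        access(chunk)
        set_frequency(f_high)   # execute phase: assumed compute-bound
        results.append(execute(chunk))
    return results


def access_phase(chunk: Sequence[int]) -> None:
    # Stand-in for prefetching: read every element, compute nothing.
    for _ in chunk:
        pass


def execute_phase(chunk: Sequence[int]) -> int:
    # The actual computation, run right after the data was touched.
    return sum(x * x for x in chunk)


freq_log: List[float] = []  # records requested frequencies (hypothetical GHz)
data = [[1, 2, 3], [4, 5]]
results = run_dae(data, access_phase, execute_phase, freq_log.append, 1.6, 3.2)
print(results)   # [14, 41]
print(freq_log)  # [1.6, 3.2, 1.6, 3.2]
```

In a real system the access phase would be a compiler-generated slice of the loads and address computations, and the frequency switch would go through a hardware or operating-system mechanism rather than a callback.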

DAE achieves good energy savings (on average a 25% lower energy-delay product, EDP) without performance degradation, unlike other DVFS techniques. Furthermore, DAE increases the memory-level parallelism (MLP) of memory-bound regions, which improves the performance of memory-bound applications. To transform application regions to DAE automatically, we propose compiler techniques that generate the access phase(s) and incorporate them into the application. Our work targets affine, non-affine, and even complex, general-purpose codes. Furthermore, we explore the benefits of software multi-versioning to optimize DAE in dynamic environments and to handle codes with statically unknown access-phase overheads. In general, applications automatically transformed to DAE by our compiler maintain (or in some cases even exceed) the good performance and energy efficiency of manually optimized DAE codes.

Finally, to ease the programming of heterogeneous systems (with integrated GPUs), we propose a novel system architecture that provides unified virtual memory with low overhead. The underlying insight behind our work is that existing data-parallel programming models are a good fit for relaxed memory consistency models (e.g., the heterogeneous-race-free model). This allows us to simplify the coherency protocol between the CPU and the GPU, as well as the GPU memory management unit. On average, we achieve a 45% speedup and 45% lower EDP over the corresponding sequentially consistent (SC) implementation.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2016. 54 p.
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1405
Keyword [en]
Decoupled Execution, Performance, Energy, DVFS, Compiler Optimizations, Heterogeneous Coherence
National Category
Computer Systems
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:uu:diva-300831
ISBN: 978-91-554-9654-8
OAI: oai:DiVA.org:uu-300831
DiVA: diva2:952645
Public defence
2016-09-30, ITC/1111, Lägerhyddsvägen 2, Uppsala, 13:00 (English)
Funder
EU, FP7, Seventh Framework Programme, FP7-ICT-288653
Swedish Research Council
Available from: 2016-09-07 Created: 2016-08-15 Last updated: 2016-09-13
List of papers
1. Towards more efficient execution: a decoupled access-execute approach
2013 (English). In: Proc. 27th ACM International Conference on Supercomputing, New York: ACM Press, 2013, p. 253-262. Conference paper (Refereed)
Abstract [en]

The end of Dennard scaling is expected to shrink the range of DVFS in future nodes, limiting the energy savings of this technique. This paper evaluates how much we can increase the effectiveness of DVFS by using a software decoupled access-execute approach. Decoupling the data access from execution allows us to apply optimal voltage-frequency selection for each phase and therefore improve energy efficiency over standard coupled execution.

The underlying insight of our work is that by decoupling access and execute we can take advantage of the memory-bound nature of the access phase and the compute-bound nature of the execute phase to optimize power efficiency while maintaining good performance. To demonstrate this, we built a task-based parallel execution infrastructure consisting of: (1) a runtime system to orchestrate the execution, (2) power models to predict the optimal voltage-frequency selection at runtime, (3) a modeling infrastructure based on hardware measurements to simulate zero-latency, per-core DVFS, and (4) a hardware measurement infrastructure to verify our model's accuracy.
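As a rough illustration of per-phase voltage-frequency selection, the sketch below picks the frequency that minimizes EDP under a toy analytical model. This is not the paper's power model: the frequency list, the cubic dynamic-power term (which assumes voltage scales linearly with frequency), the static power of 10 W, and all function names are assumptions made only for this example.

```python
# Hypothetical available core frequencies in GHz.
FREQS_GHZ = [1.6, 2.0, 2.4, 2.8, 3.2]


def phase_time(f_ghz: float, cpu_gcycles: float, mem_time_s: float) -> float:
    # Compute time shrinks with frequency; memory stall time is assumed
    # to be independent of core frequency.
    return cpu_gcycles / f_ghz + mem_time_s


def phase_power(f_ghz: float, static_w: float = 10.0, dyn_coeff: float = 1.0) -> float:
    # Toy model: dynamic power grows with f^3 (voltage assumed ~ frequency).
    return static_w + dyn_coeff * f_ghz ** 3


def edp(f_ghz: float, cpu_gcycles: float, mem_time_s: float) -> float:
    # EDP = energy * delay = power * time^2.
    t = phase_time(f_ghz, cpu_gcycles, mem_time_s)
    return phase_power(f_ghz) * t * t


def best_frequency(cpu_gcycles: float, mem_time_s: float) -> float:
    # Exhaustive search over the small set of candidate frequencies.
    return min(FREQS_GHZ, key=lambda f: edp(f, cpu_gcycles, mem_time_s))


# A memory-bound access phase versus a compute-bound execute phase.
f_access = best_frequency(cpu_gcycles=0.1, mem_time_s=1.0)
f_execute = best_frequency(cpu_gcycles=3.0, mem_time_s=0.0)
print(f_access, f_execute)  # 1.6 2.8
```

Under these assumptions the memory-bound phase is best served by the lowest frequency, because its run time barely depends on the core clock, while the compute-bound phase prefers a higher one.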

Based on real hardware measurements, we project that the combination of decoupled access-execute and DVFS has the potential to improve EDP by 25% without hurting performance. On memory-bound applications, we significantly improve performance due to increased MLP in the access phase and ILP in the execute phase. Furthermore, we demonstrate that our method can achieve high performance both in the presence and in the absence of a hardware prefetcher.

Place, publisher, year, edition, pages
New York: ACM Press, 2013
Keyword
Task-Based Execution, Decoupled Execution, Performance, Energy, DVFS
National Category
Computer Systems
Research subject
Computer Systems
Identifiers
urn:nbn:se:uu:diva-203239 (URN)
10.1145/2464996.2465012 (DOI)
978-1-4503-2130-3 (ISBN)
Conference
ICS 2013, June 10-14, Eugene, OR
Projects
LPGPU FP7-ICT-288653
UPMARC
Funder
EU, FP7, Seventh Framework Programme, ICT-288653
Swedish Research Council
Available from: 2013-07-06 Created: 2013-07-05 Last updated: 2016-09-02. Bibliographically approved
2. Fix the code. Don't tweak the hardware: A new compiler approach to Voltage–Frequency scaling
2014 (English). In: Proc. 12th International Symposium on Code Generation and Optimization, New York: ACM Press, 2014, p. 262-272. Conference paper (Refereed)
Place, publisher, year, edition, pages
New York: ACM Press, 2014
National Category
Computer Science
Identifiers
urn:nbn:se:uu:diva-212778 (URN)
978-1-4503-2670-4 (ISBN)
Conference
CGO 2014, February 15-19, Orlando, FL
Projects
UPMARC
Available from: 2014-02-19 Created: 2013-12-13 Last updated: 2016-09-02. Bibliographically approved
3. Multiversioned decoupled access-execute: The key to energy-efficient compilation of general-purpose programs
2016 (English). In: Proc. 25th International Conference on Compiler Construction, New York: ACM Press, 2016, p. 121-131. Conference paper (Refereed)
Place, publisher, year, edition, pages
New York: ACM Press, 2016
National Category
Computer Science
Identifiers
urn:nbn:se:uu:diva-283200 (URN)
10.1145/2892208.2892209 (DOI)
9781450342414 (ISBN)
Conference
CC 2016, March 17–18, Barcelona, Spain
Projects
UPMARC
Available from: 2016-03-17 Created: 2016-04-11 Last updated: 2016-09-02. Bibliographically approved
4. Building Heterogeneous Unified Virtual Memories (UVMs) without the Overhead
2016 (English). In: ACM Transactions on Architecture and Code Optimization (TACO), ISSN 1544-3566, E-ISSN 1544-3973, Vol. 13, no 1, 1. Article in journal (Refereed) Published
Abstract [en]

This work proposes a novel scheme to facilitate heterogeneous systems with unified virtual memory. Research proposals implement coherence protocols for sequential consistency (SC) between central processing unit (CPU) cores and between devices. Such mechanisms introduce severe bottlenecks in the system; therefore, we adopt the heterogeneous-race-free (HRF) memory model. The use of HRF simplifies the coherency protocol and the graphics processing unit (GPU) memory management unit (MMU). Our protocol optimizes CPU and GPU demands separately, with the GPU part being simpler while the CPU is more elaborate and latency aware. We achieve an average 45% speedup and 45% energy-delay product reduction (20% energy) over the corresponding SC implementation.
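The reported figures can be cross-checked with simple arithmetic, since EDP is energy multiplied by delay. The short Python sketch below assumes that "45% speedup" means execution is 1.45 times faster and that "20% energy" means 20% less energy than the SC baseline; the function name `relative_edp` is hypothetical.

```python
def relative_edp(relative_energy: float, speedup: float) -> float:
    """EDP relative to the baseline, where EDP = energy * delay."""
    relative_delay = 1.0 / speedup  # a 1.45x speedup means ~0.69x the delay
    return relative_energy * relative_delay


# Assumed reading of the abstract: 20% less energy, 1.45x faster.
rel = relative_edp(relative_energy=0.80, speedup=1.45)
edp_reduction = 1.0 - rel
print(f"EDP reduction: {edp_reduction:.1%}")  # prints "EDP reduction: 44.8%"
```

Under this reading, the 20% energy saving and the 45% speedup together give roughly the 45% EDP reduction stated in the abstract.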

Keyword
Multicore; heterogeneous coherence; GPU MMU design; virtual coherence protocol; directory-less protocol
National Category
Computer Systems
Identifiers
urn:nbn:se:uu:diva-295765 (URN)
10.1145/2889488 (DOI)
000373904600001 ()
Projects
UPMARC
Funder
EU, FP7, Seventh Framework Programme, FP7-ICT-288653
EU, European Research Council, TIN2012-38341-C04-03
Available from: 2016-04-05 Created: 2016-06-09 Last updated: 2016-09-02. Bibliographically approved

By author/editor
Koukos, Konstantinos