Performance Modeling of Heterogeneous Systems
Norwegian University of Science and Technology, Faculty of Information Technology, Mathematics and Electrical Engineering, Department of Computer and Information Science.
2012 (English). Doctoral thesis, monograph (Other academic)
Abstract [en]

As the complexity of parallel computers grows, the constraints imposed by building larger systems require larger and increasingly non-linear parameter sets to model their behavior realistically. These heterogeneous characteristics force a trade-off between the complexity and the accuracy of performance models, which makes the models harder to use for design decisions.

In this thesis, we take a bottom-up approach to realistically modeling software and hardware interactions, composing system models from simpler, linear models that allow parts of the analysis to be automated. We associate empirically benchmarked platform performance metrics with the core elements of a variant of bulk-synchronous execution, aiming to quantify application performance and the associated potential for overlapping computation and communication on SMP clusters.
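As an illustration of the kind of simple linear model referred to above, the sketch below fits the common latency-bandwidth form t(n) = alpha + beta * n to benchmarked message timings by ordinary least squares. The model form, the hypothetical helper names `fit_linear_cost` and `predict`, and the sample timings are assumptions for illustration; they are not taken from the thesis.

```python
from typing import Sequence, Tuple


def fit_linear_cost(sizes: Sequence[float], times: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of t(n) = alpha + beta * n (hypothetical helper).

    alpha approximates a fixed per-message latency (seconds),
    beta the inverse bandwidth (seconds per byte).
    """
    if len(sizes) != len(times) or len(sizes) < 2:
        raise ValueError("need at least two (size, time) samples of equal length")
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    if sxx == 0:
        raise ValueError("message sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


def predict(alpha: float, beta: float, nbytes: float) -> float:
    """Predicted transfer time for a message of nbytes under the fitted model."""
    return alpha + beta * nbytes


if __name__ == "__main__":
    # Made-up timings (seconds) for illustration only: 2 us latency, 1 ns per byte.
    sizes = [1_000, 10_000, 100_000, 1_000_000]
    times = [2e-6 + 1e-9 * s for s in sizes]
    alpha, beta = fit_linear_cost(sizes, times)
    print(f"alpha = {alpha:.3e} s, beta = {beta:.3e} s/byte")
    print(f"predicted 64 KiB transfer: {predict(alpha, beta, 65536):.3e} s")
```

Real benchmark data is noisy and often piecewise (for example, eager versus rendezvous protocols), so in practice one such fit per message-size regime may be needed.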

The original bulk-synchronous performance model is introduced, and we identify areas of computation and communication where its abstractions impede realistic models of contemporary hardware. These areas are addressed independently: using experimental evidence, we develop a representation that collects computation kernel characteristics and pairwise communications in matrices, which are then combined into a system model. As bulk-synchronous execution depends strongly on periodic, global synchronization, we develop a cost model for it by combining latency measurements with a parametric representation of signalling patterns, and we experimentally verify the resulting predictions for three common algorithms.
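For reference, the classic BSP model charges each superstep w + h*g + l, where w is the largest local computation time, h the largest number of words any processor sends or receives, g the per-word communication cost, and l the barrier cost. The sketch below contrasts that with one hypothetical way of replacing the scalars g and l by matrices of pairwise measurements: per-pair transfer costs for communication, and a dissemination barrier whose cost is estimated from a latency matrix. The matrix layout, the function names, and the choice of a dissemination barrier are assumptions for illustration; the thesis's own representation and its three verified algorithms are not reproduced here.

```python
from typing import List

Matrix = List[List[float]]


def classic_bsp_superstep(work: List[float], words: Matrix, g: float, l: float) -> float:
    """Classic BSP cost w + h*g + l for one superstep.

    work[i]     : local computation time of processor i
    words[i][j] : words sent from processor i to processor j
    """
    p = len(work)
    w = max(work)
    sent = [sum(words[i][j] for j in range(p) if j != i) for i in range(p)]
    recv = [sum(words[i][j] for i in range(p) if i != j) for j in range(p)]
    h = max(max(s, r) for s, r in zip(sent, recv))  # the superstep's h-relation
    return w + h * g + l


def pairwise_comm_time(words: Matrix, cost_per_word: Matrix) -> float:
    """Hypothetical pairwise variant: each processor pays the measured per-word
    cost of every link it sends on, and the superstep waits for the slowest
    sender. Contention and overlap are ignored in this sketch."""
    p = len(words)
    return max(
        sum(words[i][j] * cost_per_word[i][j] for j in range(p) if j != i)
        for i in range(p)
    )


def dissemination_barrier_cost(latency: Matrix) -> float:
    """Estimated cost of a dissemination barrier from a measured latency matrix.

    In round k (k = 0 .. ceil(log2 p) - 1), process i signals (i + 2**k) mod p.
    Treating rounds as lockstep and summing each round's slowest signal gives a
    simple upper-bound-style estimate (real processes advance independently).
    """
    p = len(latency)
    rounds = (p - 1).bit_length()  # equals ceil(log2 p) for p >= 1
    total = 0.0
    for k in range(rounds):
        step = 2 ** k
        total += max(latency[i][(i + step) % p] for i in range(p))
    return total


if __name__ == "__main__":
    # Hypothetical 2-node, 4-process topology: {0, 1} on one node, {2, 3} on the other.
    intra, inter = 1e-7, 1e-6  # made-up latencies in seconds
    node = [0, 0, 1, 1]
    lat = [[0.0 if i == j else (intra if node[i] == node[j] else inter)
            for j in range(4)] for i in range(4)]
    print(f"dissemination barrier estimate: {dissemination_barrier_cost(lat):.2e} s")
```

In this made-up topology, every round of the dissemination pattern contains at least one inter-node signal, so the estimate is dominated by the inter-node latency. This is the kind of observation that motivates topology-aware signalling patterns.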

We describe a design for implementing the BSPLib programming interface that combines thread and message-passing parallelism to achieve overlap on commodity cluster platforms, and that implements its one-sided communication primitives using out-of-band control messages. We augment the cost model of one adapted synchronization algorithm with the corresponding bandwidth requirement and validate it, completing a framework for modeling the performance of BSPLib programs.

Finally, we test the utility of this framework as a proof of concept for guiding software performance adaptations, using two cases. First, we use the latency terms to generate synchronization operations automatically, using model predictions to produce patterns customized to the platform topology, and show that the resulting algorithms equal or outperform the system defaults. Second, we compare the strong-scaling characteristics of three implementations of a 5-point stencil code. The experiments reveal the performance overhead of our implementation, but also its ability to predict program cost, including the parameter values that balance the overlap of computation and communication.
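To make the idea of predicting parameters for balanced overlap concrete, the sketch below estimates the per-iteration time of a 5-point stencil on an n x n grid split into horizontal strips across p processes, with and without overlapping the halo exchange. Every parameter (per-point update cost, linear message cost, strip decomposition, halos sent one after another) is a hypothetical placeholder rather than a value or method from the thesis; the sketch only shows how a cost model can report the process count at which interior computation stops hiding communication.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class StencilParams:
    n: int                # grid is n x n points, split into p horizontal strips
    point_cost: float     # seconds per 5-point update (hypothetical)
    alpha: float          # per-message latency in seconds (hypothetical)
    beta: float           # seconds per byte (hypothetical)
    word_bytes: int = 8   # double-precision values


def components(prm: StencilParams, p: int) -> Tuple[float, float, float]:
    """Return (interior compute, boundary compute, halo exchange) seconds for one
    iteration of a process that has neighbours on both sides (the worst case)."""
    if p < 1 or prm.n % p != 0 or prm.n // p < 2:
        raise ValueError("assumes p divides n and each strip has >= 2 rows")
    rows = prm.n // p
    boundary = 2 * prm.n * prm.point_cost           # first and last row need halo data
    interior = (rows - 2) * prm.n * prm.point_cost  # rows computable without halos
    # Two halo rows (one per neighbour), assumed to be sent one after another.
    comm = 0.0 if p == 1 else 2 * (prm.alpha + prm.beta * prm.word_bytes * prm.n)
    return interior, boundary, comm


def iteration_times(prm: StencilParams, p: int) -> Tuple[float, float]:
    """Predicted (no-overlap, overlap) seconds per iteration on p processes."""
    interior, boundary, comm = components(prm, p)
    no_overlap = interior + boundary + comm
    # Overlap: exchange halos while updating the interior, then update boundary rows.
    overlap = max(interior, comm) + boundary
    return no_overlap, overlap


def balance_point(prm: StencilParams, candidates: Iterable[int]) -> Optional[int]:
    """Smallest candidate p > 1 at which interior work no longer hides the exchange."""
    for p in sorted(candidates):
        interior, _, comm = components(prm, p)
        if p > 1 and interior <= comm:
            return p
    return None


if __name__ == "__main__":
    prm = StencilParams(n=1024, point_cost=1e-8, alpha=1e-6, beta=1e-9)
    for p in (1, 2, 4, 16, 64, 256, 512):
        no_ov, ov = iteration_times(prm, p)
        print(f"p={p:4d}  no overlap {no_ov:.3e} s  overlap {ov:.3e} s")
    print("balance point:", balance_point(prm, (2, 4, 16, 64, 256, 512)))
```

With these made-up numbers, overlap hides the exchange up to p = 256, and at p = 512 each strip has no interior rows left to overlap with. A measured model would replace the placeholders with benchmarked values.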

Place, publisher, year, edition, pages
NTNU, 2012.
Doctoral theses at NTNU, ISSN 1503-8181 ; 2012:344
National Category
Information and communication science
URN: urn:nbn:no:ntnu:diva-19705
ISBN: 978-82-471-4015-4 (printed ver.)
ISBN: 978-82-471-4016-1 (electronic ver.)
OAI: diva2:578154
Public defence
2012-12-10, 00:00
Available from: 2012-12-17. Created: 2012-12-17. Last updated: 2012-12-17. Bibliographically approved.

Open Access in DiVA

Full text: FULLTEXT01.pdf (application/pdf, 1057 kB)
