Optimizing Performance in Highly Utilized Multicores with Intelligent Prefetching
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Architecture and Computer Communication; Division of Computer Systems. (Computer Architecture (UART))
2016 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Modern processors apply sophisticated techniques, such as deep cache hierarchies and hardware prefetching, to increase performance. While such complex hardware structures have helped improve performance in general, their full potential is not realized because software often utilizes the memory hierarchy inefficiently. Performance can be improved further by ensuring careful interaction between software and hardware, typically by increasing cache utilization and by conserving DRAM bandwidth, i.e., retaining more useful data in the caches and reducing the number of data requests to DRAM. One way to achieve this is to conserve space across the cache hierarchy and increase the opportunity for temporal reuse of cached data. Conserving DRAM bandwidth is likewise essential for performance in highly utilized multicores, as it can easily become a critical resource. When multiple cores are active and the per-core share of DRAM bandwidth shrinks, its efficient utilization plays an important role in improving overall performance. Together, the cache hierarchy and the DRAM bandwidth play a significant role in determining the overall performance of multicores.

Based on deep insight from memory behavior modeling of software, this thesis explores five software-only methods to analyze and increase performance in multicores. The underlying philosophy that drives these techniques is to increase cache utilization and conserve DRAM bandwidth by 1) making data prefetching more accurate, and 2) lowering the miss rate in the cache hierarchy, either by preserving useful data longer through cache-bypassing of less useful data or via code-size compaction using compiler options. First, we show how microarchitecture-independent memory access profiles can be used to analyze the instruction cache performance of software. We use this information in a compiler pass to recompile application phases with high instruction cache miss rates for smaller code size, in an effort to improve the application's instruction cache behavior. Second, we demonstrate how a resource-efficient software prefetching method can be combined with hardware prefetching to improve performance in multicores running software that exhibits irregular memory access patterns. Third, we show that hardware prefetching on high-performance commodity multicores is sub-optimal and demonstrate how a resource-efficient software-only prefetching method can perform better in fully utilized multicores. Fourth, we present an adaptive prefetching approach that dynamically combines software and hardware prefetching in a runtime system to improve performance in highly utilized multicores. Finally, in the fifth work we develop a method to predict per-core prefetching configurations that deliver near-optimal overall multicore performance. Together, these software techniques enable us to tap greater performance in multicores (up to 50%) without requiring more processing resources.
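The resource-efficient prefetching idea recurs throughout the thesis. As a rough illustration only (not the thesis's actual implementation), the C sketch below inserts a software prefetch for an irregular, index-driven access pattern that a hardware stride prefetcher cannot predict, using a non-temporal locality hint so the prefetched line is less likely to displace useful cached data. The lookahead distance of 16 is an illustrative assumption.

```c
#include <assert.h>

#define PREFETCH_DIST 16  /* illustrative lookahead; a real tool tunes this */

/* Sum data[] through an index array: an irregular access pattern
 * that hardware stride prefetchers typically cannot predict. */
long sum_indirect(const long *data, const int *idx, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n) {
            /* GCC/Clang builtin: prefetch the future irregular access.
             * The third argument (temporal locality 0) requests a
             * non-temporal fill, marking the line for early eviction
             * instead of letting it pollute the cache hierarchy. */
            __builtin_prefetch(&data[idx[i + PREFETCH_DIST]], 0, 0);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

Prefetching helps only when the indirect accesses actually miss; issuing prefetches indiscriminately wastes exactly the shared resources the thesis aims to conserve, which is why its methods first identify the accesses that frequently miss.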

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2016. 54 p.
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1335
Keyword [en]
Performance, optimization, prefetching, multicore, memory hierarchy
National Category
Computer Science
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:uu:diva-272095
ISBN: 978-91-554-9450-6 (print)
OAI: oai:DiVA.org:uu-272095
DiVA: diva2:893081
Public defence
2016-03-21, ITC/2446, Informationsteknologiskt centrum, Lägerhyddsvägen 2, Uppsala, 13:00 (English)
Opponent
Supervisors
Available from: 2016-02-25. Created: 2016-01-11. Last updated: 2016-04-18. Bibliographically approved.
List of papers
1. Low Overhead Instruction-Cache Modeling Using Instruction Reuse Profiles
2012 (English). In: International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'12), IEEE Computer Society, 2012, pp. 260-269. Conference paper, Published paper (Refereed)
Abstract [en]

Performance loss caused by L1 instruction cache misses varies between different architectures and cache sizes. For processors employing power-efficient in-order execution with small caches, performance can be significantly affected by instruction cache misses. The growing use of low-power multi-threaded CPUs (with shared L1 caches) in general purpose computing platforms requires new efficient techniques for analyzing application instruction cache usage. Such insight can be achieved using traditional simulation technologies modeling several cache sizes, but the overhead of simulators may be prohibitive for practical optimization usage. In this paper we present a statistical method to quickly model application instruction cache performance. Most importantly, we propose a very low-overhead sampling mechanism to collect runtime data from the application's instruction stream. This data is fed to the statistical model, which accurately estimates the instruction cache miss ratio for the sampled execution. Our sampling method is about 10x faster than previously suggested sampling approaches, with average runtime overhead as low as 25% over native execution. The architecturally independent data collected is used to accurately model the miss ratio for several cache sizes simultaneously, with an average absolute error of 0.2%. Finally, we show how our tool can be used to identify program phases with a large instruction cache footprint. Such phases can then be targeted for optimizations that reduce their code footprint.
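The miss-ratio modeling in this paper builds on reuse information. The sketch below shows the textbook form of the underlying idea for a fully associative LRU cache, where an access hits exactly when its stack distance (the number of distinct addresses touched since the previous access to the same address) is smaller than the cache size. The O(n²) brute-force distance computation is a simplification for illustration, not the paper's low-overhead sampling mechanism.

```c
/* Stack distance of access i: number of distinct addresses touched
 * since the previous access to trace[i]; -1 for a cold (first) access. */
static int stack_distance(const unsigned long *trace, int i) {
    int last = -1;
    for (int j = i - 1; j >= 0; j--)
        if (trace[j] == trace[i]) { last = j; break; }
    if (last < 0) return -1;               /* cold miss */
    int distinct = 0;
    for (int j = last + 1; j < i; j++) {   /* count distinct addresses */
        int seen = 0;
        for (int k = last + 1; k < j; k++)
            if (trace[k] == trace[j]) { seen = 1; break; }
        if (!seen) distinct++;
    }
    return distinct;
}

/* Estimated miss ratio of a fully associative LRU cache holding
 * `lines` cache lines: an access hits iff its stack distance is
 * non-negative and smaller than the cache size. */
double miss_ratio(const unsigned long *trace, int n, int lines) {
    int misses = 0;
    for (int i = 0; i < n; i++) {
        int d = stack_distance(trace, i);
        if (d < 0 || d >= lines) misses++;
    }
    return (double)misses / n;
}
```

Because the stack distance of each access is independent of the cache size, one pass over the trace yields miss ratios for many cache sizes at once, which is what makes this style of modeling attractive compared with simulating each size separately.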

Place, publisher, year, edition, pages
IEEE Computer Society, 2012
Series
Computer Architecture and High Performance Computing, ISSN 1550-6533
National Category
Computer Systems; Computer Science; Computer Engineering
Research subject
Computer Science; Computer Systems
Identifiers
urn:nbn:se:uu:diva-180148 (URN), 10.1109/SBAC-PAD.2012.25 (DOI), 978-1-4673-4790-7 (ISBN)
Conference
24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), October 2012, New York, NY, USA
Projects
CoDeR-MP, UPMARC
Available from: 2012-08-30. Created: 2012-08-30. Last updated: 2016-03-09.
2. A case for resource efficient prefetching in multicores
2014 (English). In: Proc. 43rd International Conference on Parallel Processing, IEEE Computer Society, 2014, pp. 101-110. Conference paper, Published paper (Refereed)
Abstract [en]

Modern processors typically employ sophisticated prefetching techniques for hiding memory latency. Hardware prefetching has proven very effective and can speed up some SPEC CPU 2006 benchmarks by more than 40% when running in isolation. However, this speedup often comes at the cost of prefetching a significant volume of useless data (sometimes more than twice the data required) which wastes shared last level cache space and off-chip bandwidth. This paper explores how an accurate resource-efficient prefetching scheme can benefit performance by conserving shared resources in multicores. We present a framework that uses low-overhead runtime sampling and fast cache modeling to accurately identify memory instructions that frequently miss in the cache. We then use this information to automatically insert software prefetches in the application. Our prefetching scheme has good accuracy and employs cache bypassing whenever possible. These properties help reduce off-chip bandwidth consumption and last-level cache pollution. While single-thread performance remains comparable to hardware prefetching, the full advantage of the scheme is realized when several cores are used and demand for shared resources grows. We evaluate our method on two modern commodity multicores. Across 180 mixed workloads that fully utilize a multicore, the proposed software prefetching mechanism achieves up to 24% better throughput than hardware prefetching, and performs 10% better on average.

Place, publisher, year, edition, pages
IEEE Computer Society, 2014
National Category
Computer Science
Identifiers
urn:nbn:se:uu:diva-234547 (URN), 10.1109/ICPP.2014.19 (DOI), 978-1-4799-5618-0 (ISBN)
Conference
2014 43rd International Conference on Parallel Processing (ICPP), September 9-12, Minneapolis, MN
Available from: 2014-11-25. Created: 2014-10-20. Last updated: 2016-03-09. Bibliographically approved.
3. AREP: Adaptive Resource Efficient Prefetching for Maximizing Multicore Performance
2015 (English). In: Proc. 24th International Conference on Parallel Architectures and Compilation Techniques, IEEE Computer Society, 2015, pp. 367-378. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
IEEE Computer Society, 2015
National Category
Computer Science
Identifiers
urn:nbn:se:uu:diva-265614 (URN), 10.1109/PACT.2015.35 (DOI), 000378942700031 (), 978-1-4673-9524-3 (ISBN)
Conference
PACT 2015, October 18–21, San Francisco, CA
Available from: 2015-11-02. Created: 2015-11-02. Last updated: 2016-08-10. Bibliographically approved.
4. Resource conscious prefetching for irregular applications in multicores
2014 (English). In: Proc. International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), Piscataway, NJ: IEEE, 2014, pp. 34-43. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Piscataway, NJ: IEEE, 2014
National Category
Computer Science
Identifiers
urn:nbn:se:uu:diva-234545 (URN), 10.1109/SAMOS.2014.6893192 (DOI), 000361019300005 (), 978-1-4799-3770-7 (ISBN)
Conference
SAMOS XIV, July 14–17, 2014
Projects
UPMARC
Available from: 2014-07-17. Created: 2014-10-20. Last updated: 2016-03-09. Bibliographically approved.
5. Perf-Insight: A Simple, Scalable Approach to Optimal Data Prefetching in Multicores
2015 (English). Report (Other academic)
Series
Technical report / Department of Information Technology, Uppsala University, ISSN 1404-3203 ; 2015-037
National Category
Computer Science
Identifiers
urn:nbn:se:uu:diva-271148 (URN)
Available from: 2015-12-28. Created: 2016-01-06. Last updated: 2016-03-09. Bibliographically approved.

Open Access in DiVA

fulltext (811 kB), 204 downloads
File information
File name: FULLTEXT01.pdf
File size: 811 kB
Checksum (SHA-512): e7100217b8d9764c745fc8e8beab911f991e416f1df8e0b566ed0abcd7440fbe211bf338fad52dc41ef4fa2160d99d94a2bf4bdcc6d03df874ec9d91456fd880
Type: fulltext
Mimetype: application/pdf