Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems. (UART). ORCID iD: 0000-0001-9349-5791
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems. (UART)
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computer Systems. (UART)
2010 (English). In: Proc. International Conference for High Performance Computing, Networking, Storage and Analysis: SC 2010, Piscataway, NJ: IEEE, 2010, p. 11-. Conference paper, Published paper (Refereed)
Abstract [en]

Contention for shared cache resources has been recognized as a major bottleneck for multicores—especially for mixed workloads of independent applications. While most modern processors implement instructions to manage caches, these instructions are largely unused due to a lack of understanding of how to best leverage them. This paper introduces a classification of applications into four cache usage categories. We discuss how applications from different categories affect each other's performance indirectly through cache sharing and devise a scheme to optimize such sharing. We also propose a low-overhead method to automatically find the best per-instruction cache management policy. We demonstrate how the indirect cache-sharing effects of mixed workloads can be tamed by automatically altering some instructions to better manage cache resources. Practical experiments demonstrate that our software-only method can improve application performance up to 35% on x86 multicore hardware.
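The pollution effect described in the abstract can be illustrated with a toy model. The sketch below is not the paper's method: it assumes a fully associative LRU cache, and it models a non-temporal access simply as one that never allocates a cache line, which simplifies how real hardware treats such hints. The names `simulate`, `mixed_trace`, and the stream labels are hypothetical.

```python
from collections import OrderedDict

def simulate(trace, cache_lines, bypass=frozenset()):
    """Return hits per stream for a fully associative LRU cache.

    trace:  iterable of (stream, address) pairs.
    bypass: streams whose accesses are treated as non-temporal,
            modelled here as never allocating a cache line (assumption).
    """
    cache = OrderedDict()              # addresses ordered from LRU to MRU
    hits = {}
    for stream, addr in trace:
        hits.setdefault(stream, 0)
        if addr in cache:
            cache.move_to_end(addr)    # refresh recency on a hit
            hits[stream] += 1
        elif stream not in bypass:
            cache[addr] = None         # allocate on a miss
            if len(cache) > cache_lines:
                cache.popitem(last=False)   # evict the LRU line
    return hits

def mixed_trace(rounds=50, reuse_set=6, stream_burst=4):
    """Interleave a small reused working set with a never-reused stream."""
    trace, next_stream_addr = [], 1_000_000
    for _ in range(rounds):
        for a in range(reuse_set):
            trace.append(("reuse", a))
        for _ in range(stream_burst):
            trace.append(("stream", next_stream_addr))
            next_stream_addr += 1
    return trace

trace = mixed_trace()
polluted = simulate(trace, cache_lines=8)
managed = simulate(trace, cache_lines=8, bypass={"stream"})
print("reuse hits, streaming allocates:", polluted["reuse"])
print("reuse hits, streaming bypasses: ", managed["reuse"])
```

In this model the six-line working set fits in the eight-line cache on its own, but the streaming accesses evict it between rounds. Marking the streaming accesses as non-temporal leaves the working set resident.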

Place, publisher, year, edition, pages
Piscataway, NJ: IEEE, 2010. p. 11-
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:uu:diva-134386
DOI: 10.1109/SC.2010.44
ISBN: 978-1-4244-7557-5 (print)
OAI: oai:DiVA.org:uu-134386
DiVA, id: diva2:372276
Conference
2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, LA, November 13-19 2010
Projects
CoDeR-MP, UPMARC
Available from: 2010-11-25. Created: 2010-11-24. Last updated: 2018-01-12. Bibliographically approved
In thesis
1. Understanding Multicore Performance: Efficient Memory System Modeling and Simulation
2014 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

To increase performance, modern processors employ complex techniques such as out-of-order pipelines and deep cache hierarchies. While the increasing complexity has paid off in performance, it has become harder to accurately predict the effects of hardware/software optimizations in such systems. Traditional microarchitectural simulators typically execute code 10 000×–100 000× slower than native execution, which leads to three problems: First, high simulation overhead makes it hard to use microarchitectural simulators for tasks such as software optimizations where rapid turn-around is required. Second, when multiple cores share the memory system, the resulting performance is sensitive to how memory accesses from the different cores interleave. This requires that applications are simulated multiple times with different interleaving to estimate their performance distribution, which is rarely feasible with today's simulators. Third, the high overhead limits the size of the applications that can be studied. This is usually solved by only simulating a relatively small number of instructions near the start of an application, with the risk of reporting unrepresentative results.

In this thesis we demonstrate three strategies to accurately model multicore processors without the overhead of traditional simulation. First, we show how microarchitecture-independent memory access profiles can be used to drive automatic cache optimizations and to qualitatively classify an application's last-level cache behavior. Second, we demonstrate how high-level performance profiles, that can be measured on existing hardware, can be used to model the behavior of a shared cache. Unlike previous models, we predict the effective amount of cache available to each application and the resulting performance distribution due to different interleaving without requiring a processor model. Third, in order to model future systems, we build an efficient sampling simulator. By using native execution to fast-forward between samples, we reach new samples much faster than a single sample can be simulated. This enables us to simulate multiple samples in parallel, resulting in almost linear scalability and a maximum simulation rate close to native execution.
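As an illustration of what a microarchitecture-independent memory access profile can look like, the sketch below computes LRU stack distances for a short trace and derives a miss ratio for any cache size from that single profile. The abstract does not say which profile the thesis uses, so the choice of stack distance here is an assumption, as are the helper names `stack_distances` and `miss_ratio` and the fully associative LRU cache model.

```python
def stack_distances(trace):
    """Yield the LRU stack distance of each access (None for a first touch)."""
    stack = []                            # most recently used address is last
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # Number of distinct addresses touched since the previous use.
            yield len(stack) - 1 - pos
            stack.pop(pos)
        else:
            yield None
        stack.append(addr)

def miss_ratio(distances, cache_lines):
    """Miss ratio of a fully associative LRU cache with `cache_lines` lines.

    Assumes a non-empty profile.
    """
    distances = list(distances)
    misses = sum(1 for d in distances if d is None or d >= cache_lines)
    return misses / len(distances)

trace = ["a", "b", "c", "a", "b", "c", "d", "a"]
profile = list(stack_distances(trace))
for size in (1, 2, 3, 4):
    print(size, miss_ratio(profile, size))
```

The profile is collected once and does not depend on a particular cache size, so the same data answers the question for every size in the loop.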

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2014. p. 54
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1136
Keyword
Computer Architecture, Simulation, Modeling, Sampling, Caches, Memory Systems, gem5, Parallel Simulation, Virtualization, Multicore
National Category
Computer Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:uu:diva-220652 (URN)
978-91-554-8922-9 (ISBN)
Public defence
2014-05-22, Room 2446, Polacksbacken, Lägerhyddsvägen 2, Uppsala, 09:30 (English)
Opponent
Supervisors
Projects
CoDeR-MP, UPMARC
Available from: 2014-04-28. Created: 2014-03-18. Last updated: 2018-01-11. Bibliographically approved

By author/editor
Sandberg, Andreas; Eklöv, David; Hagersten, Erik