Hiding and Reducing Memory Latency: Energy-Efficient Pipeline and Memory System Techniques
Sembrant, Andreas. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Computer Systems; Computer Systems. (UART)
2016 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Memory accesses in modern processors are both far slower and vastly more energy-expensive than the actual computations. To improve performance, processors spend significant energy and resources trying to hide and reduce the memory latency. To hide the latency, processors use out-of-order execution to overlap memory accesses with independent work, and aggressive speculative instruction scheduling to execute dependent instructions back-to-back. To reduce the latency, processors use several levels of caching that keep frequently used data closer to the processor. However, these optimizations are not free: out-of-order execution requires expensive processor resources, speculative scheduling must re-execute instructions on incorrect speculation, and multi-level caching requires extra energy and latency to search the cache hierarchy. This thesis investigates several energy-efficient techniques for 1) hiding the latency in the processor pipeline, and 2) reducing the latency in the memory hierarchy.

Much of the inefficiency of hiding latency in the processor comes from two sources. First, processors need several large and expensive structures to do out-of-order execution (instruction queue, register file, etc.). These resources are typically allocated in program order, effectively giving all instructions equal priority. To reduce the size of these expensive resources without hurting performance, we propose Long Term Parking (LTP). LTP parks non-critical instructions before they allocate resources, thereby making room for critical memory-accessing instructions to continue and expose more memory-level parallelism. This enables us to save energy by shrinking the resource sizes without hurting performance. Second, when a load's data returns, the load's dependent instructions need to be scheduled and executed. To execute the dependent instructions back-to-back, the processor speculatively schedules them before it knows whether the input data will be available at execution time. To save energy, we investigate different scheduling techniques that reduce the number of re-executions due to misspeculation.
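
To make the cost of misspeculation concrete, the sketch below models load-hit speculation in a deliberately simplified form (the latencies and the single hit/miss distinction are illustrative assumptions, not the scheduling policies evaluated in the thesis): a load's dependents are woken assuming an L1 hit, and are replayed when the speculation turns out to be wrong.

```c
/*
 * Minimal sketch of load-hit speculative scheduling (illustrative only;
 * the latencies and the simple hit/miss split are assumptions, not the
 * scheduler designs evaluated in the thesis). Dependents of a load are
 * woken assuming an L1 hit; on a miss they must be replayed.
 */
#include <stdbool.h>
#include <stdio.h>

#define L1_LATENCY 4   /* assumed L1 hit latency (cycles)  */
#define L2_LATENCY 12  /* assumed latency on an L1 miss    */

/* Returns how many times the dependent instruction executes:
 * 1 when the hit speculation was correct, 2 when it is replayed. */
static int issue_dependent(bool load_hits_l1, int *ready_cycle)
{
    /* Speculatively wake the dependent L1_LATENCY cycles after the
     * load issues so the two can execute back-to-back. */
    if (load_hits_l1) {
        *ready_cycle = L1_LATENCY;
        return 1;
    }
    /* Misspeculation: the first execution consumed stale data and is
     * squashed; the dependent re-executes when the data arrives,
     * costing both energy and issue bandwidth. */
    *ready_cycle = L2_LATENCY;
    return 2;
}

int main(void)
{
    int cycle;
    printf("hit:  %d execution(s)\n", issue_dependent(true,  &cycle));
    printf("miss: %d execution(s)\n", issue_dependent(false, &cycle));
    return 0;
}
```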

The inefficiency of traditional memory hierarchies comes from the need to do level-by-level searches to locate data. The search starts at the L1 cache and proceeds level by level until the data is found, or is determined not to be in any cache, at which point the processor has to fetch the data from main memory. This wastes time and energy for every level that is searched. To reduce the latency, we propose tracking the location of the data directly in a separate metadata hierarchy. This allows us to access the data directly, without any search: the processor simply queries the metadata hierarchy for information about where the data is stored. Separating metadata into its own hierarchy brings a wide range of additional benefits, including flexibility in how we place data stores in the hierarchy, the ability to store data intelligently in the hierarchy, direct access to remote cores, and many other data-oriented optimizations that can leverage our precise knowledge of where data are located.
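
The following sketch illustrates the idea of a location query against a separate metadata hierarchy (the types and the lookup interface are assumptions for illustration, not the exact design developed in the thesis): one metadata query tells the processor exactly which storage, and which way within it, holds the requested line, so the data access needs no level-by-level search.

```c
/*
 * Conceptual sketch of a location query against a separate metadata
 * hierarchy (types and the lookup interface are assumptions, not the
 * thesis's exact design). One metadata query replaces the traditional
 * level-by-level tag searches: the returned location says exactly
 * which storage, and which way within it, holds the line.
 */
#include <stdint.h>
#include <stdio.h>

typedef enum { LOC_L1, LOC_L2, LOC_LLC, LOC_REMOTE_CORE, LOC_DRAM } level_t;

typedef struct {
    level_t where;   /* which storage currently holds the line         */
    uint8_t way;     /* exact way within that cache, if it is cached   */
    uint8_t core;    /* owning core when the line is in a remote cache */
} location_t;

/* Placeholder metadata lookup so the sketch is self-contained; a real
 * implementation would walk the metadata hierarchy. */
static location_t metadata_lookup(uint64_t line_addr)
{
    (void)line_addr;
    return (location_t){ .where = LOC_DRAM, .way = 0, .core = 0 };
}

int main(void)
{
    location_t loc = metadata_lookup(0x1000);
    switch (loc.where) {
    case LOC_L1:          puts("read L1 data array directly, no tag check"); break;
    case LOC_L2:          puts("skip L1, access L2 directly");               break;
    case LOC_LLC:         puts("go straight to the last-level cache");       break;
    case LOC_REMOTE_CORE: puts("access the remote core's cache directly");   break;
    case LOC_DRAM:        puts("not cached: fetch from main memory");        break;
    }
    return 0;
}
```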

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2016. 70 p.
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1450
National Category
Computer Systems; Computer Science
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:uu:diva-306369; ISBN: 978-91-554-9744-6; OAI: oai:DiVA.org:uu-306369; DiVA: diva2:1040387
Public defence
2016-12-15, Gustavianum, Akademigatan 3, Uppsala, 09:15 (English)
Available from: 2016-11-23 Created: 2016-10-27 Last updated: 2016-11-28
List of papers
1. Long Term Parking (LTP): Criticality-aware Resource Allocation in OOO Processors
2015 (English) In: Proc. 48th International Symposium on Microarchitecture, 2015. Conference paper (Refereed)
Abstract [en]

Modern processors employ large structures (IQ, LSQ, register file, etc.) to expose instruction-level parallelism (ILP) and memory-level parallelism (MLP). These resources are typically allocated to instructions in program order. This wastes resources by allocating them to instructions that are not yet ready to execute and by eagerly allocating them to instructions that are not on the application's critical path.

This work explores the possibility of allocating pipeline resources only when needed to expose MLP, thereby enabling a processor design with significantly smaller structures, without sacrificing performance. First, we identify the classes of instructions that should not reserve resources in program order and evaluate the potential performance gains we could achieve by delaying their allocation. We then use this information to “park” such instructions in a simpler, and therefore more efficient, Long Term Parking (LTP) structure. The LTP stores instructions until they are ready to execute, without allocating pipeline resources, and thereby keeps the pipeline available for instructions that can generate further MLP.

LTP can accurately and rapidly identify which instructions to park, park them before they execute, wake them when needed to preserve performance, and do so using a simple queue instead of a complex IQ. We show that even a very simple queue-based LTP design allows us to significantly reduce IQ (64→32) and register file (128→96) sizes while retaining MLP performance and improving energy efficiency.
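
The allocation policy can be pictured with the short sketch below (a simplified model of the parking idea; the criticality test, structure names, and sizes are assumptions): non-critical instructions wait in a plain FIFO and only claim out-of-order resources once they are ready.

```c
/*
 * Simplified model of criticality-aware allocation with a parking
 * queue (the criticality test and structure names are assumptions).
 * Non-critical instructions wait in a plain FIFO and only claim
 * IQ/register-file entries once they are ready, keeping the expensive
 * out-of-order resources free for MLP-generating instructions.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define LTP_SIZE 64

typedef struct {
    bool critical;        /* e.g. loads and the instructions feeding them */
    bool operands_ready;  /* all source operands have been produced       */
} insn_t;

static insn_t park[LTP_SIZE];   /* simple FIFO: no wakeup/select logic    */
static size_t head, tail;       /* overflow handling omitted in sketch    */

/* Placeholder for entering the conventional out-of-order back end. */
static void allocate_iq_and_regs(const insn_t *in)
{
    (void)in;
    puts("allocated IQ entry and physical register");
}

/* Dispatch in program order, but only critical or already-ready
 * instructions take expensive resources immediately; the rest park. */
static void dispatch(const insn_t *in)
{
    if (in->critical || in->operands_ready)
        allocate_iq_and_regs(in);
    else
        park[tail++ % LTP_SIZE] = *in;
}

/* Parked instructions re-enter the pipeline in FIFO order once their
 * operands become available. */
static void wake_parked(void)
{
    while (head != tail && park[head % LTP_SIZE].operands_ready)
        allocate_iq_and_regs(&park[head++ % LTP_SIZE]);
}

int main(void)
{
    dispatch(&(insn_t){ .critical = true,  .operands_ready = false });
    dispatch(&(insn_t){ .critical = false, .operands_ready = false });
    park[head % LTP_SIZE].operands_ready = true;  /* operands arrive later */
    wake_parked();
    return 0;
}
```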

National Category
Computer Engineering
Identifiers
urn:nbn:se:uu:diva-272468 (URN)
Conference
MICRO 2015, December 5–9, Waikiki, HI
Projects
UPMARC, UART
Available from: 2016-01-14 Created: 2016-01-14 Last updated: 2016-11-07
2. Cost-effective speculative scheduling in high performance processors
2015 (English) In: Proc. 42nd International Symposium on Computer Architecture, New York: ACM Press, 2015, 247-259 p. Conference paper (Refereed)
Place, publisher, year, edition, pages
New York: ACM Press, 2015
National Category
Computer Systems
Identifiers
urn:nbn:se:uu:diva-272467 (URN); 10.1145/2749469.2749470 (DOI); 978-1-4503-3402-0 (ISBN)
Conference
ISCA 2015, June 13–17, Portland, OR
Projects
UPMARC, UART
Available from: 2015-06-13 Created: 2016-01-14 Last updated: 2016-11-07
3. TLC: A tag-less cache for reducing dynamic first level cache energy
2013 (English) In: Proceedings of the 46th International Symposium on Microarchitecture, New York: ACM Press, 2013, 49-61 p. Conference paper (Refereed)
Abstract [en]

First-level caches are performance-critical and are therefore optimized for speed. To this end, modern processors reduce the miss ratio by using set-associative caches and optimize latency by reading all ways in parallel with the TLB and tag lookup. However, this wastes energy since only data from one way is actually used.

To reduce energy, phased caches and way-prediction techniques have been proposed wherein only the data of the matching/predicted way is read. These optimizations increase latency and complexity, making them less attractive for first-level caches.

Instead of adding new functionality on top of a traditional cache, we propose a new cache design that adds way index information to the TLB. This allows us to: 1) eliminate extra data array reads (by reading the right way directly), 2) avoid tag comparisons (by eliminating the tag array), 3) filter out misses (by checking the TLB), and 4) amortize the TLB lookup energy (by integrating it with the way information). In addition, the new cache can directly replace existing caches without any modification to the processor core or software.

This new Tag-Less Cache (TLC) reduces the dynamic energy for a 32 kB, 8-way cache by 60% compared to a VIPT cache without affecting performance.
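
A rough sketch of the mechanism is shown below (the TLB-entry layout and cache geometry are assumptions chosen to match the 32 kB, 8-way example; the paper's actual format may differ): each TLB entry carries a per-line way index for the L1, so a hit reads exactly one data way with no tag array access, and a miss is filtered out already at the TLB.

```c
/*
 * Rough sketch of the tag-less idea (the TLB-entry layout is an
 * assumption, not the paper's exact format). Each TLB entry carries a
 * per-cache-line way index for the L1, so a hit reads exactly one data
 * way with no tag array access, and a miss is filtered out at the TLB.
 */
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE      64
#define LINES_PER_PAGE 64            /* 4 KB page / 64 B lines              */
#define ASSOC          8
#define NUM_SETS       64            /* 32 kB, 8-way, 64 B lines -> 64 sets */
#define WAY_INVALID    0xFF

typedef struct {
    uint64_t vpn;                    /* virtual page number                 */
    uint64_t ppn;                    /* physical page number                */
    uint8_t  l1_way[LINES_PER_PAGE]; /* L1 way per line, or WAY_INVALID     */
} tlb_entry_t;

/* Placeholder L1 data array so the sketch is self-contained. */
static uint8_t l1_data[NUM_SETS][ASSOC][LINE_SIZE];
static const uint8_t *l1_read_way(uint32_t set, uint8_t way)
{
    return l1_data[set][way];
}

/* One TLB lookup yields both the translation and the exact L1 way, so
 * no tag comparison and no parallel way reads are needed. */
static const uint8_t *tlc_access(const tlb_entry_t *e, uint64_t vaddr,
                                 bool *miss)
{
    uint32_t line_in_page = (uint32_t)((vaddr / LINE_SIZE) % LINES_PER_PAGE);
    uint8_t  way          = e->l1_way[line_in_page];

    if (way == WAY_INVALID) {        /* miss filtered at the TLB: go to L2  */
        *miss = true;
        return 0;
    }
    *miss = false;
    uint32_t set = (uint32_t)((vaddr / LINE_SIZE) % NUM_SETS);
    return l1_read_way(set, way);
}

int main(void)
{
    tlb_entry_t e = { .vpn = 0x12, .ppn = 0x34 };
    for (int i = 0; i < LINES_PER_PAGE; i++) e.l1_way[i] = WAY_INVALID;
    e.l1_way[0] = 3;                 /* pretend line 0 is cached in way 3   */

    bool miss;
    tlc_access(&e, 0x12000, &miss);  /* line 0: hit in way 3, miss == false */
    tlc_access(&e, 0x12040, &miss);  /* line 1: filtered miss, miss == true */
    return 0;
}
```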

Place, publisher, year, edition, pages
New York: ACM Press, 2013
National Category
Computer Engineering; Computer Systems
Identifiers
urn:nbn:se:uu:diva-213236 (URN); 10.1145/2540708.2540714 (DOI); 978-1-4503-2638-4 (ISBN)
Conference
MICRO-46; December 7-11, 2013; Davis, CA, USA
Projects
UPMARC, CoDeR-MP
Available from: 2013-12-07 Created: 2013-12-19 Last updated: 2016-10-27. Bibliographically approved
4. The Direct-to-Data (D2D) Cache: Navigating the cache hierarchy with a single lookup
2014 (English) In: Proc. 41st International Symposium on Computer Architecture, Piscataway, NJ: IEEE Press, 2014, 133-144 p. Conference paper (Refereed)
Abstract [en]

Modern processors optimize for cache energy and performance by employing multiple levels of caching that address bandwidth, low latency, and high capacity. A request typically traverses the cache hierarchy, level by level, until the data is found, wasting time and energy at each level. In this paper, we present the Direct-to-Data (D2D) cache, which locates data across the entire cache hierarchy with a single lookup.

To navigate the cache hierarchy, D2D extends the TLB with per cache-line location information that indicates in which cache and way the cache line is located. This allows the D2D cache to: 1) skip levels in the hierarchy (by accessing the right cache level directly), 2) eliminate extra data array reads (by reading the right way directly), 3) avoid tag comparisons (by eliminating the tag arrays), and 4) go directly to DRAM on cache misses (by checking the TLB). This reduces the L2 latency by 40% and saves 5-17% of the total cache hierarchy energy.

D2D's lower L2 latency directly improves L2-sensitive applications' performance by 5-14%. More significantly, we can take advantage of the L2 latency reduction to optimize other parts of the microarchitecture. For example, we can reduce the ROB size for L2-bound applications by 25%, or we can reduce the L1 cache size, delivering an overall 21% energy savings across all benchmarks, without hurting performance.
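
The sketch below illustrates how per-line location information in the TLB lets a request skip hierarchy levels (the entry layout is an assumption, not the paper's exact encoding): a single lookup routes the access straight to L1, L2, or DRAM.

```c
/*
 * Sketch of navigating the hierarchy with one lookup (the entry layout
 * is an assumption, not the paper's exact encoding). The TLB holds, per
 * cache line, which level and way currently store the line, so a
 * request goes straight to L1, L2, or DRAM with no intermediate tag
 * searches.
 */
#include <stdint.h>
#include <stdio.h>

#define LINES_PER_PAGE 64

typedef enum { D2D_L1, D2D_L2, D2D_DRAM } d2d_level_t;

typedef struct {
    d2d_level_t level;   /* which level holds the line (or DRAM)   */
    uint8_t     way;     /* exact way within that level, if cached */
} d2d_loc_t;

typedef struct {
    uint64_t  vpn, ppn;
    d2d_loc_t loc[LINES_PER_PAGE];  /* per-line location information */
} d2d_tlb_entry_t;

static void d2d_access(const d2d_tlb_entry_t *e, uint64_t vaddr)
{
    const d2d_loc_t *loc = &e->loc[(vaddr >> 6) % LINES_PER_PAGE];
    switch (loc->level) {
    case D2D_L1:   printf("read L1 way %u directly\n", loc->way);          break;
    case D2D_L2:   printf("skip L1, read L2 way %u directly\n", loc->way); break;
    case D2D_DRAM: printf("miss everywhere: go straight to DRAM\n");       break;
    }
}

int main(void)
{
    d2d_tlb_entry_t e = { .vpn = 1, .ppn = 2 };
    e.loc[0] = (d2d_loc_t){ D2D_L2, 5 };    /* pretend line 0 lives in L2, way 5 */
    e.loc[1] = (d2d_loc_t){ D2D_DRAM, 0 };  /* line 1 is not cached              */
    d2d_access(&e, 0x00);                   /* -> direct L2 access               */
    d2d_access(&e, 0x40);                   /* -> straight to DRAM               */
    return 0;
}
```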

Place, publisher, year, edition, pages
Piscataway, NJ: IEEE Press, 2014
National Category
Computer Engineering; Computer Science
Identifiers
urn:nbn:se:uu:diva-235362 (URN); 10.1145/2678373.2665694 (DOI); 000343652800012 (ISI); 978-1-4799-4394-4 (ISBN)
Conference
ISCA 2014, June 14–18, Minneapolis, MN
Projects
UPMARC, CoDeR-MP
Available from: 2014-06-14 Created: 2014-10-31 Last updated: 2016-10-27. Bibliographically approved
5. A Split Cache Hierarchy for Enabling Data-oriented Optimizations
2017 (English) Conference paper (Refereed)
Abstract [en]

Today's caches tightly couple data with metadata at the cache-line granularity. As a result, they require a multitude of different solutions for finding data (associative way searches and level-by-level searches), for evictions (coherent writeback buffers and associative level-by-level searches), and for coherence (directory indirections and associative level-by-level searches), all due to the co-location of the data and its identifying metadata. This leads to complex implementations with many corner cases, increased latency and energy, and limited flexibility for data optimizations.

We propose splitting the metadata and data into two separate cache hierarchies. The metadata hierarchy tracks the location of the data and makes the hierarchy more efficient and flexible, since only the metadata needs to be kept coherent. This allows us to easily apply many different cache optimizations to the data hierarchy, including smart data placement, dynamic coherence, and direct accesses.

The new split cache hierarchy, Direct-to-Master (D2M), provides a unified mechanism for cache searching, eviction, and coherence, without indirections through a central directory, without level-by-level data movement and searches, and with no associative cache address tags for locating data. Optimizations such as near-side LLC slices and private/shared data classification can easily be built on top of D2M to further improve its efficiency. This approach delivers a 54% improvement in cache hierarchy EDP vs. a mobile processor and 40% vs. a server processor, reduces network traffic by an average of 65%, and is especially effective for workloads with large instruction footprints.
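
A simplified model of this kind of master-location tracking is sketched below (field names and the lookup interface are assumptions): one metadata query answers where a line's current master copy is, so lookups, evictions, and coherence requests can be routed directly without a central directory indirection.

```c
/*
 * Simplified model of Direct-to-Master-style location tracking (field
 * names and the lookup interface are assumptions). A single metadata
 * query answers where a line's current master copy is -- a private
 * cache on some core, an LLC slice, or memory -- so lookups, evictions
 * and coherence requests are routed directly, without a central
 * directory indirection and without per-level address tags on the data
 * side.
 */
#include <stdint.h>
#include <stdio.h>

typedef enum { M_PRIVATE, M_LLC_SLICE, M_MEMORY } master_kind_t;

typedef struct {
    master_kind_t kind;
    uint8_t core;      /* owning core when kind == M_PRIVATE       */
    uint8_t slice;     /* LLC slice when kind == M_LLC_SLICE       */
    uint8_t way;       /* exact way inside the identified cache    */
    uint8_t shared;    /* private/shared classification (optional) */
} master_loc_t;

/* Placeholder metadata-hierarchy query so the sketch is self-contained. */
static master_loc_t d2m_lookup(uint64_t line_addr)
{
    (void)line_addr;
    return (master_loc_t){ .kind = M_LLC_SLICE, .slice = 2, .way = 7 };
}

int main(void)
{
    master_loc_t m = d2m_lookup(0x2000);
    switch (m.kind) {
    case M_PRIVATE:
        printf("forward directly to core %u, way %u\n", m.core, m.way);
        break;
    case M_LLC_SLICE:
        printf("access LLC slice %u, way %u directly\n", m.slice, m.way);
        break;
    case M_MEMORY:
        printf("not cached: fetch from memory\n");
        break;
    }
    return 0;
}
```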

National Category
Computer Systems; Computer Science
Research subject
Computer Science
Identifiers
urn:nbn:se:uu:diva-306368 (URN)
Conference
International Symposium on High Performance Computer Architecture (HPCA)
Projects
UART
Available from: 2016-10-27 Created: 2016-10-27 Last updated: 2016-10-27

Open Access in DiVA

fulltext: FULLTEXT01.pdf (836 kB, application/pdf)