11. Optimizing Cache Performance

Basic Tricks to Improve Performance

  • Reducing miss rate

    • More associativity

    • Alternatives/enhancements to associativity

      • Victim caches, hashing, pseudo-associativity, skewed associativity

    • Better replacement/insertion policies

    • Software approaches

  • Reducing miss latency/cost

    • Multi-level caches

    • Critical word first

    • Subblocking/sectoring

    • Better replacement/insertion policies

    • Non-blocking caches (multiple cache misses in parallel)

    • Issues in multicore caches

Ways of Reducing Conflict Misses

  • Instead of building highly-associative caches, many other approaches have been proposed in the literature:

    • Victim Caches

    • Hashed/randomized Index Functions

    • Pseudo Associativity

    • Skewed Associative Caches

Victim Cache: Reducing Conflict Misses

Note: A small victim cache with 4 or 8 entries is added to hold blocks evicted from the cache, so that frequently accessed blocks that map to the same set do not keep evicting each other (the ping-pong effect, e.g., ABABABAB).

  • Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers,” ISCA 1990. [Read the original paper for a deeper understanding.]

  • Idea: Use a small fully associative buffer (victim cache) to store evicted blocks (see the sketch after this list)

    • +Can avoid ping-ponging of cache blocks mapped to the same set (when two blocks accessed close together in time conflict with each other)

    • --Increases miss latency if accessed serially with L2;

    • --Adds complexity
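
A minimal sketch of the victim-cache check performed on an L1 miss, assuming a tiny fully associative buffer with FIFO replacement; the entry count, names, and swap-back policy are illustrative and not Jouppi's exact design.

#include <stdbool.h>
#include <stdint.h>

#define VICTIM_ENTRIES 4

typedef struct { bool valid; uint64_t block_addr; } VictimEntry;

VictimEntry victim[VICTIM_ENTRIES];
static unsigned victim_next;                       // simple FIFO replacement pointer

// Called on an L1 miss. If the missing block is in the victim cache, swap it
// back into L1 (the block evicted from L1 takes its place). Otherwise insert
// the evicted L1 block into the victim cache and service the miss from L2.
bool victim_lookup_and_swap(uint64_t miss_addr, uint64_t l1_victim_addr)
{
    for (int i = 0; i < VICTIM_ENTRIES; i++) {
        if (victim[i].valid && victim[i].block_addr == miss_addr) {
            victim[i].block_addr = l1_victim_addr; // swap with the L1 victim
            return true;                           // hit: no access to L2 needed
        }
    }
    victim[victim_next] = (VictimEntry){ .valid = true, .block_addr = l1_victim_addr };
    victim_next = (victim_next + 1) % VICTIM_ENTRIES;
    return false;                                  // miss: fetch from L2/memory
}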

Hashing and Pseudo-Associativity

  • Hashing: Use better “randomizing” index functions

    • +can reduce conflict misses

      • by distributing the accessed memory blocks more evenly to sets

      • Example of conflicting accesses: stride access pattern where stride value equals number of sets in cache

    • --More complex to implement: can lengthen critical path

  • Pseudo-associativity

    • View each "way" of the cache as a separate direct-mapped cache

    • Ways are searched in sequence (first, second, …)

      • On a hit in the kth way, the line is promoted to the first way, and the lines in ways 1 to k-1 are each demoted by one position.

      • On a miss, the new block is filled into the first way, the block in the nth (last) way is evicted, and the blocks in ways 1 to n-1 are each demoted by one position (see the sketch below).
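
A minimal sketch of pseudo-associative lookup under the scheme described above; the 2-way configuration, array names, and the decision to model only tags are assumptions for illustration.

#include <stdbool.h>
#include <stdint.h>

#define NUM_WAYS 2
#define NUM_SETS 1024

typedef struct { bool valid; uint64_t tag; } Line;

Line cache[NUM_WAYS][NUM_SETS];    // way 0 is searched first, then way 1, ...

bool pseudo_assoc_lookup(uint64_t tag, uint32_t set)
{
    // Search the "ways" in sequence, as if each were a direct-mapped cache.
    for (int k = 0; k < NUM_WAYS; k++) {
        if (cache[k][set].valid && cache[k][set].tag == tag) {
            // Hit in way k: promote the line to way 0, demote ways 0..k-1 by one.
            Line hit = cache[k][set];
            for (int w = k; w > 0; w--)
                cache[w][set] = cache[w - 1][set];
            cache[0][set] = hit;
            return true;           // hits in later ways take longer (serial search)
        }
    }
    // Miss: evict the line in the last way, demote the rest, fill into way 0.
    for (int w = NUM_WAYS - 1; w > 0; w--)
        cache[w][set] = cache[w - 1][set];
    cache[0][set] = (Line){ .valid = true, .tag = tag };
    return false;
}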

Skewed Associative Caches

  • Idea: Reduce conflict misses by using different index functions for each cache way

  • Seznec, “A Case for Two-Way Skewed-Associative Caches,” ISCA 1993.

Basic 2-way associative cache structure

Note: In the basic structure, the sets of Way 0 and Way 1 correspond one-to-one.

Each bank has a different index (hash) function

Note: Way 0 and Way 1 use different hash functions, so the notion of a set becomes dynamic and block placement is more randomized and balanced.

  • Idea: Reduce conflict misses by using different index functions for each cache way

  • Benefit: indices are more randomized (memory blocks are better distributed across sets) due to hashing (see the index-function sketch after this list)

    • Less likely two blocks have same index

      • Reduced conflict misses

  • Cost: additional latency of hash function
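
A minimal sketch of per-way index functions for a 2-way skewed cache; the XOR-folding hash is an illustrative choice, not the skewing functions from Seznec's paper.

#include <stdint.h>

#define SET_BITS 10
#define NUM_SETS (1u << SET_BITS)

// Way 0 uses the plain low-order set bits; way 1 XOR-folds higher address
// bits into the index, so two blocks that conflict in way 0 are unlikely to
// also conflict in way 1.
static inline uint32_t index_way0(uint64_t block_addr)
{
    return (uint32_t)(block_addr & (NUM_SETS - 1));
}

static inline uint32_t index_way1(uint64_t block_addr)
{
    return (uint32_t)((block_addr ^ (block_addr >> SET_BITS)) & (NUM_SETS - 1));
}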

Example Software Approaches

  • Restructuring data access patterns

    • Loop interchange

    • Blocking

  • Restructuring data layout

    • Data structure separation/merging

Restructuring Data Access Patterns

  • Idea: Restructure data layout or access patterns

  • Example: If column-major

    • x[i+1,j] follows x[i,j] in memory

    • x[i,j+1] is far away from x[i,j]

    // Poor code
    for i = 1, rows
      for j = 1, columns
        sum = sum + x[i, j]
    // Better code
    for j = 1, columns
      for i = 1, rows
        sum = sum + x[i, j]

    Note: The poor code traverses the array row by row, which does not match the column-major layout; it is cache-unfriendly and inefficient.

  • This is called loop interchange

  • Other optimizations can also increase hit rate

    • Loop fusion, array merging, … [See Zhang Chenxi's textbook]

  • What if multiple arrays? Unknown array size at compile time?

  • Blocking for better cache utilization [see the textbook for an example; a minimal sketch also follows this list]

    • Divide loops operating on arrays into computation chunks so that each chunk can hold its data in the cache

    • Avoids cache conflicts between different chunks of computation

    • Essentially: Divide the working set so that each piece fits in the cache

  • But there can still be conflicts within a block:

    1. there can be conflicts among different arrays

    2. array sizes may be unknown at compile/programming time
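
A minimal blocking (tiling) sketch in C, assuming a row-major matrix transpose; N, B, and the function name are illustrative, and the tile size would be tuned to the actual cache.

#define N 4096
#define B 32    // tile size: 2 * B * B doubles (~16 KB) fit in a typical L1

// Blocked (tiled) transpose: each B x B tile of src and dst stays resident
// in the cache while it is processed, so the strided writes to dst[j][i]
// do not keep evicting the rows of src being read.
void transpose_blocked(double dst[N][N], const double src[N][N])
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    dst[j][i] = src[i][j];
}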

Restructuring Data Layout

struct Node {
  struct Node* next;
  int key;
  char name[256];
  char school[256];
};

while (node) {
  if (node->key == input_key) {
    // access other fields of node
  }
  node = node->next;
}
  • Pointer based traversal (e.g., of a linked list)

  • Assume a huge linked list (1M nodes) and unique keys

  • Why does the code on the left have poor cache hit rate?

    • “Other fields” occupy most of the cache line even though they are rarely accessed!

      Note: Only the pointer and key of each node brought into the cache are useful; the rest of the cached data is wasted.

struct Node {
  struct Node* next;
  int key;
  struct NodeData* node_data;
};
struct NodeData {
  char name[256];
  char school[256];
};
while (node) {
  if (node->key == input_key) {
    // access node->node_data
  }
  node = node->next;
}
  • Idea: Separate frequently used fields of a data structure and pack them into a separate data structure

  • Who should do this?

    • Programmer

    • Compiler

      • Profiling vs. dynamic

    • Hardware?

    • Who can determine what is frequently used?

Improving Basic Cache Performance

  • Reducing miss rate

    • More associativity

    • Alternatives/enhancements to associativity

      • Victim caches, hashing, pseudo-associativity, skewed associativity

    • Better replacement/insertion policies

    • Software approaches

  • Reducing miss latency/cost

    • Multi-level caches

    • Critical word first

    • Subblocking/sectoring

    • Better replacement/insertion policies [contact me if interested]

    • Non-blocking caches (multiple cache misses in parallel)

    • Issues in multicore caches

Miss Latency/Cost

  • What is miss latency or miss cost affected by?

    • Where does the miss get serviced from?

      • Local vs. remote memory

      • What level of cache in the hierarchy?

      • Row hit versus row miss

      • Queueing delays in the memory controller

    • How much does the miss stall the processor?

      • Is it overlapped with other latencies?

      • Is the data immediately needed?

Demo: Memory Level Parallelism (MLP)

Note: B and C are two different misses that can be serviced at the same time; this is more efficient than A.

  • Memory Level Parallelism (MLP) means generating and servicing multiple memory accesses in parallel [Glew’98]

  • Several techniques to improve MLP (e.g., out-of-order execution)

  • MLP varies. Some misses are isolated and some parallel

  • How does this affect cache replacement?

Traditional Cache Replacement Policies

  • Traditional cache replacement policies try to reduce miss count

  • Implicit assumption: Reducing miss count reduces memory-related stall time

  • Misses with varying cost/MLP break this assumption!

    • Eliminating an isolated miss helps performance more than eliminating a parallel miss

    • Eliminating a higher-latency miss could help performance more than eliminating a lower-latency miss

An Example

Misses to blocks P1, P2, P3, P4 can be parallel

Misses to blocks S1, S2, and S3 are isolated

Two replacement algorithms:

  1. Minimizes miss count (Belady’s OPT)

  2. Reduces isolated misses (MLP-Aware)

For a fully associative cache containing 4 blocks

Fewest Misses != Best Performance

MLP-Aware Cache Replacement

  • How do we incorporate MLP into replacement decisions?

  • Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006.

    • Suggested reading for homework

Enabling Multiple Outstanding Misses

Handling Multiple Outstanding Accesses

  • Question: If the processor can generate multiple cache accesses, can the later accesses be handled while a previous miss is outstanding?

  • Goal I: Enable cache access when there is a pending miss

  • Goal II: Enable multiple misses in parallel

    • Memory-level parallelism (MLP)

  • Solution: Non-blocking or lockup-free caches

    • Kroft, “Lockup-Free Instruction Fetch/Prefetch Cache Organization,” ISCA 1981.

  • Idea: Keep track of the status/data of misses that are being handled using Miss Status Handling Registers (MSHR)

    • A cache access checks MSHR to see if a miss to the same block is already pending.

      • If pending, a new request is not generated

      • If pending and the needed data available, data forwarded to later load

    • Requires buffering of outstanding miss requests

Miss Status Handling Register

  • Also called “miss buffer”

  • Keeps track of

    • Outstanding cache misses

    • Pending load/store accesses that refer to the missing cache block

  • Fields of a single MSHR entry (a struct sketch follows this list)

    • Valid bit

    • Cache block address (to match incoming accesses)

    • Control/status bits (prefetch, issued to memory, which sub-blocks have arrived, etc.)

    • Data for each sub-block

    • For each pending load/store, keep track of

      • Valid, type, data size, byte in block, destination register or store buffer entry address
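
A minimal C sketch of one MSHR entry following the field list above; the counts (4 sub-blocks, 8 pending accesses, 16-byte sub-blocks) and field widths are assumptions.

#include <stdbool.h>
#include <stdint.h>

#define SUB_BLOCKS       4
#define SUB_BLOCK_BYTES 16
#define MAX_PENDING      8

// One pending load/store waiting on the missing block.
typedef struct {
    bool    valid;
    bool    is_load;          // access type
    uint8_t size;             // data size in bytes
    uint8_t byte_in_block;    // offset of the access within the block
    uint8_t dest;             // destination register or store-buffer entry
} PendingAccess;

// One MSHR entry ("miss buffer" slot).
typedef struct {
    bool          valid;
    uint64_t      block_addr;                         // matched against incoming accesses
    bool          is_prefetch;                        // control/status bits
    bool          issued_to_memory;
    bool          subblock_arrived[SUB_BLOCKS];
    uint8_t       data[SUB_BLOCKS][SUB_BLOCK_BYTES];  // data for each sub-block
    PendingAccess pending[MAX_PENDING];               // accesses waiting on this block
} MSHREntry;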

MSHR Entry Demo

MSHR Operation

  • On a cache miss (see the miss-handling sketch after this list):

    • Search MSHRs for a pending access to the same block

      • Found: Allocate a load/store entry in the same MSHR entry

      • Not found: Allocate a new MSHR

      • No free entry: stall (structural hazard)

  • When a sub-block returns from the next level in memory

    • Check which loads/stores waiting for it

      • Forward data to the load/store unit

      • Deallocate load/store entry in the MSHR entry

    • Write sub-block in cache or MSHR

    • If last sub-block, deallocate the MSHR entry (after writing the block in cache)
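
A minimal sketch of the miss-path MSHR check described above, reusing the hypothetical MSHREntry and PendingAccess types from the earlier sketch; NUM_MSHRS, the helper function, and the return codes are assumptions.

#define NUM_MSHRS 16

MSHREntry mshr[NUM_MSHRS];

typedef enum { MISS_MERGED, MISS_ALLOCATED, MISS_STALL } MissResult;

// Record another pending load/store in an existing entry; fail if it is full.
static bool add_pending(MSHREntry *e, PendingAccess acc)
{
    for (int p = 0; p < MAX_PENDING; p++) {
        if (!e->pending[p].valid) {
            e->pending[p] = acc;
            e->pending[p].valid = true;
            return true;
        }
    }
    return false;
}

MissResult handle_miss(uint64_t block_addr, PendingAccess acc)
{
    // A miss to the same block is already pending: merge, no new request.
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshr[i].valid && mshr[i].block_addr == block_addr)
            return add_pending(&mshr[i], acc) ? MISS_MERGED : MISS_STALL;

    // Not found: allocate a new MSHR entry and issue the request to memory.
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (!mshr[i].valid) {
            mshr[i] = (MSHREntry){ .valid = true, .block_addr = block_addr,
                                   .issued_to_memory = true };
            add_pending(&mshr[i], acc);
            return MISS_ALLOCATED;
        }
    }
    return MISS_STALL;   // no free MSHR entry: structural hazard, stall the access
}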

Non-Blocking Cache Implementation

  • When to access the MSHR?

    • In parallel with the cache?

    • After cache access is complete?

  • MSHR need not be on the critical path of hit requests

    • Which one below is the common case?

      • Cache miss, MSHR hit

      • Cache hit

Multi-Core Issues in Caching

Caches in Multi-Core Systems

  • Cache efficiency becomes even more important in a multi-core/multi-threaded system

    • Memory bandwidth is at a premium

    • Cache space is a limited resource

  • How do we design the caches in a multi-core system?

  • Many decisions

    • Shared vs. private caches

    • How to maximize performance of the entire system?

    • How to provide QoS to different threads in a shared cache?

    • Should cache management algorithms be aware of threads?

    • How should space be allocated to threads in a shared cache?

Private vs. Shared Caches

  • Private cache: Cache belongs to one core (a shared block can be in multiple caches)

  • Shared cache: Cache is shared by multiple cores

Resource Sharing and Its Advantages

  • Idea: Instead of dedicating a hardware resource to a hardware context, allow multiple contexts to use it

    • Example resources: functional units, pipeline, caches, buses, memory

  • Why?

    • +Resource sharing improves utilization/efficiency -> throughput

      • When a resource is left idle by one thread, another thread can use it; no need to replicate shared data

    • +Reduces communication latency

      • For example, shared data kept in the same cache in multithreaded processors

    • +Compatible with the shared memory model

Resource Sharing Disadvantages

  • Resource sharing results in contention for resources

    • When the resource is not idle, another thread cannot use it

    • If space is occupied by one thread, another thread needs to re-occupy it

  • Sometimes reduces the performance of some or all threads

    • Thread performance can be worse than when it is run alone

  • Eliminates performance isolation -> inconsistent performance across runs [important for real-time systems]

    • Thread performance depends on co-executing threads

  • Uncontrolled (free-for-all) sharing degrades QoS

    • Causes unfairness, starvation

    • Need to efficiently and fairly utilize shared resources

    • This is crucial for modern data centers

Shared Caches Between Cores

  • Advantages:

    • High effective capacity

    • Dynamic partitioning of available cache space

      • No fragmentation due to static partitioning

    • Easier to maintain coherence (single copy)

    • Shared data and locks do not ping pong between caches

  • Disadvantages

    • Slower access

    • Cores incur conflict misses due to other cores’ accesses

      • Misses due to inter-core interference

    • Some cores can destroy the hit rate of other cores

    • Guaranteeing a minimum level of service (or fairness) to each core is harder (how much space, how much bandwidth?)

Shared Caches: How to Share

  • Free-for-all sharing

    • Placement/replacement policies are the same as a single core system (usually LRU or pseudo-LRU)

    • Not thread/application aware

    • An incoming block evicts a block regardless of which threads the blocks belong to

  • Problems

    • Inefficient utilization of cache: LRU is not the best policy

    • A cache-unfriendly application can destroy the performance of a cache-friendly application

    • Not all applications benefit equally from the same amount of cache: free-for-all might prioritize those that do not benefit

    • Reduced performance, reduced fairness

Optimization: Utility Based Cache Partitioning

  • Goal: Maximize system throughput

  • Observation: Not all threads/applications benefit equally from caching -> simple LRU replacement not good for system throughput

  • Idea: Allocate more cache space to applications that obtain the most benefit from more space (a simplified greedy sketch follows this list)

  • The high-level idea can be applied to other shared resources as well.

  • Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006.

  • Suh et al., “A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning,” HPCA 2002
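
A simplified greedy sketch of utility-based way partitioning for two applications; this is not the exact lookahead algorithm from the UCP paper, and hits[a][w], TOTAL_WAYS, and the assumed shadow-tag utility monitors are illustrative.

#define NUM_APPS    2
#define TOTAL_WAYS 16

// hits[a][w] = hits application a would get if given w ways of the shared cache
// (e.g., estimated by per-application shadow-tag utility monitors).
void partition_ways(const unsigned hits[NUM_APPS][TOTAL_WAYS + 1],
                    unsigned alloc[NUM_APPS])
{
    for (int a = 0; a < NUM_APPS; a++)
        alloc[a] = 0;

    // Hand out ways one at a time to the application with the largest
    // marginal utility (extra hits from one more way).
    for (int w = 0; w < TOTAL_WAYS; w++) {
        int best = 0;
        unsigned best_gain = 0;
        for (int a = 0; a < NUM_APPS; a++) {
            unsigned gain = hits[a][alloc[a] + 1] - hits[a][alloc[a]];
            if (gain > best_gain) { best_gain = gain; best = a; }
        }
        alloc[best]++;
    }
}

The resulting allocation can then be enforced in the replacement policy, e.g., by letting an incoming block from an application evict only within that application's allocated ways.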

The Multi-Core System: A Shared Resource View

Need for QoS and Shared Resource Mgmt

  • Why is unpredictable performance (or lack of QoS) bad? (think about it!)

  • Makes programmer’s life difficult

    • An optimized program can get low performance (and performance varies widely depending on co-runners)

  • Causes discomfort to user

    • An important program can starve

    • Examples from shared software resources

  • Makes system management difficult

    • How do we enforce a Service Level Agreement when sharing of hardware resources is uncontrollable?

Resource Sharing vs. Partitioning

  • Sharing improves throughput

    • Better utilization of space

  • Partitioning provides performance isolation (predictable performance)

    • Dedicated space

  • Can we get the benefits of both?

  • Idea: Design shared resources such that they are efficiently utilized, controllable and partitionable

    • No wasted resource + QoS mechanisms for threads

Shared Hardware Resources

  • Memory subsystem (in both multithreaded and multicore systems)

    • Non-private caches

    • Interconnects

    • Memory controllers, buses, banks

  • I/O subsystem (in both multithreaded and multi-core systems)

    • I/O, DMA controllers

    • Ethernet controllers

  • Processor (in multithreaded systems)

    • Pipeline resources

    • L1 caches
