ACE Journal

Cache Hierarchy Trade-offs in Modern CPUs

Abstract:
This article examines the design choices in multi-level cache hierarchies, exploring trade-offs between latency, capacity, power consumption, and die area. It analyzes how L1, L2, and L3 caches interact to optimize performance and reduce memory bottlenecks for various workloads. Benchmark results and case studies demonstrate practical implications for contemporary CPU architectures.

Introduction

Modern processors rely heavily on cache hierarchies to bridge the performance gap between the CPU core and main memory. As CPU designs have evolved, on-die cache sizes have grown, and multi-level hierarchies (L1 → L2 → L3) have become the norm. Each level in this hierarchy serves a distinct role: small, ultra-fast L1 caches aim to supply data at near-register speed, while larger, slightly slower L2 and L3 caches aim to capture a larger share of the working set. However, increasing cache size and depth introduces trade-offs in latency, power, and die area. In this article, we explore how architects balance these competing factors and present benchmark results to illustrate real-world impacts.

Cache Basics

Before diving into multi-level trade-offs, it is essential to review a few basic cache concepts:

  • Cache line: the unit of transfer between levels, typically 64 bytes.
  • Hit and miss: a lookup either finds the requested line (a hit) or does not (a miss); the miss rate is the fraction of accesses that miss.
  • Associativity: the number of ways (locations) in which a given line may reside; higher associativity reduces conflict misses but raises lookup energy and latency.
  • Working set: the data a program actively touches over a window of time; a cache level is effective when it captures most of the working set.
  • AMAT (average memory access time): the hit latency plus the miss rate times the miss penalty; it is the basic figure of merit used throughout this article.

Multi-Level Cache Hierarchy

A typical three-level cache hierarchy consists of:

  1. L1 Cache (Instruction + Data)
    • Size: Often 32 KB – 64 KB per core.
    • Associativity: Typically 8-way or higher.
    • Latency: ~3 – 5 cycles.
    • Role: Capture the hottest working set; feed the core at near-register speed.
  2. L2 Cache (Unified or Split)
    • Size: 256 KB – 1 MB per core.
    • Associativity: Often 4-way or 8-way.
    • Latency: ~10 – 15 cycles.
    • Role: Provide higher capacity for working sets that cannot fit in L1; typically private per core.
  3. L3 Cache (Last-Level Cache, LLC)
    • Size: 4 MB – 64 MB shared across all cores (depending on design).
    • Associativity: 16-way or higher.
    • Latency: ~30 – 50 cycles.
    • Role: Capture large working sets across threads; interface to main memory controller.
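
The step pattern in these latencies can be observed directly from software. The sketch below is a rough illustration rather than a rigorous microbenchmark: it assumes a POSIX system (for clock_gettime), and the buffer sizes are guesses meant to land in L1, L2, L3, and DRAM respectively. It chases a chain of dependent pointers laid out in a random cyclic order, so the time per load steps up each time the buffer outgrows a cache level.

    /* A rough, illustrative pointer-chasing sketch (not a rigorous
     * microbenchmark). It walks a chain of dependent loads laid out in a
     * random cyclic order, so the hardware prefetcher gets little help and
     * the average time per load approximates the latency of whichever
     * level of the hierarchy the buffer fits in. Assumes a POSIX system. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double ns_per_access(size_t bytes, size_t iters) {
        size_t n = bytes / sizeof(void *);
        void **buf = malloc(n * sizeof(void *));
        size_t *perm = malloc(n * sizeof(size_t));
        if (!buf || !perm) { perror("malloc"); exit(1); }

        for (size_t i = 0; i < n; i++) perm[i] = i;
        for (size_t i = n - 1; i > 0; i--) {            /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (size_t i = 0; i < n; i++)                  /* link one big cycle */
            buf[perm[i]] = &buf[perm[(i + 1) % n]];

        void **p = &buf[perm[0]];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < iters; i++)
            p = (void **)*p;                            /* dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (p == NULL) puts("unreachable");             /* keep the chain live */

        free(buf);
        free(perm);
        return ((t1.tv_sec - t0.tv_sec) * 1e9 +
                (t1.tv_nsec - t0.tv_nsec)) / (double)iters;
    }

    int main(void) {
        /* Buffer sizes are guesses meant to land in L1, L2, L3, and DRAM. */
        size_t sizes[] = { 16u << 10, 256u << 10, 8u << 20, 128u << 20 };
        for (int i = 0; i < 4; i++)
            printf("%9zu KB: %6.1f ns/access\n",
                   (size_t)(sizes[i] >> 10), ns_per_access(sizes[i], 20u * 1000 * 1000));
        return 0;
    }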

How Levels Interact

When the CPU core issues a load or store:

  1. Probe L1: On a hit, data is returned in ~3 cycles.
  2. If L1 Miss → Probe L2: A hit in L2 incurs ~10 cycles. If still a miss:
  3. Probe L3: A hit in L3 incurs ~30–50 cycles.
  4. If L3 Miss → Access DRAM: Main memory access costs ~150 – 200 cycles or more, depending on DRAM speed and contention.

Each cache level reduces average memory access time (AMAT), but adding capacity and associativity tends to increase hit latency. Architects must weigh the relative benefits of lower miss rates against higher hit latencies.
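
This trade-off can be quantified with the standard AMAT recurrence. The short sketch below plugs in the mid-range latencies listed above together with assumed, purely illustrative per-level miss rates; none of the hit or miss rates are measured values.

    /* Back-of-the-envelope AMAT for the three-level hierarchy sketched above.
     * Latencies use the mid-range figures quoted in the text; the per-level
     * miss rates are illustrative assumptions, not measurements. */
    #include <stdio.h>

    int main(void) {
        double l1 = 4.0, l2 = 12.0, l3 = 40.0, dram = 180.0; /* latencies in cycles */
        double m1 = 0.05, m2 = 0.40, m3 = 0.30;              /* assumed local miss rates */

        /* AMAT = L1_hit + m1 * (L2_hit + m2 * (L3_hit + m3 * DRAM)) */
        double amat = l1 + m1 * (l2 + m2 * (l3 + m3 * dram));
        printf("AMAT = %.2f cycles\n", amat);                /* ~6.5 cycles with these numbers */
        return 0;
    }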

Latency vs. Capacity Trade-Offs

L1 Cache

Most core designs settle on 32 KB – 64 KB to balance latency and miss rate for typical integer and floating-point workloads. Doubling L1 size often yields marginal miss-rate improvement at a disproportionate latency and power cost.
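
To make this concrete with the illustrative AMAT numbers above (roughly a 50-cycle average penalty for an L1 miss): cutting the L1 miss rate from 5% to 4% by doubling capacity saves only about 0.5 cycles of AMAT, and if the larger array adds even one cycle of hit latency, AMAT rises from about 6.5 to about 7.0 cycles, a net loss. The figures are assumptions, but the shape of the trade-off is typical.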

L2 Cache

Designers must assess target workloads (e.g., single-threaded SPEC CPU runs vs. multi-threaded server loads) to choose an L2 size. Many modern CPU cores use 512 KB – 1 MB of private L2 as a compromise.

L3 Cache (Last-Level Cache)

Because L3 is shared, its size choice depends on core count and target application domains. For server processors running multi-tenant applications, a larger LLC reduces DRAM pressure. In mobile SoCs, LLC may be smaller (4 – 8 MB) to conserve silicon budget and power.

Power and Die-Area Considerations

While larger caches reduce accesses to DRAM (which can cost ~1 – 2 nJ per access), they also increase on-die silicon area and both dynamic and leakage power:

  1. Dynamic Power: Proportional to switching activity, capacitance (which grows with array size), clock frequency, and supply voltage squared, i.e. roughly P ≈ α · C · V² · f.
  2. Leakage Power: Sub-threshold leakage per transistor grows exponentially as threshold voltages fall with shrinking feature sizes; larger arrays have more transistors leaking even when idle.
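
To see why the DRAM-energy savings must be weighed against the array's own power, consider the rough comparison below. Only the ~1 – 2 nJ DRAM figure comes from the discussion above; the LLC access energy, the extra leakage, and the traffic rate are illustrative assumptions.

    /* Rough energy comparison for growing the LLC: every DRAM access the
     * larger cache newly absorbs saves (DRAM energy - LLC energy) per access,
     * but the bigger array leaks continuously. All values are illustrative
     * assumptions except the ~1-2 nJ DRAM figure quoted in the text. */
    #include <stdio.h>

    int main(void) {
        double e_dram = 1.5e-9;          /* energy per DRAM access (J), ~1-2 nJ */
        double e_llc = 0.5e-9;           /* assumed energy per LLC access (J) */
        double extra_hits_per_s = 50e6;  /* assumed misses converted to hits each second */
        double extra_leak_w = 0.05;      /* assumed extra leakage of the larger array (W) */

        double saved_w = extra_hits_per_s * (e_dram - e_llc);  /* dynamic power saved */
        printf("Dynamic power saved: %.3f W\n", saved_w);
        printf("Extra leakage:       %.3f W\n", extra_leak_w);
        printf("Net:                 %+.3f W\n", saved_w - extra_leak_w);
        return 0;
    }

With these particular numbers the saving and the extra leakage roughly cancel, which is exactly why the right LLC size depends on workload intensity and duty cycle.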

Architects often mitigate these power costs through way prediction (probing a predicted way first so that only part of a set is read on most accesses), selective power gating of idle ways or banks, and lower-voltage SRAM cells or retention states for less frequently accessed arrays.

Trading off die area for performance also impacts yield: a larger die is more likely to contain a manufacturing defect, although cache macros can often tolerate defects through spare rows and columns. In mobile and embedded domains, smaller caches are often preferred to maintain cost targets.

Case Studies and Benchmarks

Below are two illustrative case studies comparing different cache configurations on a 4-core CPU running a mix of desktop and server workloads:


Case Study 1: Desktop-Focused Core


Case Study 2: Server-Focused Core


Practical Implications for Architects

  1. Identify Target Workloads
    • Latency-Sensitive (e.g., gaming, HPC): Favor smaller, fast L1/L2 caches to minimize access latency.
    • Throughput-Oriented (e.g., data centers, ML inference): Invest in larger LLC (L3) to reduce DRAM bandwidth pressure.
  2. Balance Cache Levels
    • Over-investing in L1 or L2 without sufficient L3 capacity can lead to high DRAM traffic.
    • Conversely, a massive L3 with tiny L1/L2 may suffer high average hit latencies.
  3. Power Budget Constraints
    • In mobile/embedded SoCs, limit L3 size (< 8 MB) and use low-voltage SRAM to reduce leakage.
    • Implement selective power gating to curb leakage and way prediction to curb dynamic lookup power.
  4. Die-Area and Cost
    • Larger caches increase silicon area and production cost.
    • Use process node scaling (e.g., 7 nm, 5 nm) to shrink macro sizes, but leakage grows at smaller nodes.
  5. Future Trends
    • 3D-Stacked DRAM (HBM) may alter cache hierarchies by offering much higher bandwidth memory near compute tiles.
    • On-Package Caches: Emerging architectures place large LLC in a separate cache die, allowing independent scaling.
    • Non-Volatile Caches: Emerging technologies (e.g., MRAM, RRAM) could reduce leakage in large caches.

Conclusion

Designing an effective cache hierarchy demands careful trade-offs among latency, capacity, power, and die area. L1 caches must remain small and fast, while L2 and L3 sizes should reflect workload requirements. Benchmark data shows that modest increases in L2/L3 often yield tangible performance gains, but at the cost of higher power and silicon area. As semiconductor technologies evolve (smaller geometries, 3D stacking, novel memory cells), architects must revisit traditional trade-offs to achieve optimal balance. A nuanced understanding of how each cache level interacts is essential to building CPUs that meet diverse performance and power targets in both desktop and server domains.
