Where there is a Memory Hierarchy, there is a Cache.

Cache is a damn compromise!

Posted by Wang Junwei on 2021-08-15
Memory Hierarchy & Caches

Ideally one would desire an indefinitely large memory capacity such that any particular… word would be immediately available… We are… forced to recognize the possibility of constructing a hierarchy of memories each of which has greater capacity than the preceding but which is less quickly accessible.

— Burks, Goldstine, and von Neumann, 1946

Cache is a damn compromise!

Introduction

Memory is the space that holds instructions and data for the CPU. Conceptually, the memory space is a linear array of bytes. The CPU runs extremely fast relative to memory, which means that, normally, fetching data and instructions from memory is very slow by comparison.

[Figure: the widening speed gap between the CPU and main memory]

Registers on the chip hold only a handful of the bytes the CPU uses most frequently, and they can be accessed in a single clock cycle. So why not embed tons of registers on the silicon around the ALU? Because integrated-circuit technology cannot afford to do that.

IC pioneers invented RAM in two flavors: SRAM (static) and DRAM (dynamic). SRAM is faster but more expensive than DRAM: an SRAM cell is a six-transistor circuit, which is much larger than a DRAM cell built from a single transistor and a capacitor. That is why SRAM sits next to the CPU on the chip, serving as the CPU's cache memory, while DRAM dwells on the motherboard as the runtime main memory.

Different storage technologies have different price and performance trade-offs.

So you must be wondering: why do caches work?

Locality

Programs tend to reference data items that are near other recently referenced data items, or that were themselves referenced recently.

The answer is locality. The statement above is also known as the principle of locality, and it has an enormous impact on the design of both hardware and software systems.

temporal locality

A memory location that is referenced once is likely to be referenced again multiple times in the near future.

spatial locality

If a memory location is referenced once, then the program is likely to reference a nearby memory location in the near future.
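
Here is a minimal sketch of both kinds of locality (the array size N and the function names are purely illustrative). Summing a 2-D array row by row walks memory with stride 1, so consecutive accesses land in the same cache block; summing it column by column jumps N elements between accesses, ruining spatial locality even though both loops do the same work.

```c
#include <stdio.h>

#define N 2048

static double a[N][N];

/* Row-major traversal: consecutive iterations touch adjacent bytes
 * (stride-1 access), so spatial locality is excellent; the variable
 * sum is reused on every iteration, giving good temporal locality. */
double sum_rowwise(void)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal of the same array: each access is N doubles
 * away from the previous one, so nearly every access misses the cache. */
double sum_colwise(void)
{
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

int main(void)
{
    printf("%f %f\n", sum_rowwise(), sum_colwise());
    return 0;
}
```

On most machines the row-wise loop is noticeably faster, even though both functions compute exactly the same sum.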

Memory Hierarchy

In general, the storage devices get slower, cheaper, and larger as we move from higher to lower levels.

At the highest level (L0) are a small number of fast CPU registers that the CPU can access in a single clock cycle. Next are one or more small to moderate-size SRAM-based cache memories that can be accessed in a few CPU clock cycles. These are followed by a large DRAM-based main memory that can be accessed in tens to hundreds of clock cycles. Next are slow but enormous local disks. Finally, some systems even include an additional level of disks on remote servers that can be accessed over a network. For example, distributed file systems such as the Andrew File System (AFS) or the Network File System (NFS) allow a program to access files that are stored on remote network-connected servers. Similarly, the World Wide Web allows programs to access remote files stored on Web servers anywhere in the world.

[Figure: the memory hierarchy, from CPU registers at L0 down to local disks and remote storage]

Caching

Generally speaking, a cache is a small, fast storage device that acts as a staging area for the data held in a larger but slower device at the next lower level. The process of using a cache is known as caching.

[Figure: blocks being copied between level k and level k+1 of the memory hierarchy]

Data and instructions are constantly copied back and forth between level k and level k+1 in block-size transfer units.

cache hits

When a program needs a particular data object d from level k + 1, it first looks for d in one of the blocks currently stored at level k. If d happens to be cached at level k, then we have what is called a cache hit. The program reads d directly from level k, which by the nature of the memory hierarchy is faster than reading d from level k + 1. For example, a program with good temporal locality might read a data object from block 14, resulting in a cache hit from level k.

cache misses

In contrast, if d is not cached at level k, we have a cache miss. A miss leads to a block-size transfer from level k+1 to level k, which may or may not need to evict an existing block.

So how does the cache choose a victim block to evict? That choice is made by a replacement policy, such as LRU (least recently used), LFU (least frequently used), or MRU (most recently used).

On a miss, the cache fetches the missing block from the next lower level. For the on-chip SRAM caches this is handled entirely by hardware; only when main memory itself misses (a page fault) does the system raise an exception for the operating system to handle. The sketch below plays through hits, misses, and LRU eviction in software.
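
This toy, fully associative cache is only a software analogy (the sizes NBLOCKS and BLOCKSIZE, and the byte array standing in for level k+1, are made up for the example), but it shows the hit path, the miss path, and LRU victim selection in a few dozen lines.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NBLOCKS   4                       /* cache capacity in blocks      */
#define BLOCKSIZE 16                      /* bytes per block               */
#define MEMSIZE   1024                    /* size of the "level k+1" array */

static uint8_t memory[MEMSIZE];           /* slower, larger level k+1      */

struct slot {
    int      valid;
    uint32_t block_id;                    /* which memory block is cached  */
    uint64_t last_used;                   /* timestamp for LRU             */
    uint8_t  data[BLOCKSIZE];
};

static struct slot cache[NBLOCKS];        /* faster, smaller level k       */
static uint64_t ticks;
static long hits, misses;

uint8_t cached_read(uint32_t addr)
{
    uint32_t block_id = addr / BLOCKSIZE;
    uint32_t offset   = addr % BLOCKSIZE;
    ticks++;

    /* Hit path: the block is already resident at level k. */
    for (int i = 0; i < NBLOCKS; i++) {
        if (cache[i].valid && cache[i].block_id == block_id) {
            hits++;
            cache[i].last_used = ticks;
            return cache[i].data[offset];
        }
    }

    /* Miss path: evict the least recently used slot (empty slots have
     * last_used == 0, so they are filled first), then copy the whole
     * block up from level k+1. */
    misses++;
    int victim = 0;
    for (int i = 1; i < NBLOCKS; i++)
        if (cache[i].last_used < cache[victim].last_used)
            victim = i;

    cache[victim].valid     = 1;
    cache[victim].block_id  = block_id;
    cache[victim].last_used = ticks;
    memcpy(cache[victim].data, &memory[block_id * BLOCKSIZE], BLOCKSIZE);
    return cache[victim].data[offset];
}

int main(void)
{
    for (int i = 0; i < MEMSIZE; i++) memory[i] = (uint8_t)i;

    /* A sequential sweep has good spatial locality: one miss per block. */
    for (uint32_t addr = 0; addr < 256; addr++) cached_read(addr);
    printf("hits=%ld misses=%ld\n", hits, misses);  /* 240 hits, 16 misses */
    return 0;
}
```

Real hardware caches do all of this with comparators and multiplexers rather than loops, but the bookkeeping is the same.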

Data references with a cache

Take direct-mapped caches as an example, and simplify the memory hierarchy down to registers, an L1 cache, and main memory.

[Figure: a direct-mapped cache organized as S sets with exactly one line per set]

There is exactly one line per set.

Consider a computer system with m-bit virtual memory addresses (VA for short), giving an address space of M = 2^m bytes.

[Figure: an m-bit address partitioned into t tag bits, s set-index bits, and b block-offset bits]

A VA consists of three parts, tag bits, set-index bits, and a block offset (a bit-extraction sketch in C follows this list):

  • Set-index bits

    The s set-index bits select one of S = 2^s cache sets; they tell us which set the target word is stored in.

  • Tag bits

    The t tag bits form an identifier that is matched against the tag stored in the cache line of the selected set.

  • Block offset

    A cache line holds a block of 2^b bytes, and the b offset bits give the position of the target word relative to the start of that block.
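
A minimal sketch of how the three fields are carved out of an address with shifts and masks. The geometry (64-byte blocks, 128 sets) and the example address are assumptions made purely for illustration.

```c
#include <stdio.h>

/* Hypothetical cache geometry, chosen only for this example:
 *   64-byte blocks -> b = 6 block-offset bits
 *   128 sets       -> s = 7 set-index bits
 * The remaining high-order bits of the address form the tag. */
#define B_BITS 6
#define S_BITS 7

int main(void)
{
    unsigned long long addr   = 0x7ffd1234abcdULL;              /* example VA */
    unsigned long long offset =  addr            & ((1ULL << B_BITS) - 1);
    unsigned long long set    = (addr >> B_BITS) & ((1ULL << S_BITS) - 1);
    unsigned long long tag    =  addr >> (B_BITS + S_BITS);

    printf("tag = %#llx, set index = %llu, block offset = %llu\n",
           tag, set, offset);
    return 0;
}
```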

Data reference on a hit

  1. Set selection

    The set-index bits of the address select the single set that could hold the target word.

    [Figure: set selection in a direct-mapped cache]

  2. Line matching

    The one line in that set matches only if its valid bit is set and its stored tag equals the tag bits of the address.

    [Figure: line matching and word selection in a direct-mapped cache]

  3. Word extraction

    The block offset gives the position of the target word inside the cached block. (A toy version of all three steps in C follows this list.)
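
Here are the same three steps written as a toy direct-mapped lookup. The geometry and every identifier (l1_read_byte, B_BITS, S_BITS, and so on) are assumptions for the sake of the sketch, not anyone's real cache.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Toy direct-mapped L1 cache: exactly one line per set. */
#define B_BITS    6                        /* 64-byte blocks */
#define S_BITS    7                        /* 128 sets       */
#define NSETS     (1u << S_BITS)
#define BLOCKSIZE (1u << B_BITS)

struct line {
    int      valid;
    uint64_t tag;
    uint8_t  block[BLOCKSIZE];
};

static struct line sets[NSETS];            /* one line per set */

/* Returns 1 and stores the byte in *out on a hit; returns 0 on a miss. */
int l1_read_byte(uint64_t addr, uint8_t *out)
{
    /* 1. Set selection: the set-index bits pick exactly one set.      */
    uint64_t set = (addr >> B_BITS) & (NSETS - 1);

    /* 2. Line matching: a hit requires the line to be valid and its
     *    stored tag to equal the tag bits of the address.             */
    uint64_t tag = addr >> (B_BITS + S_BITS);
    struct line *ln = &sets[set];
    if (!ln->valid || ln->tag != tag)
        return 0;                          /* miss: go to level k+1    */

    /* 3. Word (here: byte) extraction at the block offset.            */
    *out = ln->block[addr & (BLOCKSIZE - 1)];
    return 1;
}

int main(void)
{
    /* Pretend the block containing address 0x1040 is already resident. */
    uint64_t addr = 0x1040;
    uint64_t set  = (addr >> B_BITS) & (NSETS - 1);
    sets[set].valid = 1;
    sets[set].tag   = addr >> (B_BITS + S_BITS);
    memset(sets[set].block, 0xAB, BLOCKSIZE);

    uint8_t byte = 0;
    if (l1_read_byte(addr, &byte))
        printf("hit: byte = %#x\n", byte);
    else
        printf("miss\n");
    return 0;
}
```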

Data reference on a miss

To be continued… we will come back to this in a later post on virtual memory.

Cache management

Briefly, the storage device at each level is a cache for the next lower level.

There must be some form of policy or logic, implemented in hardware or software, that manages each cache: partitioning the storage into blocks, transferring blocks between levels, and handling hits and misses.

  • L0, registers: managed by the compiler at compilation time.
  • L1, L2, L3 caches: reside on the chip and are managed by built-in hardware logic.
  • Main memory: managed by both hardware (address translation) and software (the OS).
  • Disk: HDDs and SSDs are managed by their device drivers.
  • Distributed file systems: such as HDFS, GFS, and so on, accessed for example through hdfs-mount.

memory mountain

[Figure: the memory mountain of an Intel Core i7 — read throughput as a function of working-set size and stride]

The geography of the Core i7 mountain reveals a rich structure: read throughput rises on ridges where the working set fits within L1, L2, or L3, and it falls off along the stride axis as spatial locality degrades.
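
A rough sketch of the kind of measurement behind such a plot: sweep an array with a given stride, time repeated runs, and report MB/s. The timing below uses clock() and a fixed repetition count just to keep the example short; the real memory-mountain code in CS:APP uses cycle counters and best-of-k measurements.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Read every stride-th element of data[0..elems-1] and return the sum. */
long sumarray(long *data, long elems, long stride)
{
    long acc = 0;
    for (long i = 0; i < elems; i += stride)
        acc += data[i];
    return acc;
}

/* Crude throughput estimate in MB/s for one (working set, stride) point. */
double run_mb_per_s(long *data, long elems, long stride)
{
    volatile long sink;                       /* keep the reads alive */
    clock_t start = clock();
    for (int rep = 0; rep < 100; rep++)
        sink = sumarray(data, elems, stride);
    (void)sink;
    double secs  = (double)(clock() - start) / CLOCKS_PER_SEC;
    double bytes = 100.0 * (elems / stride) * sizeof(long);
    return bytes / (1024.0 * 1024.0) / secs;
}

int main(void)
{
    long elems = 1L << 22;                    /* 32 MiB working set */
    long *data = malloc(elems * sizeof(long));
    if (!data) return 1;
    for (long i = 0; i < elems; i++) data[i] = i;

    for (long stride = 1; stride <= 16; stride *= 2)
        printf("stride %2ld: %.1f MB/s\n",
               stride, run_mb_per_s(data, elems, stride));

    free(data);
    return 0;
}
```

Varying elems as well as stride, and plotting the results on two axes, is what produces the mountain.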


Writing Cache-Friendly Code

  1. Small and Simple First-Level Caches to Reduce Hit Time and Power

  2. Way Prediction to Reduce Hit Time

  3. Pipelined Access and Multibanked Caches to Increase Bandwidth

  4. Nonblocking Caches to Increase Cache Bandwidth

  5. Critical Word First and Early Restart to Reduce Miss Penalty

  6. Merging Write Buffer to Reduce Miss Penalty

  7. Compiler Optimizations to Reduce Miss Rate (see the blocking sketch after this list)

  8. Hardware Prefetching of Instructions and Data to Reduce Miss Penalty or Miss Rate

  9. Compiler-Controlled Prefetching to Reduce Miss Penalty or Miss Rate

  10. Using HBM to Extend the Memory Hierarchy
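
Most of the items above are hardware design techniques, but item 7 is something a programmer can also apply by hand. Below is a minimal sketch of loop blocking (tiling) for matrix multiply; the sizes N and BSIZE are illustrative and not tuned for any particular machine.

```c
#include <stdio.h>

#define N     512
#define BSIZE 32     /* tile size: three BSIZE x BSIZE tiles should fit in cache */

static double A[N][N], B[N][N], C[N][N];

/* Blocked (tiled) matrix multiply: each tile of A and B is reused many
 * times while it is still resident in the cache, which cuts the miss
 * rate compared with the straightforward triple loop. */
void matmul_blocked(void)
{
    for (int ii = 0; ii < N; ii += BSIZE)
        for (int kk = 0; kk < N; kk += BSIZE)
            for (int jj = 0; jj < N; jj += BSIZE)
                for (int i = ii; i < ii + BSIZE; i++)
                    for (int k = kk; k < kk + BSIZE; k++) {
                        double aik = A[i][k];
                        for (int j = jj; j < jj + BSIZE; j++)
                            C[i][j] += aik * B[k][j];
                    }
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 1.0;
            B[i][j] = 1.0;
        }

    matmul_blocked();
    printf("C[0][0] = %.0f (expected %d)\n", C[0][0], N);
    return 0;
}
```

The same idea is what a compiler's loop-nest optimizer tries to apply automatically when it can prove the transformation safe.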