Ideally one would desire an indefinitely large memory capacity such that any particular… word would be immediately available… We are… forced to recognize the possibility of constructing a hierarchy of memories each of which has greater capacity than the preceding but which is less quickly accessible.
Cache is a damn compromise!
Introduction
Memory is the space that holds instructions and data for the CPU. Conceptually, the memory space is a linear array of bytes. The CPU runs extremely fast relative to memory, so ordinary data and instruction accesses to main memory look very slow by comparison.
On-chip registers hold only a few words of the data the CPU uses most frequently, and they can be accessed in a single clock cycle. So why not embed tons of registers on the silicon around the ALU? Because integrated-circuit technology cannot afford it: register files are expensive in chip area, power, and wiring.
Instead, IC pioneers invented RAM in two flavors: SRAM (static) and DRAM (dynamic). SRAM is faster but more expensive than DRAM: an SRAM cell is a six-transistor circuit, which is much larger than a DRAM cell built from a single transistor and a capacitor. That is why SRAM sits next to the CPU on the chip and serves as cache memory, while DRAM lives on the motherboard as the runtime main memory.
Different storage technologies have different price and performance trade-offs.
So you must be wondering: why do caches work?
Locality
Programs tend to reference data items that are near other recently referenced data items, or that were recently referenced themselves.
The answer is locality. The statement above is known as the principle of locality, and it has an enormous impact on the design of both hardware and software systems. Locality comes in two forms:
- Temporal locality: a memory location that is referenced once is likely to be referenced again multiple times in the near future.
- Spatial locality: once a memory location is referenced, the program is likely to reference a nearby memory location in the near future.
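Both forms show up in even the simplest code. Here is a minimal C sketch (the function name sumvec is just for illustration): sum and i are reused on every iteration (temporal locality), and the elements of v are read sequentially with stride 1 (spatial locality).

```c
#include <stdio.h>

/* Sums a vector of n ints.
 * - Temporal locality: sum and i are reused in every iteration
 *   (in practice the compiler keeps them in registers).
 * - Spatial locality: v[0], v[1], ... are referenced with stride 1,
 *   so each cache-line fill serves several subsequent accesses. */
long sumvec(const int *v, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

int main(void)
{
    int v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%ld\n", sumvec(v, 8));
    return 0;
}
```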
Memory Hierarchy
In general, the storage devices get slower, cheaper, and larger as we move from higher to lower levels.
At the highest level (L0) are a small number of fast CPU registers that the CPU can access in a single clock cycle. Next are one or more small to moderate-size SRAM-based cache memories that can be accessed in a few CPU clock cycles. These are followed by a large DRAM-based main memory that can be accessed in tens to hundreds of clock cycles. Next are slow but enormous local disks. Finally, some systems even include an additional level of disks on remote servers that can be accessed over a network. For example, distributed file systems such as the Andrew File System (AFS) or the Network File System (NFS) allow a program to access files that are stored on remote network-connected servers. Similarly, the World Wide Web allows programs to access remote files stored on Web servers anywhere in the world.
Caching
In general, a cache is a small, fast storage device that acts as a staging area for data stored in a larger, slower device at the next lower level. The process of using a cache is known as caching.
Data and instructions are copied back and forth between level k and level k+1 in block-size transfer units.
cache hits
When a program needs a particular data object d from level k + 1, it first looks for d in one of the blocks currently stored at level k. If d happens to be cached at level k, then we have what is called a cache hit. The program reads d directly from level k, which by the nature of the memory hierarchy is faster than reading d from level k + 1. For example, a program with good temporal locality might repeatedly read a data object that stays resident in a block at level k, hitting in the cache every time after the first access.
cache misses
If, on the other hand, the data object d is not cached at level k, then we have a cache miss. A miss causes the block containing d to be fetched from level k+1 into level k, which may or may not require evicting (replacing) an existing block.
How does the cache choose a victim block to evict? That decision is made by a replacement policy, such as LRU (least recently used), LFU (least frequently used), or MRU (most recently used).
When a miss occurs, the missing block is brought in from the next lower level. For the CPU caches this is done entirely by hardware while the pipeline stalls; only when main memory misses as a cache for disk (a page fault) does the hardware raise an exception and hand the transfer to the operating system.
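To make replacement concrete, here is a minimal software sketch of an LRU policy for a tiny fully associative cache. The structure and names (access_block, NLINES) are invented for illustration; real hardware tracks recency with much cheaper approximations.

```c
#include <stdint.h>
#include <stdbool.h>

#define NLINES 4   /* tiny fully associative cache, purely illustrative */

struct line {
    bool     valid;
    uint64_t tag;        /* block address stored in this line */
    uint64_t last_used;  /* timestamp of the most recent access */
};

static struct line cache[NLINES];
static uint64_t clock_ticks;

/* Returns true on a hit; on a miss, installs the block by evicting
 * an invalid line if one exists, otherwise the least recently used line. */
bool access_block(uint64_t tag)
{
    clock_ticks++;

    /* Hit path: look for a matching valid line. */
    for (int i = 0; i < NLINES; i++) {
        if (cache[i].valid && cache[i].tag == tag) {
            cache[i].last_used = clock_ticks;
            return true;
        }
    }

    /* Miss path: pick the victim line. */
    int victim = 0;
    for (int i = 1; i < NLINES; i++) {
        if (!cache[i].valid) { victim = i; break; }
        if (cache[i].last_used < cache[victim].last_used)
            victim = i;
    }
    cache[victim].valid = true;
    cache[victim].tag = tag;
    cache[victim].last_used = clock_ticks;
    return false;
}
```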
Data references with a cache
Take a direct-mapped cache as an example, and simplify the memory hierarchy to registers, an L1 cache, and main memory. In a direct-mapped cache there is exactly one line per set.
Consider a system with m-bit memory addresses, giving an address space of M = 2^m unique addresses. For a cache lookup, an address is partitioned into three fields: t tag bits, s set-index bits, and b block-offset bits, where t = m - (s + b):
- Set index: the s set-index bits select one of the S = 2^s cache sets; they tell us which set the target word must live in, if it is cached at all.
- Tag: the t tag bits are an identifier matched against the tag stored with a cache line, to confirm that the line really holds the requested block.
- Block offset: each cache line holds a block of B = 2^b bytes; the b offset bits give the target word's position from the start of that block.
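Extracting the fields is just shifting and masking. A small sketch, assuming 64-byte blocks (b = 6) and 64 sets (s = 6); these constants are an assumption for illustration, not a specific machine's cache geometry.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 64-byte blocks (b = 6) and 64 sets (s = 6). */
#define B_BITS 6
#define S_BITS 6

static uint64_t block_offset(uint64_t addr) { return addr & ((1ULL << B_BITS) - 1); }
static uint64_t set_index(uint64_t addr)    { return (addr >> B_BITS) & ((1ULL << S_BITS) - 1); }
static uint64_t tag_bits(uint64_t addr)     { return addr >> (B_BITS + S_BITS); }

int main(void)
{
    uint64_t addr = 0x7ffe1234;
    printf("addr=0x%llx tag=0x%llx set=%llu offset=%llu\n",
           (unsigned long long)addr,
           (unsigned long long)tag_bits(addr),
           (unsigned long long)set_index(addr),
           (unsigned long long)block_offset(addr));
    return 0;
}
```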
Data references on a hit
A hit proceeds in three steps, modeled in the sketch below:
- Set selection: the set-index bits pick exactly one cache set.
- Line matching: the line in that set matches only if its valid bit is set and its stored tag equals the address's tag bits.
- Word extraction: the block-offset bits locate the target word inside the matched line's block.
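A minimal software model of those three steps for a direct-mapped cache, reusing the geometry assumed above (the struct and function names are mine, and the 4-byte load is assumed to be aligned within its block):

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define B_BITS 6                 /* 64-byte blocks */
#define S_BITS 6                 /* 64 sets        */
#define NSETS  (1u << S_BITS)
#define BSIZE  (1u << B_BITS)

/* Direct-mapped: exactly one line per set. */
struct dm_line {
    bool     valid;
    uint64_t tag;
    uint8_t  block[BSIZE];
};

static struct dm_line sets[NSETS];

/* Returns true and copies the word out on a hit, false on a miss. */
bool dm_load(uint64_t addr, uint32_t *word_out)
{
    uint64_t offset = addr & (BSIZE - 1);
    uint64_t set    = (addr >> B_BITS) & (NSETS - 1);
    uint64_t tag    = addr >> (B_BITS + S_BITS);

    struct dm_line *line = &sets[set];                 /* 1. set selection */
    if (!line->valid || line->tag != tag)              /* 2. line matching */
        return false;                                  /*    -> cache miss */

    /* 3. word extraction (assumes the word lies entirely in this block) */
    memcpy(word_out, &line->block[offset], sizeof *word_out);
    return true;
}
```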
Data references on a miss
To be continued: we will come back to misses in a later post on virtual memory.
Cache management
Briefly, the storage device at each level acts as a cache for the next lower level.
Some combination of hardware and software logic must manage each of these caches: partitioning storage into blocks, transferring blocks between levels, and deciding what happens on hits and misses.
- L0, registers: allocated by the compiler at compile time.
- L1, L2, L3 caches: reside on chip and are managed entirely by built-in hardware.
- Main memory: managed jointly by hardware (address translation) and software (the OS's virtual memory system).
- Disk: HDD/SSD caching is handled by the device's own driver and firmware.
- Distributed file systems: HDFS, GFS, and the like, accessed through clients or mounts such as hdfs-mount.
Memory mountain
A memory mountain plots read throughput as a function of working-set size and stride. The geography of the Core i7 mountain reveals a rich structure: ridges of temporal locality, one for each level of the hierarchy, and slopes of spatial locality where throughput falls as the stride grows.
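Here is a sketch of the kind of read kernel used to trace out the mountain: sweep the working-set size and the stride, time this function for each pair, and plot the achieved bandwidth. The function below is my own rendering of the idea (with an assumed 64 MB source array), not this post's benchmark code.

```c
#include <stddef.h>

#define MAXELEMS (64 * 1024 * 1024 / sizeof(int))  /* 64 MB working set */
static int data[MAXELEMS];

/* Reads `elems` ints from data[] with the given stride. */
int run(long elems, long stride)
{
    long i;
    int acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    long limit = elems - 4 * stride;

    /* Four accumulators expose instruction-level parallelism so the
     * measurement is limited by the memory system, not the adder. */
    for (i = 0; i < limit; i += 4 * stride) {
        acc0 += data[i];
        acc1 += data[i + stride];
        acc2 += data[i + 2 * stride];
        acc3 += data[i + 3 * stride];
    }
    for (; i < elems; i += stride)   /* finish the remainder */
        acc0 += data[i];

    return (acc0 + acc1) + (acc2 + acc3);
}
```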
Writing Cache-Friendly Code
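The core rules are to make the common case fast and to maximize locality in the inner loops. A classic illustration, as a sketch: summing a 2-D array in row-major order (stride 1, good spatial locality) versus column-major order (stride N, a different cache line touched on almost every reference).

```c
#include <stdio.h>

#define N 1024

static int a[N][N];

/* Row-major traversal: stride-1 accesses, good spatial locality. */
long sum_rowwise(void)
{
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal: stride-N accesses; with 4-byte ints and
 * 64-byte lines, nearly every reference misses in a small cache. */
long sum_colwise(void)
{
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

int main(void)
{
    printf("%ld %ld\n", sum_rowwise(), sum_colwise());
    return 0;
}
```

Both functions compute the same sum; only the access order differs, and that order decides whether each cache-line fill is amortized over sixteen accesses or wasted on one.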
Beyond coding style, computer architects have a standard catalog of advanced cache optimizations (see Hennessy and Patterson), grouped by whether they attack hit time, bandwidth, miss penalty, or miss rate:

- Small and Simple First-Level Caches to Reduce Hit Time and Power
- Way Prediction to Reduce Hit Time
- Pipelined Access and Multibanked Caches to Increase Bandwidth
- Nonblocking Caches to Increase Cache Bandwidth
- Critical Word First and Early Restart to Reduce Miss Penalty
- Merging Write Buffer to Reduce Miss Penalty
- Compiler Optimizations to Reduce Miss Rate
- Hardware Prefetching of Instructions and Data to Reduce Miss Penalty or Miss Rate
- Compiler-Controlled Prefetching to Reduce Miss Penalty or Miss Rate
- Using HBM to Extend the Memory Hierarchy