"Order, Order, Order in the court " , Said the Speaker.
Cache coherence ensures that each processors see a consistent view of memory. what properties must be enforced among reads and writes to different caches by different CPUs? What we want is order!
Modern Computers
Modern CPUs run far faster than memory references, so a CPU that went directly to memory on every access would spend most of its time stalled, and its power could never be fully exploited.
Computer pioneers therefore, correctly, arrived at the memory hierarchy.
Memory Hierarchy
Data flows between main memory and the CPU's caches in fixed-length blocks called "cache lines". A cache line's size is normally a power of two, ranging from 16 to 256 bytes.
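As a quick sketch of what the power-of-two size buys us: the low bits of an address select the byte within the line, and the remaining bits identify the line. A hypothetical 64-byte line is assumed here for illustration.

```python
LINE_SIZE = 64  # bytes; assumed for illustration, must be a power of two

def split_address(addr):
    """Return (line number, byte offset within the line) for an address."""
    offset = addr & (LINE_SIZE - 1)              # low log2(LINE_SIZE) bits
    line = addr >> (LINE_SIZE.bit_length() - 1)  # remaining high bits
    return line, offset
```

For example, with 64-byte lines, addresses 0x0 and 0x13 fall in line 0, while 0x43 falls in line 1 at offset 3.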
When a given data item is first accessed by a given CPU, it will be absent from that CPU's cache, meaning a "cache miss". This is why most data-intensive applications need a warm-up period. A cache miss means that the CPU will have to wait (or be "stalled") for hundreds of cycles while the item is fetched from memory. However, the item will then be loaded into that CPU's cache, so that subsequent accesses will find it in the cache and therefore run at full speed.
As I mentioned in my last blog, we have so far considered only cases where a CPU reads a data item. You might wonder what happens when a CPU writes.
Aha, can't the CPU just fetch the item from the cache, process it, and store it back to the cache line?
Not quite. A modern CPU chip consists of multiple cores: each core owns its private L1 and L2 caches, while L3 is shared by all. When several cores reference the same data item, their private copies can disagree, which is a race condition waiting to happen.
So, much care must be taken to ensure that all CPUs maintain a coherent view of the data. With all this fetching, invalidating, and writing, it is easy to imagine data being lost or (perhaps worse) different CPUs having conflicting values for the same data item in their respective caches. These problems are prevented by "cache-coherence protocols", described next.
Cache-Coherence Protocols: MESI
States
Modified
A line in the “modified” state has been subject to a recent memory store from the corresponding CPU, and the corresponding memory is guaranteed not to appear in any other CPU’s cache. Cache lines in the “modified” state can thus be said to be “owned” by the CPU. Because this cache holds the only up-to-date copy of the data, this cache is ultimately responsible for either writing it back to memory or handing it off to some other cache, and must do so before reusing this line to hold other data.
Exclusive
The “exclusive” state is very similar to the “modified” state, the single exception being that the cache line has not yet been modified by the corresponding CPU, which in turn means that the copy of the cache line’s data that resides in memory is up-to-date. However, since the CPU can store to this line at any time, without consulting other CPUs, a line in the “exclusive” state can still be said to be owned by the corresponding CPU. That said, because the corresponding value in memory is up to date, this cache can discard this data without writing it back to memory or handing it off to some other CPU.
Shared
A line in the “shared” state might be replicated in at least one other CPU’s cache, so that this CPU is not permitted to store to the line without first consulting with other CPUs. As with the “exclusive” state, because the corresponding value in memory is up to date, this cache can discard this data without writing it back to memory or handing it off to some other CPU.
Invalid
A line in the "invalid" state is empty, in other words, it holds no data. When new data enters the cache, it is placed into a cache line that was in the "invalid" state if possible. This approach is preferred because replacing a line in any other state could result in an expensive cache miss should the replaced line be referenced in the future. Since all CPUs must maintain a coherent view of the data carried in the cache lines, the cache-coherence protocol provides messages that coordinate the movement of cache lines through the system.
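The four states above can be summarized in a small sketch. The state names come from the text; the helper function names are my own.

```python
from enum import Enum

class MESI(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

# "Owned" lines: the CPU may store to them without consulting other CPUs.
OWNED = {MESI.MODIFIED, MESI.EXCLUSIVE}

def must_write_back(state):
    """Only a Modified line holds the sole up-to-date copy, so only it must
    be written back (or handed off) before the line can be reused."""
    return state is MESI.MODIFIED

def may_store_silently(state):
    """Stores without first messaging other CPUs are allowed only for
    owned (Modified or Exclusive) lines."""
    return state in OWNED
```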
State Diagram
- (a) A cache line is written back to memory, but the CPU retains it in its cache and further retains the right to modify it.
- (b) The CPU writes to the cache line that it already had exclusive access to.
- (c) The CPU receives a "read invalidate" message for a cache line that it has modified. The CPU must invalidate its local copy, then respond with both a "read response" and an "invalidate acknowledge" message, both sending the data to the requesting CPU and indicating that it no longer has a local copy.
- (d) The CPU does an atomic read-modify-write operation on a data item that was not present in its cache. It transmits a "read invalidate", receiving the data via a "read response". The CPU can complete the transition once it has also received a full set of "invalidate acknowledge" responses.
- (e) The CPU does an atomic read-modify-write operation on a data item that was previously read-only in its cache. It must transmit "invalidate" messages, and must wait for a full set of "invalidate acknowledge" responses before completing the transition.
- (f) Some other CPU reads the cache line, and it is supplied from this CPU's cache, which retains a read-only copy, possibly also writing it back to memory. This transition is initiated by the reception of a "read" message, and this CPU responds with a "read response" message containing the requested data.
- (g) Some other CPU reads a data item in this cache line, and it is supplied either from this CPU's cache or from memory. In either case, this CPU retains a read-only copy. This transition is initiated by the reception of a "read" message, and this CPU responds with a "read response" message containing the requested data.
- (h) This CPU realizes that it will soon need to write to some data item in this cache line, and thus transmits an "invalidate" message. The CPU cannot complete the transition until it receives a full set of "invalidate acknowledge" responses. Alternatively, all other CPUs eject this cache line from their caches via "writeback" messages (presumably to make room for other cache lines), so that this CPU is the last CPU caching it.
- (i) Some other CPU does an atomic read-modify-write operation on a data item in a cache line held only in this CPU's cache, so this CPU invalidates it from its cache. This transition is initiated by the reception of a "read invalidate" message, and this CPU responds with both a "read response" and an "invalidate acknowledge" message.
- (j) This CPU does a store to a data item in a cache line that was not in its cache, and thus transmits a "read invalidate" message. The CPU cannot complete the transition until it receives the "read response" and a full set of "invalidate acknowledge" messages. The cache line will presumably transition to "modified" state via transition (b) as soon as the actual store completes.
- (k) This CPU loads a data item in a cache line that was not in its cache. The CPU transmits a "read" message, and completes the transition upon receiving the corresponding "read response".
- (l) Some other CPU does a store to a data item in this cache line, but holds this cache line in read-only state due to its being held in other CPUs' caches (such as the current CPU's cache). This transition is initiated by the reception of an "invalidate" message, and this CPU responds with an "invalidate acknowledge" message.
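The labelled transitions above can be encoded as a simple lookup table, which makes it easy to check that a sequence of events is legal. The labels and states are exactly those from the list; only the helper `apply` is mine.

```python
# (label) -> (old state, new state); states are M, E, S, I.
TRANSITIONS = {
    "a": ("M", "E"), "b": ("E", "M"), "c": ("M", "I"), "d": ("I", "M"),
    "e": ("S", "M"), "f": ("M", "S"), "g": ("E", "S"), "h": ("S", "E"),
    "i": ("E", "I"), "j": ("I", "E"), "k": ("I", "S"), "l": ("S", "I"),
}

def apply(state, label):
    """Follow transition `label` from `state`; reject transitions that do
    not start in the given state."""
    old, new = TRANSITIONS[label]
    if state != old:
        raise ValueError(f"transition {label} does not start from {state}")
    return new
```

For instance, a store to an uncached line follows (j) then (b): Invalid to Exclusive to Modified.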
How MESI Really Works
Let's start with a four-core CPU with on-chip caches.
- Initially, every cache line is empty, i.e., in the "invalid" state.
- CPU 0 loads data@0x0, and its copy's state becomes "Shared".
- CPU 3 loads data@0x0, and its copy's state also becomes "Shared".
- CPU 0 loads data@0x8, which forces data@0x0 out of its cache: the old line is invalidated, and data@0x8 is loaded with state "Shared".
- CPU 2 loads data@0x0, which is already cached in CPU 3, but CPU 2 will soon store to it, so it first invalidates the copy in CPU 3's cache, then loads an exclusive copy.
- CPU 2 stores to data@0x0 in its cache line, changing the state to "Modified".
- CPU 1 wants to increment data@0x0: it first snoops the data from CPU 2, invalidates CPU 2's cache line, then stores data@0x0 + 1 and marks its own copy "Modified".
- CPU 1 then needs data@0x8, so it evicts the "Modified" data@0x0 with a "writeback" message, transferring the cache line back to main memory, and loads data@0x8 marked "Shared".
However, the first write to a given cache line performs fairly poorly. As shown in the picture below, CPU 0 is about to write to data cached in CPU 1. Since CPU 0 must wait for the cache line to arrive before it can write to it, CPU 0 stalls for an extended period of time.
That's weird, because the data will be overwritten anyhow. There is no reason to stall for so long.