My knowledge of CPU architecture was pretty outdated, so I did some quick research on caches. Here's what I learned:
The cache loads data in units of cache lines; a cache line is 64 bytes across L1–L3 on recent Intel CPUs.
Sequential access and strided access let the hardware prefetcher kick in, so we can avoid cache misses even when we access more than one cache line.
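To make the prefetch point concrete, here's a minimal sketch of what I mean (plain C#, nothing Unity-specific, sizes just illustrative): both loops have a predictable address pattern, which is exactly what a hardware prefetcher can latch onto.

```csharp
using System;

// Minimal sketch: sequential and fixed-stride loops have predictable address
// patterns, so the hardware prefetcher can pull upcoming cache lines in before
// the loop asks for them. The array is sized well beyond L1/L2 so the access
// pattern, not the cache size, is what matters.
int[] data = new int[1 << 22];              // ~16 MB of ints

long seqSum = 0;
for (int i = 0; i < data.Length; i++)       // 4-byte stride: sequential
    seqSum += data[i];

long strideSum = 0;
for (int i = 0; i < data.Length; i += 16)   // 64-byte stride: one access per cache line,
    strideSum += data[i];                   // still a pattern the prefetcher can follow

Console.WriteLine(seqSum + strideSum);
```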
And here's what I don't quite understand:
What's the point of the 16 KB chunk size? Does it mean there would be no cache misses if we randomly access within a chunk? Or would there still be up to 16 KB / 64 B = 256 cache misses before the whole chunk is resident in the L1 cache?
No, we are not aiming for cache-miss-free random access. We are aiming to minimize cache misses and to process the data sequentially. That way the prefetcher will already be loading our next chunk of data, so cache misses will be rare.
No one promised zero cache misses under random access, AFAIK.
As Lurking-Ninja said, it's not about guaranteeing zero cache misses. That is impossible to achieve for a program as a whole. It's about massively reducing misses, and massively reducing the bandwidth wasted on loading memory into cache that is then never consumed.
16 KB means that if you have, for example, 4 components on an entity, you get roughly 62 cache lines per component per chunk, laid out linearly. That's a huge improvement over otherwise jumping randomly between cache lines.
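For what it's worth, the "roughly 62" figure falls out of simple arithmetic once you allow for some per-chunk overhead; the overhead number in the sketch below is my own assumption for illustration, which is why the thread says "roughly".

```csharp
using System;

// Back-of-the-envelope for "roughly 62 cache lines per component per chunk".
// The 128-byte overhead figure is an assumption for illustration; the real
// chunk also stores other per-chunk bookkeeping, hence "roughly".
const int chunkBytes = 16 * 1024;  // 16 KB chunk
const int lineBytes  = 64;         // cache line size
const int components = 4;          // components per entity in the example
const int overhead   = 128;        // hypothetical per-chunk header

int perComponent = (chunkBytes - overhead) / components;  // ~4 KB per component stream
int linesPerComp = perComponent / lineBytes;              // ~63 cache lines each
Console.WriteLine($"{linesPerComp} cache lines per component per chunk");
```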
The exact 16 KB chunk size was determined empirically on Megacity. Ultimately we do want it to be dynamic, at least by introducing the concept of tiny & large chunks. We have definitely noticed that at both extremes (huge numbers of entities, or very few entities of one archetype) efficiency could be improved.
Thanks for the clarification. There are still quite a few points about the cache that I don't fully understand.
Take the random access within a chunk from the title as an example: if an IJobChunk randomly accesses the data in a chunk 100k times (an iterative algorithm on a graph, maybe), is it mostly right that we would only have about 256 cache misses during the job's execution?
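Here's a little sketch of my reasoning (plain C#, nothing Unity-specific), assuming the whole 16 KB chunk stays resident in cache for the duration of the job, so only the first touch of each 64-byte line can miss:

```csharp
using System;
using System.Collections.Generic;

// A 16 KB chunk spans 16384 / 64 = 256 distinct cache lines. However randomly
// we index into it, only the first touch of each line can miss, so the miss
// count is bounded by 256 (assuming nothing evicts the chunk mid-job).
var rng          = new Random(42);
var chunk        = new byte[16 * 1024];
var touchedLines = new HashSet<int>();

for (int i = 0; i < 100_000; i++)
{
    int offset = rng.Next(chunk.Length);  // random access within the chunk
    touchedLines.Add(offset / 64);        // which 64-byte line did that land in?
}

Console.WriteLine(touchedLines.Count);    // never exceeds 256
```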
I have another question here about cache prefetching (strided prefetch).
Assume we store 100K elements (128 bytes each) in an array. Is it right that if we loop over the array accessing a single int member of each element, the cache will basically not miss, since we are accessing addresses at a fixed stride?
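To spell out the pattern I'm asking about (plain C#, the names are made up): with 128-byte elements, reading one int per element is a constant 128-byte stride, so every iteration lands in a different cache line. My understanding is that a stride prefetcher can hide much of the latency of such a loop, but each 64-byte line still has to be fetched for only 4 useful bytes.

```csharp
using System;
using System.Runtime.InteropServices;

// The access pattern from the question (names are hypothetical). Each element
// is 128 bytes, so reading one int per element is a constant 128-byte stride:
// every iteration touches a different 64-byte cache line, i.e. ~100K lines
// (~6.4 MB) are pulled in for ~0.4 MB of ints actually used. A stride
// prefetcher can typically hide much of the latency, but the lines still have
// to be fetched.
[StructLayout(LayoutKind.Sequential, Size = 128)]
struct Element128           // hypothetical 128-byte element
{
    public int Key;         // the one member the loop reads
    // ...the remaining 124 bytes of payload
}

class StrideQuestion
{
    static void Main()
    {
        var elements = new Element128[100_000];

        long sum = 0;
        for (int i = 0; i < elements.Length; i++)
            sum += elements[i].Key;   // fixed 128-byte stride through memory

        Console.WriteLine(sum);
    }
}
```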
I think in general the book is a good one, but the Unity people might have a reason for not recommending it (otherwise it could be referenced in the docs, for example, IMHO).
You won't agree with 100% of what he says, and some of it might feel radical… but it's still good content.