ECS/DOTS is designed to take advantage of cache sizes, turning processing batch-based and multi-threaded so work can be spread across multiple cores at super fast speeds.
Could we end up with entire game worlds fitting within a CPU's cache?
Will DOTS' 16 kB chunk limit actually start to be a bottleneck on CPUs with large caches?
Will DOTS even be needed once caches exceed a certain size, or will it be vital for close-to-the-metal processing, e.g. by matching the L1 cache and data registers?
This is pure speculation. Nobody’s even seen the hardware yet.
But I can also baselessly invent tons of reasons why DOTS will remain beneficial for decades.
Performance is just one aspect of DOTS. Others include improvements to code safety/readability/maintainability/testability/modularity/reusability. All of the ities!
Data locality is just one aspect of DOTS performance. Others include improved parallelism, vectorization, reduction of branch misses, and more optimal data allocation/deallocation (e.g. via arena allocators). It is entirely possible that future hardware will be even more sensitive to those (especially parallelism! come back in ten years and tell me how many cores your machine has).
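As a concrete taste of the parallelism + vectorization side, here is a minimal sketch of a Burst-compiled parallel job. It assumes the Unity.Jobs, Unity.Collections and Unity.Burst packages; the job and field names are made up for illustration:

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

// Minimal sketch: the job system splits the index range across worker
// cores, and Burst will typically auto-vectorize the branch-free body.
[BurstCompile]
public struct IntegratePositionsJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<float> Velocities;
    public NativeArray<float> Positions;
    public float DeltaTime;

    public void Execute(int index)
    {
        Positions[index] += Velocities[index] * DeltaTime;
    }
}

// Usage, e.g. from a system's OnUpdate (names hypothetical):
// new IntegratePositionsJob { Velocities = v, Positions = p, DeltaTime = dt }
//     .Schedule(p.Length, 64)   // process in 64-element batches per worker
//     .Complete();
```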
Adoption of future hardware will likely be a slow process, and it will take many years/decades for it to make it into low-end PCs, mobile platforms, consoles, etc.
Improving data and code locality could still be worth it even with future caches. Hardware gets better, games get more demanding. Such is life.
For small games, that is already the case. For large games, no. The game worlds are just going to get bigger to compensate.
A page is typically 4 kB, so entities with lots of components should probably live in chunks of 64 kB or so: hardware prefetching gets killed at page boundaries, and big entities burn through a 16 kB chunk in very few iterations.
L2 and higher are still at least an order of magnitude slower than registers, sometimes even two orders of magnitude. Even the cache needs to be cached, so many of the DOTS principles will still apply. Another thing is that people severely underestimate the power of hardware prefetching (which is triggered by accessing successive cache lines): 6% cache efficiency with hardware prefetching still typically beats 100% cache efficiency while jumping around random cache lines.
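A quick way to see that effect for yourself, as a minimal plain-C# sketch: sum the same array twice, once in order and once in a shuffled order.

```csharp
using System;
using System.Diagnostics;

// Minimal sketch: on typical hardware the sequential pass wins by a wide
// margin because the prefetcher recognizes the access pattern.
class PrefetchDemo
{
    static void Main()
    {
        const int n = 1 << 24;                 // ~16M ints, far larger than any cache
        var data = new int[n];
        var order = new int[n];
        var rng = new Random(42);
        for (int i = 0; i < n; i++) order[i] = i;
        for (int i = n - 1; i > 0; i--)        // Fisher-Yates shuffle
        {
            int j = rng.Next(i + 1);
            (order[i], order[j]) = (order[j], order[i]);
        }

        long sum = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++) sum += data[i];        // sequential: prefetch-friendly
        Console.WriteLine($"linear: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        for (int i = 0; i < n; i++) sum += data[order[i]]; // random jumps: cache-hostile
        Console.WriteLine($"random: {sw.ElapsedMilliseconds} ms ({sum})");
    }
}
```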
Data-oriented design arranges your data so that what used to be one instance field of an object-oriented class is packed next to its counterparts from all the other instances, enabling vectorization a.k.a. SIMD processing (“add the value ‘1’ to 16 “daysOld” integers in one hardware instruction”, for instance).
If a whole game fits in cache there won’t be many cache misses anymore, true, but with an object-oriented layout your data can still only be processed in a scalar fashion, which additionally produces address-calculation overhead per data instance.
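Here is a minimal sketch of that daysOld idea in plain C# (the AgeKernel/AgeAll names are invented for illustration), assuming the integers sit contiguously in one array:

```csharp
using System.Numerics;

public static class AgeKernel
{
    // One Vector<int> add updates Vector<int>.Count elements
    // per hardware instruction.
    public static void AgeAll(int[] daysOld)
    {
        int width = Vector<int>.Count;        // e.g. 8 with AVX2, 16 with AVX-512
        var one = new Vector<int>(1);
        int i = 0;
        for (; i <= daysOld.Length - width; i += width)
        {
            var batch = new Vector<int>(daysOld, i);
            (batch + one).CopyTo(daysOld, i); // one SIMD add per batch
        }
        for (; i < daysOld.Length; i++)
            daysOld[i] += 1;                  // scalar tail for the leftovers
    }
}
```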
And by the way: an operating system usually performs context switches about 1000 times per second. A context switch basically pauses the game and executes other processes. This can still lead to cache evictions; it’s not about whether or not your entire game fits in cache.
It is already a non-optimal value depending on your actual workload, never mind the CPU. At some point, they could make it configurable so we could test it.
It does seem like micro-micro optimisations at this point.
The biggest optimisation problem I have with DOTS is that I don’t feel confident enough to invest in it in its present state. I need it finished, documented, battle-tested. All of these are much more important to me. I am no longer young enough for leaps of faith.
I think the important thing here is the L1 cache, but that also varies greatly between CPUs: AMD Ryzen lists 384 KB (a total across cores, not per core), Intel Skylake has 32 KB of L1 data cache per core, some ARM cores 80 KB. One thing they all have in common: the sizes are multiples of 16.
So we will need GOPIMs, with Game-Oriented Processing-in-Memory features, e.g. SIMD and vector instructions.
What about DirectX graphical PIM features, so that GPUs can offload some of their work to memory?
Amazed that this is not build- or runtime-dependent, e.g. automatically configured for the hardware it’s running on?
I don’t know if it is a good idea to make a chunk exactly the same size as the L1 cache, since there might not be any space left for other data that you need. And if the chunks are contiguous in memory, multiple chunks can be loaded into the cache anyway.
AFAIK cache size has nothing to do with chunk size - maybe I’m missing something. I always thought it was just a pre-allocation optimization, anticipating dynamic entity instantiation, with a trade-off in regards to wasted RAM (that’s why they are that… “small”).
Data will be loaded into L1 either way. Similarly, data will be evicted from the cache hierarchy either way. Meaning: The hardware has full control over what is in cache and what isn’t and has no concept of chunks, even going as far as hardware prefetching loading in data outside of the bounds of a chunk.
Since DOTS is made for iterating over all of your data in one go and since chunks should really not be something the programmer even needs to know anything about, you’re not expected to have an entire chunk and only that chunk in cache at any time (which is also not guaranteed to be the case even if you tried).
So in my mind chunk size is a trade-off between the number of cache misses when iterating over your data and memory usage overhead, especially considering the existence of singletons. But then again: maybe I’m missing something.
I think Unity has already said several times that the chunk size was optimized for the cache. Chunks also have a header that is exactly 64 bytes big, to fit exactly into a cache line.
It’s about memory optimization. If the entire chunk fits into the cache, you can ideally iterate over all elements in the chunk without cache misses. As soon as the chunk size exceeds the cache size, only part of it can be loaded at once, and you can run into cache misses before the iteration over the chunk is complete.
Chunk size has nothing to do with cache size. It was just chosen as a middle ground, not too small and not too big. The point of ECS is to iterate only over the components a system wants to use, and that’s usually only a few of the components in a chunk.
Thank you for reassuring me that I know what I’m talking about
… Because that just doesn’t make sense. As said previously, cache is completely under the control of the hardware. You cannot “load a chunk into L1”; data is loaded into cache line by line, 64 bytes (usually) at a time. And apart from that, since ECS is all about iteration, there should be no need to access a particular component twice, meaning there is no use in having an entire chunk in cache. Even if there were, your data is never exactly the size of a single chunk: either you waste 90% of your cache (if that theory were right), or you need to evict a chunk from cache and load in another. Again, if it worked that way, which it totally doesn’t; not only when it comes to a single program, but also when context switches come into play.
What IS true, though: the smaller the chunks, the more cache misses you’ll have when iterating over a component array. So the only logical reason (that I see) to have chunks in the first place is to avoid thousands of malloc calls per frame, since ECS generally moves data around in memory a lot.
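A minimal sketch of that amortization idea (the Chunk type and its members are invented for illustration, not Unity's actual internals):

```csharp
using System;

// One 16 KB block is allocated once; entity data is then bump-allocated
// into it instead of doing one heap allocation per entity.
public sealed class Chunk
{
    public const int Size = 16 * 1024;
    private readonly byte[] buffer = new byte[Size];
    private int used;

    // Reserve `bytes` inside the chunk; returns the offset, or -1 if full
    // (the caller would then grab a fresh chunk).
    public int Allocate(int bytes)
    {
        if (used + bytes > Size) return -1;
        int offset = used;
        used += bytes;
        return offset;
    }

    public Span<byte> Slice(int offset, int length) =>
        buffer.AsSpan(offset, length);
}
```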
A chunk doesn’t hold the data of one component type, but all the components of the entities belonging to that archetype.
With each iteration you have all the components of an entity at your disposal; as long as you don’t access external data in the loop that isn’t in the cache, you don’t get cache misses.
As I said, Unity’s ECS doesn’t work that way, and data is only moved when there are structural changes.
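To picture what that means, a minimal sketch of an archetype-style chunk (MoveableChunk and its members are invented names, not Unity's actual internals):

```csharp
public struct Position { public float X, Y, Z; }
public struct Velocity { public float X, Y, Z; }

// Entities sharing the same component set (archetype) live in the same
// chunk, with each component type packed into its own array.
public sealed class MoveableChunk // archetype: Position + Velocity
{
    public const int Capacity = 128;
    public readonly Position[] Positions = new Position[Capacity];
    public readonly Velocity[] Velocities = new Velocity[Capacity];
    public int Count;

    // A system that only touches Position iterates Positions[0..Count)
    // and never pulls Velocities into cache. A structural change (adding
    // or removing a component type) moves the entity to a chunk of a
    // different archetype; plain value changes stay in place.
    public void Integrate(float dt)
    {
        for (int i = 0; i < Count; i++)
        {
            Positions[i].X += Velocities[i].X * dt;
            Positions[i].Y += Velocities[i].Y * dt;
            Positions[i].Z += Velocities[i].Z * dt;
        }
    }
}
```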