(source) Based on an Intel Kaby Lake i7-7660U CPU running at 2.5 GHz.
So the largest performance boost from DOTS/ECS comes from greatly reducing the need to spend ~100 ns fetching information from main memory; ideally that need should be almost zero.
So in theory the potential gains come from maintaining L1-L2 cache flow and keeping access times down in the 1-10 ns range. L3 cache access would double that to ~20 ns, which would limit DOTS to being about 2x faster (a 50% reduction in access time) for the times the CPU's cache prediction fails to have the data ready in time.
TL;DR: the DOTS/ECS 10x* performance boost could be reduced to 2x**.
* 10x based on L1-L2 access times of ~10 ns and RAM access times of about 100 ns.
** 2x based on L1-L2 access times of ~10 ns and L3 access times of ~20 ns.
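Spelled out, using the latency figures above:

$$\text{speedup} \approx \frac{t_{\text{RAM}}}{t_{\text{L1-L2}}} = \frac{100\text{ ns}}{10\text{ ns}} = 10\times, \qquad \frac{t_{\text{L3}}}{t_{\text{L1-L2}}} = \frac{20\text{ ns}}{10\text{ ns}} = 2\times$$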
It's not really feasible to control cache use directly; instead it is better to structure your access patterns to give the CPU design an opportunity to do its thing, should it favour doing that thing. These are complex beasts these days and are seldom just running our own programs.
Remember, DOTS has to work across many different CPUs, so it wouldn't be a good idea to get completely specific.
Good point, yet ECS for Unity was first demoed around 2017; at that time AMD CPUs had 16 MB of L3 cache, whereas in 2022 a similar CPU has 32 MB of L3 cache.
Each boost in cache size reduces the performance benefit of DOTS/ECS vs non-DOTS code on larger-cache CPUs.
It would be interesting to benchmark DOTS vs Mono on a range of CPUs to see the impact of cache size on the performance delta.
I have already looked into the workload-vs-performance aspect of DOTS a little, and at lower workloads that already fit within the CPU's caches there is a negligible performance advantage; sometimes the overheads of using DOTS outweigh the performance boost.
[Benchmark chart: green bars = DOTS, which failed to render graphics, so it looks faster than it is.]
This was using my rather old, smaller-cache CPU, so more modern CPUs should find that they only start to benefit from DOTS once they have thousands of data items (>1k) to process.*
So a 96 MB L3 cache CPU would probably need to be throwing around workloads of 10k+ data items to gain a large enough benefit from DOTS to be worth it.*
* Rough workload sizes, based on a very simple demo that moves things around, e.g. position/float data.
Yes and no. It's both the size of the memory and its speed that speed up workloads that fit within that memory.
DOTS also goes out of its way to ensure that its workloads are packed together so they fit well into memory, whereas normal OO data can be anywhere in memory and takes more work to bring together for processing.
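To make "packed together" concrete, here is a minimal C++ sketch (an illustration of the general idea, not Unity's actual internals; the struct sizes are made up):

```cpp
#include <vector>

// Typical OO layout: each object is heap-allocated somewhere, and the
// update loop chases a pointer per object, dragging a "fat" object into
// cache just to touch three floats.
struct GameObjectOO {
    float x, y, z;   // position (the only data this update needs)
    char rest[244];  // stand-in for unrelated fields pulled in alongside it
};

// ECS-style layout: only the data the system needs, packed contiguously,
// so each 64-byte cache line carries several entities' worth of values.
struct Positions {
    std::vector<float> x, y, z;  // structure-of-arrays (SoA)
};

void MoveAllOO(std::vector<GameObjectOO*>& scattered, float dx) {
    for (GameObjectOO* go : scattered)  // pointer chase: likely a miss per object
        go->x += dx;
}

void MoveAllECS(Positions& p, float dx) {
    for (float& x : p.x)                // linear walk: prefetcher keeps cache fed
        x += dx;
}
```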
Having a larger L3 cache means it is more likely that all of the data is readily available and faster to process, regardless of whether DOTS is used or not.
DOTS reduces the chance of a cache miss (by packing the data together); a cache miss is where the program needs data that is not in a cache and has to wait about 100 nanoseconds or more for the data to be fetched from main memory.
This is also the area where DOTS provides the largest boost in performance, as L1-L2 caches are about 10x faster than main memory, while L3 cache is around 5x faster; hence a larger L3 cache will boost performance on larger workloads, negating the performance advantage of DOTS to some degree.
Doesn't that mean DOTS will have more data to work with? Larger cache == larger chunks == more SIMD calls before a cache miss.
I think you are missing the point. Yes, you can have fewer cache misses in OOP with larger caches, but with DOTS fewer cache misses are practically guaranteed.
EDIT:
Did you test more fragmented memory in the OOP test? How can you tell the difference when all the data you need is in the same cache? DOTS has an advantage when it isn't, and OOP can have cases where your data isn't in the same cache line even if it's smaller.
Both statements cannot be true at the same time. I for one go with AMD's claim of a 15% performance improvement, and not your speculated 500% performance improvement.
Both statements can be true if they are about different things.
The 15% is based on AMD's benchmarking of existing games (Watch Dogs Legion +36%, Far Cry 6 +24%, Gears 5 +21%, Final Fantasy XIV +16%, …) vs the 5900X.
The 200% (2x) is the theoretical performance boost of DOTS on CPUs where the workload fits within the L3 cache. This will vary by CPU hardware and is dependent on the size and speed of its caches as well as its SIMD instructions.
ECS is currently 10x faster (=1000%) for workloads designed for it. You are speculating that this advantage in the same workloads goes down to 2x (=200%), which makes the new X3D CPU 5x faster (500%) than the current CPU. I cannot combine this with the 15% average increase AMD claims.
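For clarity, the arithmetic behind that objection: if DOTS's advantage over the same non-DOTS workload shrinks from 10x to 2x purely because of the bigger cache, the non-DOTS baseline must have sped up by

$$\frac{10\times}{2\times} = 5\times = 500\%$$

which is hard to square with the ~15% average uplift AMD measured.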
Again, this is a futile conversation, as everybody knows by now that you do not like ECS… Do you think your mission to convince all those clever people at Unity, Mike Acton, Joachim (the CTO!), that they are all wrong will succeed? You have made your point plenty of times already; this is just spamming the forum.
Honest question: can we realistically expect "relevant"/"big" games to have no more data than 96 MB? (Also, what is "data" here?)
What portion of commercial games fits that requirement?
I understand we can more or less ignore textures and meshes (which usually take up most of the memory), given that they usually are not interacted with much on the CPU. But we still have big chunks of data to fit inside the memory: namely localization, navigation data, asset databases, etc. This quickly adds up and leaves not much room for actual game simulation data.
If we assume we have 60 MB left for game data (which might be too generous already), and we have entities weighing 256 bytes on average, that's roughly 250K entities.
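Working that out with the figures above:

$$\frac{60 \times 2^{20}\ \text{bytes}}{256\ \text{bytes/entity}} = 245{,}760 \approx 250\text{K entities}$$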
I know that it is a very rough and imperfect calculation, but it is already a pretty small entity count. ECS is made for handling millions of entities (and our computers often have 8-32 GB of RAM, feeding the dream of huge simulations).
In conclusion, I doubt we could realistically expect games to fit that requirement, as a design pattern or rule of thumb.
If so much game data exists in game systems, then surely it will negatively impact DOTS performance even in Unity.
Or, if the same amount of game data is used in an OOP game and a DOTS game, then the only performance difference would be the cache-size-dependent performance of DOTS.
Personally I don't know how much the tested games and their engines use SoA data, SIMD instructions and multithreading for performance, so we don't know if they use DOTS/Burst/Jobs-like systems within their code bases.
We do know that a 15% improvement in performance is a good boost and very near the +20% we have been seeing between CPU generations recently.
What we need is a Unity game, preferably functionally and quality-equivalent to the benchmarked games, that can have DOTS/OOP modes toggled for benchmarking.
The main point of DOTS is to force devs to think in a DOD way and actually gather related pieces of data together so that they can be pulled by the CPU into cache (whatever the cache size is) and operated on quickly, because this favors cache hits and limits cache misses.
On the other hand, OOP favors what could be called a "human-readable design" (this is arguable for many devs, though) and forgets about data layout (which is then the responsibility of the devs, thus leading to disasters).
Anyway, whatever performance improvements a new CPU can bring will automatically benefit OOP as well as DOTS, as well as any other programming paradigm for that matter.
Well, it's debatable that ALL of DOTS is designed to force devs to think in a DOD way, as that'd mostly be ECS; you can use Burst and Jobs outside of it. But I get the meaning in your context.
Another thing is that we're not required to move away from MonoBehaviour to take advantage of any further architecture or work Unity does. Some people might favour OOP a lot, and frankly some code is plainly much better in OOP land. Being able to write all sorts of code, with DOTS generally being the platform, is a great thing.
I don't think a lot changed: cache size tripled, RAM quadrupled, CPU core count keeps increasing. Yes, very small (non-Unity) games might benefit a little from this when not optimizing memory access patterns. When actually using the memory, the cache size increase only maintains the status quo.
The rendering pipeline alone might access more than 96 MB per frame when syncing data, filling command buffers, executing culling, and streaming textures, thus evicting all your game-state data from cache, which is then refetched slowly, with one cache miss per object.
DOTS has cache advantages from its linear access pattern, meaning it will not have a lot of cache misses even when the memory is not yet in cache, as it is prefetched automatically by the CPU.
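A rough C++ microbenchmark sketch of that effect (hypothetical and self-written; absolute numbers will vary with CPU and cache size): summing the same 64 MB of data in linear order versus shuffled order. The linear walk lets the hardware prefetcher hide memory latency; the shuffled walk defeats it.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const size_t n = 16u * 1024 * 1024;        // 16M ints = 64 MB, larger than
    std::vector<int> data(n, 1);               // typical L3 (though not a 96 MB X3D)

    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), 0);  // 0, 1, 2, ... (linear order)

    auto run = [&](const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (size_t i : order) sum += data[i]; // walk the data in the given order
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("%s: sum=%lld in %lld ms\n", label, sum, (long long)ms);
    };

    run("linear  ");                           // prefetcher-friendly, DOTS-like
    std::shuffle(order.begin(), order.end(), std::mt19937{42});
    run("shuffled");                           // mostly cache misses, scattered-OO-like
    return 0;
}
```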