I attended a session at Unite the other day (Intrinsics: low-level engine development with Burst), and one of the key points the speaker made was about how best to write SIMD code. Rather than just using a float3 the same way as a Vector3 and expecting that to be magically faster, we should unroll loops into small batches and write the SIMD code to execute the whole batch at once.
This makes sense in principle. However, how should this actually be implemented within an IJobForEach in the ECS? In this context the code only has access to a single entity's data, so as far as I can tell there's no way to go wide with SIMD across batches of entities. Is there a variant of IJobForEach that passes 4 entities to every Execute call?
At the moment the only way I can see to implement this is to fall back to using an IJobChunk, but that’s not nearly as ergonomic as an IJobForEach!
I remember one of the Schedule() overloads taking a “batch size” param that defined how many items each parallel job took… has that been removed?
The batch size param is still there as far as I know, but it doesn’t give you access to multiple entities within a single Execute call. It just configures the internal batch size of the job scheduler.
First off, don’t optimize prematurely. IJobForEach jobs are probably not your bottleneck. Most likely your game will have an n log n or n^2 algorithm where this kind of thing matters more and is also easier to optimize with float4s.
Second, Burst can utilize LLVM’s auto-vectorizer, so check first whether the code is already being vectorized. If it isn’t, write your code as if you were doing one of the four iterations of an unrolled loop (operating on floats instead of float4s), and that might get the vectorizer to kick in.
It’s not really premature in this case. We’re building this for sale on the Asset Store and speed will be an important selling point - I can’t profile every game that will use the asset, so I have to optimise it as much as possible ahead of time. I think around half of the work we’re doing can be expressed with SIMD, which is pretty significant!
I’ve already checked the disassembly, and the auto-vectoriser doesn’t manage to vectorise it. That’s pretty much what I expected: there’s quite a lot of work being done in a single iteration, so I didn’t really expect it to untangle all that and vectorise it.
All that aside, it just seems like a sensible idea for Unity to design their API in a way that doesn’t make SIMD near impossible to exploit!
The intrinsics are one part of the puzzle. When they ship, you’ll be able to use them on your own data structures. To get the full magic going we will need to store (some) ECS data in “packets of 4/8” form. We have that on our roadmap.