As a simple example, say I have a parallel-for job that adds velocities to positions.
Would it be advisable to feed the job a NativeArray of float3 positions and a NativeArray of float3 velocities, or perhaps a NativeArray of structs with the following setup?
For best practice, you should use dynamic buffers so you don’t have to sync with the main thread too much. You should also keep your data as separated as possible, unlike the struct you just posted, because some jobs might need positions only while other jobs might need both positions and velocities.
Then the best performance would be achieved with two separate NativeArrays of float4 (instead of float3, for alignment/size reasons), essentially doing:
pos[i] = pos[i] + vel[i];
While you are using 33% more memory with a float4 vs a float3, the processing should be faster: the data is naturally 16-byte aligned, so you can operate on it without any shuffling and take full advantage of the 128-bit SIMD instructions.
Thanks for the responses. I’m trying to understand how caching actually works.
If the job has to get a position from the position array and then a velocity from the velocity array,
won’t this cause cache misses as it has to switch between arrays?
That’s why I’m wondering: if you put them together in one array, say by interleaving the position and velocity data, would you get better performance?
@Robber33
Unfortunately I’m a bit unclear on this point myself. I believe the processor’s hardware prefetcher can detect that you are reading from two separate streams of contiguous memory and will prefetch from both arrays, but honestly I’m not certain about that. There is also the AoSoA (array of structures of arrays) data layout, which is kind of the best of both worlds with regard to SIMD friendliness and cache locality.
In AoSoA, you would use the float4 not as a drop in replacement for a point/vector (x,y,z,w) but instead as a way of just representing a more general 4 floats. In memory an array of these objects would look like:
x x x x vx vx vx vx y y y y vy vy vy vy z z z z vz vz vz vz
More concretely, you get a structure like this:
struct PosAndVelocity4 {
    float4 x;
    float4 vx;
    float4 y;
    float4 vy;
    float4 z;
    float4 vz;
}
...
points[i].x = points[i].x + points[i].vx; // Adds 4 x's in one instruction
points[i].y = points[i].y + points[i].vy; // Adds 4 y's in one instruction
points[i].z = points[i].z + points[i].vz; // Adds 4 z's in one instruction
Working with memory in this format is a bit more complicated, and it’s not clear to me whether it is better to store the data in that format permanently, or to have a preprocessing step that takes NativeArray<float4> positions and NativeArray<float4> velocities and combines them into a NativeArray<PosAndVelocity4> when you need to do SIMD-heavy processing on it.
That’s TCM (tightly coupled memory, which is becoming much less common), not cache. Cache works in small segments of memory called cache lines, and many cache lines are mapped to many different spots in memory at once. Look up 4-way/8-way set-associative mapping to get a better feel for how this works.