Burst/Jobs best practices for data layout

As a simple example, say I have a ParallelFor job that adds velocities to positions.

Would it be advisable to feed the job a NativeArray of float3 positions and a NativeArray of float3 velocities, or perhaps a NativeArray of structs with the following layout?

struct {
float x;
float vx;
float y;
float vy;
float z;
float vz;
}

Or is this something the Burst compiler will optimise for me?

For best practice, you should use dynamic buffers so you don’t have to sync with the main thread too much. You should also keep your data as separated as possible, not combined like the struct you just posted. That is because some jobs might need positions only, while other jobs might need both positions and velocities.
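To make the separation concrete, here is a minimal sketch in plain C (the layout question is language-agnostic; the helper names `translate_all` and `integrate` are made up for illustration). With positions and velocities in separate arrays, mirroring two NativeArray&lt;float3&gt;s, a job that only needs positions never pulls velocity data into the cache:

```c
#include <stddef.h>

/* SoA-style layout: positions and velocities in separate arrays. */
typedef struct { float x, y, z; } float3;

/* Touches positions only -- velocity memory is never loaded. */
static void translate_all(float3 *pos, size_t n, float3 offset) {
    for (size_t i = 0; i < n; i++) {
        pos[i].x += offset.x;
        pos[i].y += offset.y;
        pos[i].z += offset.z;
    }
}

/* Touches both arrays, streaming each one linearly. */
static void integrate(float3 *pos, const float3 *vel, size_t n, float dt) {
    for (size_t i = 0; i < n; i++) {
        pos[i].x += vel[i].x * dt;
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
    }
}
```

A job like `translate_all` benefits directly from the split: with an interleaved struct it would stream twice as much memory for the same work.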

The other thing to keep in mind is how this all maps to SIMD. If you are just component-wise doing this:

pos[i].x = pos[i].x + vel[i].x;
pos[i].y = pos[i].y + vel[i].y;
pos[i].z = pos[i].z + vel[i].z;

Then the best performance would be achieved with two separate NativeArrays of float4 (instead of float3, for alignment/size reasons), essentially doing:

pos[i] = pos[i] + vel[i];

While you are using 33% more data with a float4 vs a float3, the processing should be faster, as you can operate on the whole element at once without having to do any shuffling of the data to take advantage of the 128-bit instructions.
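As a rough illustration of why the padding matters, here is a plain C sketch (illustrative only, not actual Burst output): with each element padded to 16 bytes, a vectorizing compiler can turn the loop body into a single 128-bit load/add/store per element, with no lane shuffling.

```c
#include <stddef.h>

/* A float3 padded out to 16 bytes. At this size/alignment each element
   maps cleanly onto one 128-bit SIMD register. */
typedef struct { float x, y, z, w; } float4;

static void add_velocities(float4 *pos, const float4 *vel, size_t n) {
    for (size_t i = 0; i < n; i++) {
        /* Conceptually pos[i] = pos[i] + vel[i]: one vector add. */
        pos[i].x += vel[i].x;
        pos[i].y += vel[i].y;
        pos[i].z += vel[i].z;
        pos[i].w += vel[i].w;  /* padding lane; result unused */
    }
}
```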


Thanks for the responses. I’m trying to understand how caching actually works.

If the job has to get a position from the position array and then a velocity from the velocity array, won’t this cause cache misses as it switches between arrays?

That’s why I’m wondering: if you put them together in one array, say by interleaving the position and velocity data, would you get better performance?

@Robber33
Unfortunately I’m a bit unclear on this point myself. I believe the processor can detect that you are reading from two contiguous regions of memory and will prefetch from both arrays, but honestly I’m not sure about that. There is also the AoSoA data layout, which is kind of the best of both worlds with regard to data layout and cache locality.

In AoSoA, you would use the float4 not as a drop-in replacement for a point/vector (x, y, z, w), but as a way of representing four general floats. In memory, an array of these objects would look like:
x x x x vx vx vx vx y y y y vy vy vy vy z z z z vz vz vz vz

More concretely, you get a structure like this:

struct PosAndVelocity4 {
  float4 x;
  float4 vx;
  float4 y;
  float4 vy;
  float4 z;
  float4 vz;
}
...

points[i].x = points[i].x + points[i].vx;  // Adds 4 x's in one instruction
points[i].y = points[i].y + points[i].vy;  // Adds 4 y's in one instruction
points[i].z = points[i].z + points[i].vz;  // Adds 4 z's in one instruction

Working with memory in this format is a bit more complicated, and it’s not clear to me whether it is a good idea to store the data in that format permanently, or just to have a preprocessing step that takes NativeArray&lt;float4&gt; positions and NativeArray&lt;float4&gt; velocities and combines them into a NativeArray&lt;PosAndVelocity4&gt; when you need to do SIMD-heavy processing on it.
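Here is a rough C sketch of what such a preprocessing step might look like (the `pack` helper and its signature are made up for illustration; it assumes `n` is a multiple of 4 for brevity):

```c
#include <stddef.h>

typedef struct { float v[4]; } float4;  /* four independent lanes */

/* AoSoA block: each field holds one component for 4 particles. */
typedef struct {
    float4 x, vx;
    float4 y, vy;
    float4 z, vz;
} PosAndVelocity4;

/* Pack per-component arrays into AoSoA blocks (n % 4 == 0 assumed). */
static void pack(const float *x, const float *y, const float *z,
                 const float *vx, const float *vy, const float *vz,
                 size_t n, PosAndVelocity4 *out) {
    for (size_t b = 0; b < n / 4; b++) {
        for (int l = 0; l < 4; l++) {
            size_t i = b * 4 + l;
            out[b].x.v[l] = x[i];  out[b].vx.v[l] = vx[i];
            out[b].y.v[l] = y[i];  out[b].vy.v[l] = vy[i];
            out[b].z.v[l] = z[i];  out[b].vz.v[l] = vz[i];
        }
    }
}

/* The update itself: each inner loop adds 4 lanes, which a vectorizing
   compiler can emit as one 128-bit add. */
static void step(PosAndVelocity4 *pts, size_t blocks) {
    for (size_t b = 0; b < blocks; b++) {
        for (int l = 0; l < 4; l++) {
            pts[b].x.v[l] += pts[b].vx.v[l];
            pts[b].y.v[l] += pts[b].vy.v[l];
            pts[b].z.v[l] += pts[b].vz.v[l];
        }
    }
}
```

Whether the packing cost pays off depends on how many SIMD-heavy passes you run over the packed data before you need it back in the original layout.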

I found the following presentation very insightful: https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf
In particular, the section starting on page 38

That’s TCM (tightly coupled memory, which is becoming much less common), not cache. A cache works on small segments of memory, and there are many cache lines mapped to many different spots in memory at once. Look up 4/8-way set-associative mapping to get a better feel for how this works.
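As a toy illustration of set-associative mapping (the cache geometry below is purely illustrative, not any particular CPU): lines from two different arrays can be resident at the same time, even when they map to the same set, because each set holds several ways.

```c
#include <stdint.h>

/* Illustrative geometry: 64-byte lines, 8 ways, 512 sets
   -> 64 * 8 * 512 = 256 KiB total cache. */
enum { LINE = 64, WAYS = 8, SETS = 512 };

/* A line containing address addr lands in set (addr / LINE) % SETS.
   Up to WAYS lines that map to the same set can be cached at once,
   so streaming two arrays in parallel need not evict one with the
   other. */
static unsigned cache_set(uintptr_t addr) {
    return (unsigned)((addr / LINE) % SETS);
}
```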
