Burst, SIMD and float3 / float4 - best practices

Hi, I am preparing a tutorial on the Job System and Burst. I did some simple Burst performance measurements using different approaches to float3/float4 handling in terms of SIMD. I was always confused when writing algorithms that mix these two types: I wasn't sure whether extra operations get added due to swizzles, data loading, etc.

Although it is a simple synthetic test, maybe some of you will find it useful.

The table below shows timings in ms for 100k, 1M and 10M elements in the data arrays, together with the instruction count of the generated loop body for each job.

Job         Instructions   100k      1M        10M
Float3      18             0.28ms    2.19ms    21.8ms
Float4to3   15             0.28ms    2.62ms    26.1ms
Float4      14             0.28ms    2.64ms    26.1ms
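
The measurement harness itself isn't shown in this post. Purely as a sketch, each timing could be taken along these lines; DATA_SIZE, the pre-allocated NativeArrays and the Stopwatch usage are assumptions for illustration, not the actual benchmark code behind the numbers above.

var job = new Float3Job
{
    dataSize = DATA_SIZE,   // 100k, 1M or 10M
    dataA = dataA,          // NativeArray<float3> of length DATA_SIZE
    dataB = dataB,
    dataOut = dataOut
};

var sw = System.Diagnostics.Stopwatch.StartNew();
job.Schedule().Complete(); // schedule the job and block until it has finished
sw.Stop();
UnityEngine.Debug.Log(DATA_SIZE + " elements: " + sw.Elapsed.TotalMilliseconds + " ms");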

This link is also interesting; it is about the GPU, but I guess it is also valid for CPU SIMD:

https://www.gamedev.net/forums/topic/336338-hlsl-sometimes-float4-sometimes-float3/?do=findComment&comment=3188515

The question is whether the Burst compiler can also combine different ops together and, in general, what the best practices are on this topic. If someone from Unity (@Joachim_Ante_1?) could comment on this, that would be much appreciated.

First, just for reference, a scalar version: 14 assembly instructions.

[ComputeJobOptimization]
struct Float1Job : IJob
{
    public int dataSize;
    [ReadOnly] public NativeArray<float> dataA;
    [ReadOnly] public NativeArray<float> dataB;
    [WriteOnly] public NativeArray<float> dataOut;

    public void Execute()
    {
        for (int i = 0; i < dataSize; i++)
        {
            float a = dataA[i];
            float b = dataB[i];
            float sum = a + b;
            float mul = a * b;
            float res = (sum - mul) / 10.0f;
            dataOut[i] = res;
        }
    }
}
mov     r8, qword ptr [rcx + 8]
mov     r9, qword ptr [rcx + 64]
movss   xmm1, dword ptr [r8 + rax]
movss   xmm2, dword ptr [r9 + rax]
movaps  xmm3, xmm2
addss   xmm3, xmm1
mulss   xmm2, xmm1
subss   xmm3, xmm2
mulss   xmm3, xmm0
mov     rdx, qword ptr [rcx + 120]
movss   dword ptr [rdx + rax], xmm3
inc     r10d
add     eax, 4
cmp     r10d, dword ptr [rcx]

Next, both the data and the calculations use float3s. We get 18 instructions, with a few extra insertps and extractps instructions to move the data in and out of the SIMD registers:

[ComputeJobOptimization]
struct Float3Job : IJob
{
    public int dataSize;

    [ReadOnly] public NativeArray<float3> dataA;
    [ReadOnly] public NativeArray<float3> dataB;
    [WriteOnly] public NativeArray<float3> dataOut;

    public void Execute()
    {
        for (int i = 0; i < dataSize; i++)
        {
            float3 a = dataA[i];
            float3 b = dataB[i];
            float3 sum = a + b;
            float3 mul = a * b;
            float3 res = (sum - mul) / 10.0f;
            dataOut[i] = res;
        }
    }
}
cdqe
mov     r8, qword ptr [rcx + 8]
mov     r9, qword ptr [rcx + 64]
movsd   xmm1, qword ptr [r8 + rax]
insertps        xmm1, dword ptr [r8 + rax + 8], 32
movsd   xmm2, qword ptr [r9 + rax]
insertps        xmm2, dword ptr [r9 + rax + 8], 32
movaps  xmm3, xmm2
addps   xmm3, xmm1
mulps   xmm2, xmm1
subps   xmm3, xmm2
mulps   xmm3, xmm0
mov     rdx, qword ptr [rcx + 120]
movss   dword ptr [rdx + rax], xmm3
extractps       dword ptr [rdx + rax + 4], xmm3, 1
extractps       dword ptr [rdx + rax + 8], xmm3, 2
inc     r10d
add     eax, 12
cmp     r10d, dword ptr [rcx]

Next, I used float4 arrays but swizzled them down to float3s for the computations. We get 15 instructions, with only one extra blendps:

[ComputeJobOptimization]
struct Float4to3Job : IJob
{
    public int dataSize;

    [ReadOnly] public NativeArray<float4> dataA;
    [ReadOnly] public NativeArray<float4> dataB;
    [WriteOnly] public NativeArray<float4> dataOut;

    public void Execute()
    {
        for (int i = 0; i < dataSize; i++)
        {
            float3 a = dataA[i].xyz;
            float3 b = dataB[i].xyz;
            float3 sum = a + b;
            float3 mul = a * b;
            float3 res = (sum - mul) / 10.0f;
            dataOut[i] = new float4(res, 0);
        }
    }
}
cdqe
mov     r8, qword ptr [rcx + 8]
mov     r9, qword ptr [rcx + 64]
movups  xmm2, xmmword ptr [r8 + rax]
movups  xmm3, xmmword ptr [r9 + rax]
movaps  xmm4, xmm3
addps   xmm4, xmm2
mulps   xmm2, xmm3
subps   xmm4, xmm2
mulps   xmm4, xmm0
blendps xmm4, xmm1, 8
mov     rdx, qword ptr [rcx + 120]
movups  xmmword ptr [rdx + rax], xmm4
inc     r10d
add     eax, 16
cmp     r10d, dword ptr [rcx]

Finally, both the data and the computations use float4s. We get 14 instructions, the same count as the reference scalar code.

[ComputeJobOptimization]
struct Float4Job : IJob
{
    public int dataSize;

    [ReadOnly] public NativeArray<float4> dataA;
    [ReadOnly] public NativeArray<float4> dataB;
    [WriteOnly] public NativeArray<float4> dataOut;

    public void Execute()
    {
        for (int i = 0; i < dataSize; i++)
        {
            float4 a = dataA[i];
            float4 b = dataB[i];
            float4 sum = a + b;
            float4 mul = a * b;
            float4 res = (sum - mul)/ 10.0f;
            dataOut[i] = res;
        }
    }
}
cdqe
mov     r8, qword ptr [rcx + 8]
mov     r9, qword ptr [rcx + 64]
movups  xmm1, xmmword ptr [r8 + rax]
movups  xmm2, xmmword ptr [r9 + rax]
movaps  xmm3, xmm2
addps   xmm3, xmm1
mulps   xmm1, xmm2
subps   xmm3, xmm1
mulps   xmm3, xmm0
mov     rdx, qword ptr [rcx + 120]
movups  xmmword ptr [rdx + rax], xmm3
inc     r10d
add     eax, 16
cmp     r10d, dword ptr [rcx]

So I have made an additional test, and it seems that Burst won't combine float3 and float operations into a single SIMD instruction. In other words, it will load the float3 as a float4 to do the SIMD addps and then do a scalar addss on the float.

In this case, I think the conclusion is: if we are not memory bound, it is fastest to store the data as float4s. We get all the basic arithmetic operations (addps, mulps etc.) with no insertps/extractps overhead, and using float3 ops such as math.cross via the float4.xyz swizzle costs only a single extra blendps.
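
As a rough illustration of that last point (this CrossViaSwizzleJob is made up for this example, it is not one of the measured jobs): the data stays in float4 arrays, the float3-only op is fed through the .xyz swizzle, and the result is widened back to a float4 for the store.

[ComputeJobOptimization]
struct CrossViaSwizzleJob : IJob
{
    public int dataSize;

    [ReadOnly] public NativeArray<float4> dataA;
    [ReadOnly] public NativeArray<float4> dataB;
    [WriteOnly] public NativeArray<float4> dataOut;

    public void Execute()
    {
        for (int i = 0; i < dataSize; i++)
        {
            // cross is defined on float3, so swizzle the inputs down...
            float3 c = math.cross(dataA[i].xyz, dataB[i].xyz);
            // ...and pack the result back into the float4 lane; per the
            // measurements above this packing is the single extra blend
            dataOut[i] = new float4(c, 0);
        }
    }
}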

Also, the Burst compiler won't unroll the loops (Loop Vectorization (unroll)) nor combine statements like the ones below into SIMD. Long story short, you need to explicitly ensure that vectorization is actually being used. Or am I missing something?

for (int i = 0; i < dataSize; i+=4)
{
    dataOut[i + 0] = dataA[i + 0] + dataB[i + 0];
    dataOut[i + 1] = dataA[i + 1] + dataB[i + 1];
    dataOut[i + 2] = dataA[i + 2] + dataB[i + 2];
    dataOut[i + 3] = dataA[i + 3] + dataB[i + 3];
}
[ComputeJobOptimization]
struct FloatAndFloat3Job : IJob
{
    public int dataSize;

    [ReadOnly] public NativeArray<float> dataS;
    [ReadOnly] public NativeArray<float3> dataA;
    [ReadOnly] public NativeArray<float3> dataB;
    [WriteOnly] public NativeArray<float3> dataOut;

    public void Execute()
    {
        for (int i = 0; i < dataSize; i++)
        {
            float3 a = dataA[i].xyz;
            float3 b = dataB[i].xyz;
            float s = dataS[i];
            float3 sum = a + b;
            float sumS = s + s;
            float3 mul = a * b;
            float3 res = (sum - mul) ;
            dataOut[i] = res * sumS;
        }
    }
}
mov     r9, qword ptr [rcx + 8]
mov     r10, qword ptr [rcx + 64]
cdqe
movsd   xmm0, qword ptr [r10 + rax]
insertps        xmm0, dword ptr [r10 + rax + 8], 32
mov     r10, qword ptr [rcx + 120]
movsd   xmm1, qword ptr [r10 + rax]
insertps        xmm1, dword ptr [r10 + rax + 8], 32
movsxd  rdx, edx
movss   xmm2, dword ptr [r9 + rdx]
movaps  xmm3, xmm1
addps   xmm3, xmm0
addss   xmm2, xmm2
mulps   xmm1, xmm0
subps   xmm3, xmm1
shufps  xmm2, xmm2, 192
mulps   xmm2, xmm3
mov     r9, qword ptr [rcx + 176]
movss   dword ptr [r9 + rax], xmm2
extractps       dword ptr [r9 + rax + 4], xmm2, 1
extractps       dword ptr [r9 + rax + 8], xmm2, 2
inc     r8d
add     edx, 4
add     eax, 12
cmp     r8d, dword ptr [rcx]
jl      .LBB0_2

Due to the above, simply converting the data from a float1 array to float4 for SIMD requires more than double the instructions (10 vs 23)! Is there any better way of doing this? Like some casts of variables, or maybe even of whole arrays when setting up the jobs?

for (int i = 0; i < dataSize; i ++)
{
   //data arrays are float4
    float4 a = dataA[i];
    float4 b = dataB[i];

    float4 sum = a + b;

    dataOut[i] = sum;
}

10 instructions

mov     r8, qword ptr [rcx + 8]
mov     r9, qword ptr [rcx + 64]
movups  xmm0, xmmword ptr [r8 + rax]
movups  xmm1, xmmword ptr [r9 + rax]
addps   xmm1, xmm0
mov     rdx, qword ptr [rcx + 120]
movups  xmmword ptr [rdx + rax], xmm1
inc     r10d
add     eax, 16
cmp     r10d, dword ptr [rcx]
for (int i = 0; i < dataSize; i+=4)
{
   //data arrays are float1
    float4 a = new float4(dataA[i + 0], dataA[i + 1], dataA[i + 2], dataA[i + 3]);
    float4 b = new float4(dataB[i + 0], dataB[i + 1], dataB[i + 2], dataB[i + 3]);

    float4 sum = a + b;

    dataOut[i + 0] = sum.x;
    dataOut[i + 1] = sum.y;
    dataOut[i + 2] = sum.z;
    dataOut[i + 3] = sum.w;
}

23 instructions

lea     eax, [rdx - 12]
movsxd  r9, eax
lea     eax, [rdx - 8]
movsxd  r10, eax
lea     eax, [rdx - 4]
movsxd  r11, eax
movsxd  rdx, edx
mov     rax, qword ptr [rcx + 8]
mov     rsi, qword ptr [rcx + 64]
movups  xmm0, xmmword ptr [rax + r9]
movups  xmm1, xmmword ptr [rsi + r9]
addps   xmm1, xmm0
mov     rax, qword ptr [rcx + 120]
movss   dword ptr [rax + r9], xmm1
mov     rax, qword ptr [rcx + 120]
extractps       dword ptr [rax + r10], xmm1, 1
mov     rax, qword ptr [rcx + 120]
extractps       dword ptr [rax + r11], xmm1, 2
mov     rax, qword ptr [rcx + 120]
extractps       dword ptr [rax + rdx], xmm1, 3
add     r8d, 4
add     edx, 16
cmp     r8d, dword ptr [rcx]

If you have an array of float1 you need to convert them somewhere,
or you can do the equivalent of C++ "reinterpret_cast<float4*>(float1Array)" using unsafe code (see the sketch below).
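
As a sketch of that unsafe-code route (assuming dataSize is a multiple of 4, that unsafe code is enabled for the assembly, and using the GetUnsafeReadOnlyPtr/GetUnsafePtr extensions from Unity.Collections.LowLevel.Unsafe):

[ComputeJobOptimization]
unsafe struct ReinterpretedFloatJob : IJob
{
    public int dataSize; // number of floats, assumed to be a multiple of 4

    [ReadOnly] public NativeArray<float> dataA;
    [ReadOnly] public NativeArray<float> dataB;
    [WriteOnly] public NativeArray<float> dataOut;

    public void Execute()
    {
        // the managed equivalent of reinterpret_cast<float4*>: view the
        // float buffers as float4 buffers without copying anything
        float4* a   = (float4*)dataA.GetUnsafeReadOnlyPtr();
        float4* b   = (float4*)dataB.GetUnsafeReadOnlyPtr();
        float4* res = (float4*)dataOut.GetUnsafePtr();

        for (int i = 0; i < dataSize / 4; i++)
            res[i] = a[i] + b[i]; // one addps per 4 floats
    }
}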

If you want to ensure that your code is 100% vectorised, keep everything as float4 / int4 and use a manual SoA (structure-of-arrays) form.
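
A manual SoA form could look roughly like the sketch below (illustrative only, not code from this thread): each float4 packs the same component of four consecutive elements, so every operation maps directly onto a packed SIMD instruction.

[ComputeJobOptimization]
struct SoAAddJob : IJob
{
    public int batchCount; // element count / 4

    // each float4 holds one component of 4 consecutive elements
    [ReadOnly] public NativeArray<float4> posX, posY, posZ;
    [ReadOnly] public NativeArray<float4> velX, velY, velZ;
    [WriteOnly] public NativeArray<float4> outX, outY, outZ;

    public void Execute()
    {
        for (int i = 0; i < batchCount; i++)
        {
            outX[i] = posX[i] + velX[i]; // addps over 4 x components at once
            outY[i] = posY[i] + velY[i];
            outZ[i] = posZ[i] + velZ[i];
        }
    }
}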

Over time we will keep improving burst and particularly the auto-vectorisation to make sure we produce the best possible code when it is written as scalar code.


Thanks, I have shared the above code snippets here: GitHub - korzen/Unity3D-JobsSystemAndBurstSamples (examples of using the Job System in Unity3D).


Sorry for the late feedback.

Currently, there is an issue with the released version of burst 0.2.3 that has a regression with noalias. As you have experienced, your scalar loops are not auto-vectorized because of it. It has been fixed on our side, and we will hopefully publish a new version in the coming days.

Note that there will still be cases where the auto-vectorizer won't be able to do its job. For the 2018.2 beta, we will give more details about these cases (some of them are in this thread), and we will work to improve auto-vectorization in subsequent releases.


FYI, we just released a new version of burst that fixes the regression when auto-vectorizing scalar loops. You can update to the latest burst package, 0.2.4-preview.4 (you can follow this post on how to update your manifest.json).

This should correctly vectorize the scalar loop in your example above and make it equivalent to a manual float4 loop.
