Burst Loop Vectorization

I am trying to get started with writing code that gets autovectorized. I am really new to this low-level Burst stuff, but I am eager to learn.

Right now I can’t even get it to work for a very simple test loop:

    [BurstCompile]
    public static void TestLoopVec([NoAlias] NativeArray<int> dataA, [NoAlias] NativeArray<int> dataOut)
    {
        for (int i = 0; i < dataA.Length; i++)
        {
#if UNITY_BURST_EXPERIMENTAL_LOOP_INTRINSICS
            Unity.Burst.CompilerServices.Loop.ExpectVectorized();
#endif
            var a = dataA[i];
            var res = a + 1;
            dataOut[i] = res;
        }
    }

That loop is called from inside a job’s Execute function:

        var inputArray = new NativeArray<int>(4,Allocator.Temp);
        inputArray[0] = 0;
        inputArray[1] = 1;
        inputArray[2] = 2;
        inputArray[3] = 3;
      
        var outputArray = new NativeArray<int>(inputArray.Length,Allocator.Temp);
    
        TestLoopVec(inputArray, outputArray);

It’s probably something stupid I’m missing here, but why wouldn’t this vectorize?

My actual use case would be to evaluate a curve function for every element in a DynamicBuffer. This happens 10k+ times per AI system update, so I thought it would make sense to at least look into SIMD.

Hey, I tried compiling your code with the fewest possible modifications, and on my end it looks like the loop is vectorized properly in most cases. I’m pretty bad at reading assembly though, so don’t ask me exactly what’s happening. I just look for them pretty pink asm ops.

[BurstCompile]
public static class VectorizationTest
{
    [BurstCompile]
    public static void TestLoopVec([NoAlias] ref UnsafeList<int> dataA, [NoAlias] ref UnsafeList<int> dataOut)
    {
        for (int i = 0; i < dataA.Length; i++)
        {
            Unity.Burst.CompilerServices.Loop.ExpectVectorized();
            int a = dataA[i];
            int res = a + 1;
            dataOut[i] = res;
        }
    }
}

Compiles to this:
Assembly Screenshot

(Yes, you can compile a standalone function with Burst, although there are some constraints, hence the UnsafeList instead of a NativeArray.) A complete job struct looks more like this:

[BurstCompile]
struct VectorizationTestJob : IJob
{
    public NativeArray<int> Data1;
    public NativeArray<int> Data2;

    void IJob.Execute()
        => TestLoopVec(Data1, Data2);

    static void TestLoopVec([NoAlias] NativeArray<int> dataA, [NoAlias] NativeArray<int> dataOut)
    {
        for (int i = 0; i < dataA.Length; i++)
        {
            Unity.Burst.CompilerServices.Loop.ExpectVectorized();
            int a = dataA[i];
            int res = a + 1;
            dataOut[i] = res;
        }
    }
}

Compiles like this:
Assembly Screenshot

Note that in both of these cases the loop doesn’t know what the size of the array is. In your original example the compiler can theoretically deduce that the array has a specific size. You’re creating a NativeArray struct with a specific (and very short) constant length, and your TestLoopVec function gets inlined and simplified based on that hint. I’m guessing that this is influencing the compilation result.

Back to your code, I put it back together like this:

[BurstCompile]
struct VectorizationTestOriginalJob : IJob
{
    public NativeArray<int> Data1;
    public NativeArray<int> Data2;

    void IJob.Execute()
    {
        var inputArray = new NativeArray<int>(4,Allocator.Temp);
        inputArray[0] = 0;
        inputArray[1] = 1;
        inputArray[2] = 2;
        inputArray[3] = 3;
        var outputArray = new NativeArray<int>(inputArray.Length,Allocator.Temp);
        TestLoopVec(inputArray, outputArray);
    }

    static void TestLoopVec([NoAlias] NativeArray<int> dataA, [NoAlias] NativeArray<int> dataOut)
    {
        for (int i = 0; i < dataA.Length; i++)
        {
            Unity.Burst.CompilerServices.Loop.ExpectVectorized();
            int a = dataA[i];
            int res = a + 1;
            dataOut[i] = res;
        }
    }
}

It compiles to this (skipping over some unimportant bits in the burst.initialize and burst.initialize.externals sections):

Assembly Screenshot

Technically there are fewer vectorized instructions. However, I’m not getting any errors from Loop.ExpectVectorized(), and the diagnostics tab doesn’t warn me about Burst being unable to vectorize the loop. Let’s see what happens when we change the setup code to this:

void IJob.Execute()
{
    var inputArray = new NativeArray<int>(8192, Allocator.Temp);
    for (int i = 0; i < 8192; ++i)
        inputArray[i] = i;
    var outputArray = new NativeArray<int>(inputArray.Length,Allocator.Temp);
    TestLoopVec(inputArray, outputArray);
}

Assembly Screenshot

Not gonna lie, it looks like Latin to me. But it’s in pink and starts with a v, so I’m happy. This looks much closer to the vectorized assembly above. Basically, the compiler apparently decided it’s not worth emitting all this code for looping over an array that contains just 4 items. Perhaps someone more well-versed in pink vlatin can offer additional explanation.
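
For what it’s worth, here is a rough sketch in plain C# (no Burst involved) of the general shape a vectorized loop tends to have: a main body that handles several elements per iteration, plus a scalar tail for the leftovers. The lane count of 4 is an assumption (four 32-bit ints in one 128-bit register), and the class and method names are made up for illustration. This is not what Burst actually emits, just a way to picture it.

```csharp
using System;

public static class VectorizedLoopShape
{
    // Assumed lane count: four 32-bit ints per 128-bit register.
    const int Lanes = 4;

    public static void AddOne(int[] input, int[] output)
    {
        int i = 0;
        // Largest multiple of Lanes that fits into the input length.
        int vectorEnd = input.Length - (input.Length % Lanes);

        // Main body: each iteration stands in for one SIMD add over 4 lanes.
        for (; i < vectorEnd; i += Lanes)
        {
            output[i + 0] = input[i + 0] + 1;
            output[i + 1] = input[i + 1] + 1;
            output[i + 2] = input[i + 2] + 1;
            output[i + 3] = input[i + 3] + 1;
        }

        // Scalar tail: handles the 0 to 3 elements left over.
        for (; i < input.Length; i++)
        {
            output[i] = input[i] + 1;
        }
    }
}
```

With a length of exactly 4 known at compile time, the main body would run once and the tail never, which may be part of why the short-array version produces so much less code.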


Thanks for looking into it in so much detail!
I copied your code with the bigger array setup, and this is how it looks for me:

Do you also get the compiler error
“The loop is not vectorized where it was expected that it is vectorized.”?

It might be relevant that I am doing this on an M1 Mac. Maybe Burst can’t handle it yet? I’m using Burst 1.5.0.

Oh yeah, I forgot to mention, I’m on Burst 1.6.1, but that shouldn’t make a big difference. On my end I’m getting no vectorization errors for any of the examples.

This looks like x86 assembly. I’m guessing M1 doesn’t support all of the SIMD features when emulating x86, so that could be one reason why we’re seeing a difference. Have you tried compiling for the Neon architecture? (You should see a dropdown in the Burst Inspector window, but I don’t know much about Unity on M1, so I’m not sure that’s even possible)


Ah sorry, it’s my first time in the Burst Compiler window :) I checked out the NEON assembly and it does indeed seem to be vectorized. And after reading up on assembly language for the last 2 hours, I think the loop is actually vectorized even in the assembly screen I posted last (it seems to be SSE4). So the only question that remains is why the compiler throws the error in my case. I would really prefer to use “Unity.Burst.CompilerServices.Loop.ExpectVectorized();” instead of checking the assembly each time.


Because whatever platform you’re currently targeting isn’t vectorizing. If you can’t vectorize for all platforms, you’re going to have to wrap it conditionally for the platforms you expect to vectorize (Neon). Alternatively, disable the platforms you don’t support in the Burst settings.
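
A minimal sketch of what wrapping it conditionally could look like. MYGAME_EXPECT_VECTORIZED is a hypothetical scripting define symbol, not something Burst provides; the idea would be to define it only for the build targets where you expect the loop to vectorize (for example through the per-platform Scripting Define Symbols in Player Settings). The job name and fields are also made up for illustration, and whether this removes the error in your setup depends on which targets Burst ends up compiling for.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

[BurstCompile]
struct GuardedVectorizationJob : IJob
{
    [ReadOnly] public NativeArray<int> Input;
    public NativeArray<int> Output;

    public void Execute()
    {
        for (int i = 0; i < Input.Length; i++)
        {
            // MYGAME_EXPECT_VECTORIZED is a hypothetical, user-defined symbol.
            // The check is only compiled in when both symbols are defined.
#if UNITY_BURST_EXPERIMENTAL_LOOP_INTRINSICS && MYGAME_EXPECT_VECTORIZED
            Unity.Burst.CompilerServices.Loop.ExpectVectorized();
#endif
            Output[i] = Input[i] + 1;
        }
    }
}
```

On targets where the symbol is not defined, the loop still compiles and runs; it just loses the compile-time check.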

In the Burst AOT Settings I am not able to choose NEON for the 64-bit Target CPU Architecture (SSE2, SSE4, AVX, AVX2).
In the Build Settings I set the Target Platform to Standalone macOS and tried the different Architecture settings (Intel 64-bit + Apple Silicon, Apple Silicon, Intel 64-bit). I always got the compiler error. Are these settings even relevant for the JIT?
Maybe there are settings I don’t know about.

Is there a way to see which assembly code is actually running? When “AUTO” is selected in the Burst Inspector, shouldn’t it show me the current JIT target architecture?

I updated to Burst 1.6. The changelog says Burst has native support for M1 now. The game runs with a few more frames in the Editor, but there is still no setting that gets rid of the compiler error introduced by
“Unity.Burst.CompilerServices.Loop.ExpectVectorized();”

With Burst 1.6, when I try to inspect the code for NEON (both THUMB2 and ARMv7A), only the error
“Loop is not vectorized where it was expected that it is vectorized” shows up.
Every other architecture seems to work. This was not the case with Burst 1.5.
The M1 should be ARMv8A, which does not throw an error.

Maybe that compiler error shows up if the loop can’t be vectorized for ANY one of the platforms?


Having the same confusion here:

        static unsafe void Calc1(
            [NoAlias] float *a,
            int count)
        {
            for (int i = 0; i < count; ++i)
            {
#if UNITY_BURST_EXPERIMENTAL_LOOP_INTRINSICS
                Unity.Burst.CompilerServices.Loop.ExpectVectorized();
#endif
                a[i] = a[i] + 1f;
            }
        }

Even simple tests like this one result in “BC1321: The loop is not vectorized where it was expected that it is vectorized.”, and I don’t understand why.
In the Burst AOT settings I’m targeting Windows, from SSE2 to AVX2, and my current computer has an i9. I have Burst 1.8.2 and Unity 2021.3.f5.


Try the same with an additional array and it will work. From my experience input/output is a requirement for proper SIMD code. Maybe it’s possible to do it by hand and use the intrinsics but I’m not really sure.


Do you mean like this?

results[i] = a[i] + 1f;
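
Spelled out as a full function, the two-array variant might look like the sketch below. Calc2 and the class name are made up for illustration, and the second pointer is assumed to point at a separate buffer of at least count floats. I have not confirmed that this version gets past ExpectVectorized.

```csharp
using Unity.Burst;

[BurstCompile]
public static unsafe class CalcExamples
{
    // Same loop as Calc1, but reading from one buffer and writing to another.
    // Both pointers are marked [NoAlias]; the caller must make sure they really
    // point at non-overlapping memory.
    [BurstCompile]
    public static void Calc2(
        [NoAlias] float* a,
        [NoAlias] float* results,
        int count)
    {
        for (int i = 0; i < count; ++i)
        {
#if UNITY_BURST_EXPERIMENTAL_LOOP_INTRINSICS
            Unity.Burst.CompilerServices.Loop.ExpectVectorized();
#endif
            results[i] = a[i] + 1f;
        }
    }
}
```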
