IJobParallelForTransform and Burst?

Greetings,
I am attempting to set up a TransformJob and am curious whether (and how) it will work with Burst. However, I currently have one issue, with this line:

transform.localRotation = math.mul(StartRotation,
math.mul(quaternion.Euler(0f, _eulers.y, 0f), quaternion.Euler(_eulers.x, 0f, _eulers.z)));

The Burst Inspector's LLVM IR Optimisation Diagnostics window reports the following:

When I looked at the quaternion.cs code in the GitHub repository, I couldn’t find exactly where the select is, except in LookRotation and the safe option(s).

So, what am I doing wrong? Please help. ;w;

I’m not sure you’re doing anything wrong so much as there’s a problem with the quaternion.Euler() function you’re calling interacting with the LLVM compiler (through Burst).

See

I have done a lot more testing now, and it seems to come down to IJobParallelFor not supporting Burst? I even tried it with an empty Execute, and it still could not vectorize. I keep getting "loop control flow not understood by vectorizer", even in a plain IJobParallelFor. The reported location is unknown:0:0, by the way.

Here is an example of a simple IJobParallelFor that is not understood by the vectorizer:

    [BurstCompile]
    public struct ForJob : IJobParallelFor
    {
        [ReadOnly]
        public NativeArray<float3> InPosition;

        [WriteOnly]
        public NativeArray<float3> OutPosition;

        [ReadOnly]
        public float DeltaTime;

        public void Execute(int index)
        {
            OutPosition[index] = OutPosition[index] + InPosition[index] * DeltaTime;
        }
    }

The Burst Inspector's LLVM IR Optimisation Diagnostics window responds with:
Remark: unknown:0:0: loop not vectorized: loop control flow not understood by vectorizer

What am I missing?

Ok, just tried the “successful vectorization” example from the SlideShare:
Code here:

    [BurstCompile]
    public struct VectorizeDemo : IJob
    {
        public NativeArray<int> Inputs;
        public NativeArray<int> Outputs;

        public void Execute()
        {
            for(int i = 0; i < Inputs.Length; ++i)
            {
                if(Inputs[i] >= 0)
                {
                    Outputs[i] = Inputs[i];
                }
                else
                {
                    Outputs[i] = 0;
                }
            }
        }
    }

And it still gets:
Remark: unknown:0:0: loop not vectorized: loop control flow not understood by vectorizer

What is going on?

For the example, uncheck Safety Checks in the Burst inspector.

For everything else, I have never been able to get the autovectorizer to kick in when using types defined by Mathematics.

Keep in mind Burst still makes a huge difference even for scalar code, so unless this is a real bottleneck, I wouldn’t fret too much about it.

It was a bit confusing to see how the “Safety Checks”-option works because I couldn’t directly see the relation…

This is unfortunately the bottleneck of the project. It’s the system that assigns rotations for the animation system which will be used for a few thousand NPCs’ and the player’s joints. This is the only significant system in the game that will run every frame, however I need the frametime bandwidth to also allow for AI state machine updates (runs 4 times per second) and culling/other graphics jobs (like mesh creation).

I have looked into vertex animation for distant NPCs, but unfortunately it has limits, as it is not as dynamic. NPCs would have to be switched between the two approaches, and making a system for this could potentially be a little out of scope for now, but if there is no other option I might look into it next.

Couple of things:

  • If you are using an already pre-vectorized type the compiler will generally not vectorize the loop. This is because the cost of undoing your vectorization (using a float3) and turning it into the ‘proper’ vectorized type (float4/float8) would outweigh any benefits.

  • You can see with the Burst inspector that the body of your ForJob is using the vector unit (vmulps → vaddps).

  • For the VectorizeDemo job - @DreamingImLatios is correct that the safety checks are what is causing the vectorization to be disabled there. We’ve got some longer term plans to try and make LLVM understand this, but at present it’s a really thorny issue in the compiler to work around.

For the first point there is one additional workaround you could do - if you change how you schedule the ForJob and specify arrayLength * 3 as the length, you can do the following:

    [BurstCompile]
    public struct ForJob : IJobParallelFor
    {
        [ReadOnly]
        public NativeArray<float3> InPosition;

        [WriteOnly]
        public NativeArray<float3> OutPosition;

        [ReadOnly]
        public float DeltaTime;

        public void Execute(int index)
        {
            // when you schedule the job, remember to do arrayLength * 3!
            var actualIn = InPosition.Reinterpret<float>(UnsafeUtility.SizeOf<float3>());
            var actualOut = OutPosition.Reinterpret<float>(UnsafeUtility.SizeOf<float3>());

            actualOut[index] = actualOut[index] + actualIn[index] * DeltaTime;
        }
    }

Which turns the loop into:

.LBB0_11:
.Ltmp11:
       
        .cv_inline_site_id 1 within 0 inlined_at 1 0 0
        === MathTest.cs(506, 1)            actualOut[index] = actualOut[index] + actualIn[index] * DeltaTime;
        vmulps        ymm2, ymm1, ymmword ptr [rsi + 4*rax - 96]
        vmulps        ymm3, ymm1, ymmword ptr [rsi + 4*rax - 64]
        vmulps        ymm4, ymm1, ymmword ptr [rsi + 4*rax - 32]
        vmulps        ymm5, ymm1, ymmword ptr [rsi + 4*rax]
        vaddps        ymm2, ymm2, ymmword ptr [rdi + 4*rax - 96]
        vaddps        ymm3, ymm3, ymmword ptr [rdi + 4*rax - 64]
        vaddps        ymm4, ymm4, ymmword ptr [rdi + 4*rax - 32]
        vaddps        ymm5, ymm5, ymmword ptr [rdi + 4*rax]
.Ltmp12:
        vmovups        ymmword ptr [rdi + 4*rax - 96], ymm2
        vmovups        ymmword ptr [rdi + 4*rax - 64], ymm3
        vmovups        ymmword ptr [rdi + 4*rax - 32], ymm4
        vmovups        ymmword ptr [rdi + 4*rax], ymm5
        add        rax, 32
        cmp        rbx, rax
        jne        .LBB0_11

That is with safety checks off. It’s not the prettiest code but it should get you an additional 25% perf on SSE because you are using all 4 vector elements of each mul/add pair. It’s on my wish list that LLVM would improve its vectorization to be able to automatically do the above transformation for y’all, but at present this is the best we’ve got :)
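For reference, the scheduling side of that workaround might look roughly like the sketch below. The helper class and method names are hypothetical, the batch size of 64 is an arbitrary choice, and it assumes the ForJob struct above is in scope; the only point it illustrates is passing Length * 3 so that the job's index walks individual floats rather than float3 elements.

```csharp
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

public static class ForJobScheduling
{
    // Hypothetical helper (not from the thread): schedules the ForJob above
    // over individual floats instead of float3 elements.
    public static JobHandle ScheduleOverFloats(
        NativeArray<float3> inPosition,
        NativeArray<float3> outPosition,
        float deltaTime,
        JobHandle dependsOn = default)
    {
        var job = new ForJob
        {
            InPosition = inPosition,
            OutPosition = outPosition,
            DeltaTime = deltaTime,
        };

        // Each float3 contributes three floats, so the index range must
        // cover Length * 3 entries of the reinterpreted arrays.
        int floatCount = inPosition.Length * 3;

        // 64 is an arbitrary inner-loop batch size chosen for illustration.
        return job.Schedule(floatCount, 64, dependsOn);
    }
}
```

Since the assembly above was produced with safety checks off, it is worth testing this with safety checks on as well before relying on it; the container checks may reject the wider index range.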

Thank you so much for this. :3c

The reason for the float3 is they are 3D-positions or Euler-rotations for objects. However, would it be more “efficient” to just use float4 and leave the .w at 0, using just .xyz?

So if you wanted to get really optimal by default then you’d want to split out the .x’s, .y’s and .z’s into separate Native Arrays of data - that’s the most optimal way to deal with data in a data oriented fashion.

That being said, for this example I think the float3’s are fine, and using my… hack(?) above will get you a bit more performance out of the same code!
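As a rough illustration of the split-arrays layout described above, here is a minimal sketch of the same update with one NativeArray<float> per component. The job and field names are made up for the example, and whether the loop actually vectorizes would still need to be confirmed in the Burst Inspector.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

// Hypothetical job (names are illustrative): the same update as ForJob,
// but with each component stored in its own NativeArray<float>.
[BurstCompile]
public struct ForJobSplit : IJobParallelFor
{
    [ReadOnly] public NativeArray<float> InX;
    [ReadOnly] public NativeArray<float> InY;
    [ReadOnly] public NativeArray<float> InZ;

    // These are both read and written, so they carry no [WriteOnly].
    public NativeArray<float> OutX;
    public NativeArray<float> OutY;
    public NativeArray<float> OutZ;

    public float DeltaTime;

    public void Execute(int index)
    {
        // Each statement touches one contiguous stream of floats.
        OutX[index] = OutX[index] + InX[index] * DeltaTime;
        OutY[index] = OutY[index] + InY[index] * DeltaTime;
        OutZ[index] = OutZ[index] + InZ[index] * DeltaTime;
    }
}
```

This job would be scheduled with the element count of one component array (all six arrays are assumed to have the same length), not Length * 3.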

Hmm, I have had a long think about data management for this now, and it seems that regardless of what I do there will unfortunately be a lot of floats floating around. They could perhaps be grouped by something (but then we are back to why not just use float3/float4s)? I am rather new to all of this and unfortunately kind of tied to MonoBehaviours for legacy reasons.