Removing branching with math.select and SIMD

Hi,

I have recently come across math.select and am now trying to reduce branching in some of my bursted jobs to improve performance. But I’m not sure if what I’m doing is actually improving performance or not, so I’m hoping someone can tell me please :slight_smile:

Let’s say I have the following code:

if (something)
{
    myQuaternion = quaternion.LookRotation(myForward, math.up());
}
else
{
    myQuaternion = quaternion.identity;
}

In theory, I could change this to use math.select like the following:

myQuaternion = math.@select(quaternion.identity.value, quaternion.LookRotation(myForward, math.up()).value, something);
  1. Firstly, I’m assuming here that quaternion.LookRotation() will only be called if ‘something’ is true due to the inline bursted nature of the function turning it into a ternary operation? I’m sure there’s a way to easily check the compiled code but I’m not familiar with the process.

  2. Secondly, is there even any SIMD benefit with a math.@select(float4, float4, bool) as above, or does that only really come from the component-wise overloads like math.@select(float4, float4, bool4) that use a bool4? I’ve tried to find this information online but I can only find this thread, which talks about the bool4 version:

If indeed the only SIMD-capable overloads are ones with multiple bools, then I feel like they can only be used in very specific situations, and therefore removing some branching in performance-critical code using this function is more difficult than it first seemed.

Many thanks


I think this thread also covers this. Is there reason to use math.select?


This is true, thanks :slight_smile: I had actually seen this also and it does give a good overall description of what’s going on but not with the specifics I’m looking for unfortunately.

The C# source code of math.select shows what the managed implementation looks like but not with any details of SIMD optimisation. Specifically, I’m after the point at which we start seeing benefits of vectorisation using math.select.

The thread listed in my original post illustrates that the single-bool overload is equal to the ternary operator but the bool4 overload takes full advantage of SIMD. So we have both ends of the spectrum but what about the functions in the middle? For me, the most common ones I’m using are math.select(float3, float3, bool) and math.select(float4, float4, bool) so it’d be good to know if these have some SIMD speedup.

The key is that math.select is wired into the compiler, giving it more information to work with. Whether or not that information is helpful may not always be clear. I have seen Burst generate all sorts of code from math.select.

One of the more interesting things I’ve seen: if I compare two float4s with each other (greater-than), extract a bool4, and then use that in multiple math.select operations, and one of those math.select operations selects between the same two float4s used in the original comparison, Burst will replace that instruction with what is effectively a math.max.

In this case, Burst has assessed that the latency of math.max is equivalent to the SIMD blend, but no longer dependent on the result of the comparison operation. So by doing it this way, the CPU could potentially run both the comparison and the max operation at the same time using different ALU ports.
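To make the rewrite described above concrete, here is a minimal scalar C sketch (my own illustration, not Burst output; the function names are hypothetical): selecting between two values based on a greater-than comparison of those same two values is, per component, exactly max.

```c
#include <assert.h>

/* math.select(b, a, a > b) applied to a single component is max(a, b),
   which is why the compiler can emit a max instruction instead of a
   compare + blend that depends on the comparison result. */
static float select_is_max(float a, float b)
{
    int gt = a > b;        /* the "bool" extracted from the comparison */
    return gt ? a : b;     /* select(b, a, a > b) == max(a, b)         */
}

/* 4-wide version mirroring math.select(float4, float4, bool4) */
static void select_is_max4(const float a[4], const float b[4], float out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = (a[i] > b[i]) ? a[i] : b[i];
}
```

The loop body is exactly the per-lane semantics of a SIMD max, so a vectorizing compiler can collapse it to one instruction.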


The best way to check for performance improvements is to measure. Set up simple tests running over, say, 1 million values (run them multiple times before timing to do warmups) and time them using a simple Stopwatch in C# or something.

Regarding checking if something is vectorized or not, the best way to do it is to look at the assembly with the burst inspector. Here’s a good video on SIMD in burst in general, which also gives a good “smell test” that allows you to gain some understanding of your SIMD assembly code, even if you don’t understand all of it.

Regarding whether or not your example is vectorized, I created a small test job:

        [BurstCompile]
        public struct SelectTest
            : IJobParallelFor
        {

            [ReadOnly]
            public float3 Forward;

            [ReadOnly]
            public NativeArray<bool> Check;

            [WriteOnly]
            public NativeArray<quaternion> Result;

            public void Execute(int index)
            {
                Result[index] = math.select(quaternion.identity.value, quaternion.LookRotation(Forward, math.up()).value, Check[index]);
            }
        }

This generates the following assembly for the select:

.Ltmp3:
        test        byte ptr [rdx], 1
        vmovaps        xmm1, xmm7
        je        .LBB4_7
        vmovaps        xmm1, xmm0
        jmp        .LBB4_7

So, in this case math.select is not removing the branch.
If you changed Check[index] to (bool4)Check[index] you would see burst use some blend operations in the assembly, rather than jumps. I assume it doesn’t do it here because converting the single bool to a bool4 is estimated to cost more than a relatively short jump.
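For intuition about the two code shapes Burst is choosing between here, this is a scalar C sketch of mine (function names are my own): a plain branch versus a branchless merge through an all-ones/all-zeros mask, which is what a SIMD blend instruction does in every lane at once.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Shape 1: compare and jump (or a conditional move), as in the
   assembly above. */
static float branchy_select(float if_false, float if_true, int cond)
{
    return cond ? if_true : if_false;
}

/* Shape 2: build a 0x00000000 or 0xFFFFFFFF mask from the condition,
   then merge the two candidates with pure bitwise ops -- no jump.
   This is the scalar analogue of the blend Burst emits for bool4. */
static float blend_select(float if_false, float if_true, int cond)
{
    uint32_t mask = (uint32_t)-(int32_t)(cond != 0);
    uint32_t f, t, r;
    float result;
    memcpy(&f, &if_false, sizeof f);   /* reinterpret float bits */
    memcpy(&t, &if_true, sizeof t);
    r = (t & mask) | (f & ~mask);
    memcpy(&result, &r, sizeof r);
    return result;
}
```

The mask version always touches both inputs, which is why it only pays off when both candidates are cheap to have on hand, matching the observation that a single bool plus a short jump can be the better deal.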

A good place to start if you want to get more familiar with SIMD stuff is to open the Burst Inspector and try to figure out what’s going on in simple jobs, googling instructions you don’t understand etc. Just get into the habit of looking at the generated assembly in the burst inspector. Just by opening it up, looking at it, and thinking a bit about what’s going on there you’re making it less scary and each time you understand a bit more :slight_smile:


Thanks very much for the replies! This has cleared up a lot for me and I appreciate you taking the time :smile:

I definitely need to become more familiar with assembly and will follow your advice on checking the burst inspector more often. You’re right, it is a bit scary to begin with :stuck_out_tongue: I’ll also check out the video.

Cheers

Honestly, this is more difficult to answer than a beginner, or even someone already familiar with all of this, might initially think.
I’d say “always use select and let the compiler decide the rest” for a start, unless you want to get into all the details.

In this case, though, a branch is much better.
Either code path might be a one-liner in C#, but the first one is quite large in machine code relative to the second. It isn’t worth calculating both results and merging them branch-free at the end. And this is why select is powerful: imagine a float4 vector ‘x’ where you need either log(x) or pow(x, math.E) for each element, based on some vector of corresponding booleans. It would be terrible to dynamically call log n times and pow m times in a scalar fashion, based on 4 branches. Instead, you calculate both results and merge the desired values from the two vectors into a single vector based on a boolean vector. And that is the single hardware instruction select wraps.
But, as stated, there can be more to this. Imagine codepath1 being much larger in size than codepath2, and codepath1 having only a 20% chance of containing a desired value. Do you still use select, or insert a branch with something like if (math.all(x >= 0f))?

Hopefully it is clear that this is only really relevant for vector types, and not for complex structs like a quaternion or even scalar ints.

Have you tried eliminating the branching by doing both operations and multiplying each by the condition as an int?
Sometimes I use this trick to eliminate branching when both paths are very light to process.

Eg:
int resultCondition = condition ? 1 : 0;    // bool as int (0 or 1)
int resultCondition2 = 1 - resultCondition; // inverted condition (1 or 0)
var result = (path1 * resultCondition) + (path2 * resultCondition2);
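As a quick sanity check, here is the same multiply trick as a runnable C sketch (the function name is my own; both paths are always evaluated, so it only makes sense when both are cheap):

```c
#include <assert.h>

/* Branch elimination by multiplying each path with a 0/1 condition:
   exactly one of the two products survives, the other is zeroed out. */
static int mul_select(int path1, int path2, int cond)
{
    int c1 = (cond != 0);   /* bool as int: 1 if true, 0 if false */
    int c2 = 1 - c1;        /* inverted condition                 */
    return path1 * c1 + path2 * c2;
}
```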


But multiplication is relatively slow :wink: Use…

unsafe static byte AsInt(bool b)
{
    return *(byte*)&b;
}

static int AsMask(bool b)
{
    return -AsInt(b);
}

And then (path1 & AsMask(condition1)) + (path2 & AsMask(condition2)). Negating and AND-ing (negating 1 is -1 which is all 1s in 2s complement; -0 = 0 which is all 0s in 2s complement) only ever takes 2 cycles; multiplication 4-5. And you can only issue one multiplication per clock cycle in X86. But make sure not to write AsMask(!condition), instead write ~AsMask(condition). Some X86 CPUs support ANDN := a & ~b as a 1 clock cycle instruction (instead of an additional mybool ^ 1)
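For completeness, a runnable C sketch of this mask idea (my own names; C has no bool-reinterpret issue, so the 0/1 int stands in for AsInt): negating 0/1 gives an all-zeros or all-ones mask, and the select then needs only AND, NOT and ADD, no multiplication.

```c
#include <assert.h>
#include <stdint.h>

/* -1 is all ones in two's complement, -0 is all zeros, so negating a
   0/1 "bool" turns it into a full-width selection mask. */
static int32_t as_mask(int b)              /* b is 0 or 1 */
{
    return -(int32_t)b;
}

/* (path1 & mask) + (path2 & ~mask): using ~mask instead of
   as_mask(!cond) mirrors the ANDN advice in the post above. */
static int32_t mask_select(int32_t path1, int32_t path2, int cond)
{
    int32_t m = as_mask(cond != 0);
    return (path1 & m) + (path2 & ~m);
}
```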
Bithacker off.

Nvm edit: As said, in this specific example a branch is much cheaper. A high-performance integer % operator inserts a branch checking whether the divisor is greater than the dividend in some implementations, and that is seriously faster than the single hardware divide instruction, depending on your data of course.
Additionally, sometimes doing the bool as int trick is not worth it or rather 1-2 clock cycles slower than a conditional move, which is what simple selection branches are usually compiled down to. They read from the flags register and replace a value or not based on that flag in the same clock cycle; the fast bool as int trick requires a SETxx instruction (move flag to byte register), negate, AND, +/OR to select between two values, etc.
You’ll always have to look at the assembly if you want to micro optimize this close to the hardware, there is no way around it. Otherwise you’ll have to trust the compiler; they’re usually absolute beasts. And if you do, you can write simple code all the way through and optimize the stuff that really gets in your way by analyzing the assembly and not just “removing branches”. Or you could write some kind of bithacking/math library which optimizes every single little thing, because you don’t know if that code is called within a triply-nested for loop, but I don’t know anyone crazy enough to do such a thing :wink:


A while ago I wrote this basics test for branching (see two posts down, where I have added select).
However, someone could expand it by running the latest bursted test again and using float4, for example, to support SIMD.
The Git source code is available there.