[Bug] Burst performance issue with specific code

This is a very weird, very specific burst performance issue.

I was doing some benchmarks comparing a fixed point library against float, and after I turned off a bunch of checks the fixed point library suddenly performed much worse, even though it was executing less code.

I simplified this down to a basic performance test and it pretty much comes down to this.

This

                        var product = (long)this.Input[i] * this.Input[j];
                        var result = product >> 12;
                        this.Output[i] = (int)result;

executes 8x slower than

                        var product = (long)this.Input[i] * this.Input[j];
                        var result = product >> 12;

                        // 1 extra line of code which speeds up job 8x
                        result += (product & IntSignBit) >> (12 - 1);

                        this.Output[i] = (int)result;

Which makes no sense to me: the second version is the exact same code plus extra operations (add, sub, shift, and). I'd expect it to run slightly slower, not nearly an order of magnitude faster.
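
For context, the extra line is the library's rounding step: IntSignBit is bit 11, the highest bit that the >> 12 throws away, so the term adds 1 exactly when the discarded fraction is one half or more. A quick illustration of what it computes (input value made up for the example):

// Illustration only: Shift = 12 and IntSignBit = 1 << 11 = 2048, as in the snippets above.
long product = 6144;                  // 1.5 in Q12 fixed point
var result = product >> 12;           // 1 (plain truncation)
result += (product & 2048) >> 11;     // +1, because the dropped "half" bit is set
// result == 2: the shift now rounds to nearest instead of truncating.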

I have repeated this dozens of times - tweaking things, running the tests in different orders, upgrading Burst (this is running on the latest package now, same result), forcing recompiles - and the results are the same.

Results:

Full Source:

The burst jobs

[BurstCompile]
private struct Test1Job : IJob
{
    public NativeArray<int> Input;
    public NativeArray<int> Output;

    public void Execute()
    {
        for (var i = 0; i < Count; i++)
        {
            for (var j = 0; j < Count; j++)
            {
                var product = (long)this.Input[i] * this.Input[j];
                var result = product >> 12;
                this.Output[i] = (int)result;
            }
        }
    }
}

[BurstCompile]
private struct Test2Job : IJob
{
    public NativeArray<int> Input;
    public NativeArray<int> Output;

    private const int IntSignBit = 1 << (12 - 1);

    public void Execute()
    {
        for (var i = 0; i < Count; i++)
        {
            for (var j = 0; j < Count; j++)
            {
                var product = (long)this.Input[i] * this.Input[j];
                var result = product >> 12;

                // 1 extra line of code which speeds up job 8x
                result += (product & IntSignBit) >> (12 - 1);

                this.Output[i] = (int)result;
            }
        }
    }
}

The tests

[Test]
[Performance]
public void Test1()
{
    var input = new NativeArray<int>(Count, Allocator.TempJob);
    var output = new NativeArray<int>(Count, Allocator.TempJob);

    for (var i = 0; i < Count; i++)
    {
        input[i] = Random.Range((int)MultiMin, (int)MultiMax);
    }

    Measure.Method(() =>
        {
            new Test1Job
                {
                    Input = input,
                    Output = output,
                }
                .Schedule().Complete();
        })
        .Run();

    input.Dispose();
    output.Dispose();
}

[Test]
[Performance]
public void Test2()
{
    var input = new NativeArray<int>(Count, Allocator.TempJob);
    var output = new NativeArray<int>(Count, Allocator.TempJob);

    for (var i = 0; i < Count; i++)
    {
        input[i] = Random.Range((int)MultiMin, (int)MultiMax);
    }

    Measure.Method(() =>
        {
            new Test2Job
                {
                    Input = input,
                    Output = output,
                }
                .Schedule().Complete();
        })
        .Run();

    input.Dispose();
    output.Dispose();
}

Please tell me I’ve done something stupid. I can’t see it though.

-edit-

safety system off
leak detection off
job debugger off
burst compilation on
synchronous compilation on

Submitted a bug report: 1186429 with this test.

I can reproduce - basically the second job is getting vectorized while the first isn’t - don’t have a handle on why yet though!

So our compiler backend, LLVM, decides for the first job that 'Vectorization is possible but not beneficial' (which obviously isn't the case!).

This is an annoying issue that we don't have a good solution for yet - we want to allow our users to get guaranteed loop vectorization when they want it (even if it actually degrades performance!), but we don't have a handle on exactly how to expose that.
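
To sketch what that opt-in could look like from user code, here is a minimal example written against the Loop.ExpectVectorized() hint in Unity.Burst.CompilerServices - treat its availability as an assumption about the Burst version in use; it is not something confirmed anywhere in this thread:

using Unity.Burst.CompilerServices;

// ...inside Test1Job.Execute() from the repro above:
for (var i = 0; i < Count; i++)
{
    for (var j = 0; j < Count; j++)
    {
        // Assumed hint: ask the compiler to raise an error if this inner loop is not
        // vectorized, so the "possible but not beneficial" decision surfaces at
        // compile time instead of as a silent 8x slowdown.
        Loop.ExpectVectorized();

        var product = (long)this.Input[i] * this.Input[j];
        this.Output[i] = (int)(product >> 12);
    }
}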

Good quick response, thanks! I’m glad I’m not going insane. I’m pretty certain I checked around 15 times that I had scheduled Job1 in Test1, Job2 in Test2.

So I started to look at this again while trying to work out why vectorization wasn't happening, and I suddenly realised that the inner loop could basically be folded away to nothing if the compiler were smart enough, because every iteration overwrites the same location in Output:

for (var i = 0; i < Count; i++)
{
    for (var j = 0; j < Count; j++)
    {
        var product = (long)this.Input[i] * this.Input[j];
        var result = product >> 12;

        // All loop iterations will replace the contents of this variable with a new value, so the loop is redundant!
        this.Output[i] = (int)result;
    }
}

Were you intending this? If I make Output sized Count * Count and store into this.Output[i * Count + j], the code is vectorized like I would expect.
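
Roughly, the adjusted job looks like this (same job as the repro above, with only the Output length and the store index changed - a sketch, not a drop-in replacement):

[BurstCompile]
private struct Test1Job : IJob
{
    public NativeArray<int> Input;   // length Count
    public NativeArray<int> Output;  // now allocated with length Count * Count

    public void Execute()
    {
        for (var i = 0; i < Count; i++)
        {
            for (var j = 0; j < Count; j++)
            {
                var product = (long)this.Input[i] * this.Input[j];
                var result = product >> 12;

                // Each (i, j) pair now writes a distinct element, so the store is no
                // longer loop-invariant and LLVM vectorizes the inner loop.
                this.Output[i * Count + j] = (int)result;
            }
        }
    }
}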

So this was mostly just a simple repro, and I used a couple of loops so I could actually benchmark it. Sorry if my repro was a bit misleading, but the issue actually came from a single method and it popped up in all sorts of circumstances.

I stopped using this library, but pulling it back out of my repos, I think it came from this method.

public static fix operator *(fix x, fix y)
{
    var product = (long)x.rawValue * y.rawValue;

#if !FIXMATH_NO_OVERFLOW
    // The upper X bits should all be the same (the sign).
    var upper = (uint)(product >> MultiSign);
#endif

#if !FIXMATH_NO_OVERFLOW || !FIXMATH_NO_ROUNDING
    if (product < 0)
    {
#if !FIXMATH_NO_OVERFLOW
        if (~upper != 0)
        {
            return Overflow;
        }
#endif

        // This adjustment is required in order to round -1/2 correctly
#if !FIXMATH_NO_ROUNDING
        product--;
#endif
    }
    else
    {
#if !FIXMATH_NO_OVERFLOW
        if (upper != 0)
        {
            return Overflow;
        }
#endif
    }
#endif

#if FIXMATH_NO_ROUNDING
    return new fix((int)(product >> Shift));
#else
    var result = product >> Shift;
    result += (product & IntSignBit) >> (Shift - 1);

    return new fix((int)result);
#endif
}

After stripping it all back and testing it independently, I found it was just this code right at the bottom, which matches the work done in the repro above.

#if FIXMATH_NO_ROUNDING
    return new fix((int)(product >> Shift));
#else
    var result = product >> Shift;
    result += (product & IntSignBit) >> (Shift - 1);

    return new fix((int)result);
#endif

Basically, my jobs always executed noticeably faster when FIXMATH_NO_ROUNDING was not defined, even though that path does more operations.
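
To isolate the two paths outside the #if blocks, they boil down to something like the following sketch (constants taken from the repro above; the real fix type's Shift and IntSignBit may differ):

private const int Shift = 12;                     // as in the repro above
private const int IntSignBit = 1 << (Shift - 1);  // highest bit dropped by the shift

// FIXMATH_NO_ROUNDING defined: plain truncating shift - the variant that ran slower for me.
private static int MulTruncate(long product) => (int)(product >> Shift);

// FIXMATH_NO_ROUNDING not defined: round-to-nearest - extra work, but the variant that
// consistently ran noticeably faster in my jobs.
private static int MulRound(long product)
{
    var result = product >> Shift;
    result += (product & IntSignBit) >> (Shift - 1);
    return (int)result;
}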
