Benchmarking Burst/IL2CPP against GCC/Clang machine code (Fibonacci, Mandelbrot, NBody and others)

I was curious how well Burst/IL2CPP optimizes C# code against GCC/Clang with C, so I’ve ported five famous benchmarks, plus a raytracer, a minified flocking simulation, particle kinematics, a stream cipher, and radix sort, with different workloads and made them identical between the two languages. C code compiled with all possible optimizations using -DNDEBUG -Ofast -march=native -flto compiler options. Benchmarks were done on Windows 10 w/ Ryzen 5 1400 using standalone build. Mono JIT and RyuJIT are included for fun.

Source code and benchmark results are available on GitHub.

These wonderful people make open-source better:

4776608--522383--x.png

This project is sponsored by JetBrains.

4776608--578911--jetbrainslogo.png

20 Likes

Just easier to see the numbers with thousand commas.

1 Like

Ah, thanks. I was a bit lazy to format numbers manually.

private struct MandelbrotBurst : IJob {
        public uint width;
        public uint height;
        public uint iterations;
        public float result;

        public void Execute() {
            result = Mandelbrot(width, height, iterations);
        }

        private float Mandelbrot(uint width, uint height, uint iterations) {
            float data = 0.0f;

            for (int i = 0; i < iterations; i++) { // ideally you should have for loops broken up into jobs
                float  // should declare variables outside of loops (*1)
                    left = -2.1f,
                    right = 1.0f,
                    top = -1.3f,
                    bottom = 1.3f,
                    deltaX = (right - left) / width,  //invert this e.g.  invWidth = 1f / width; outside of the loop and multiply.
                    deltaY = (bottom - top) / height, //ditto for height.
                    coordinateX = left;

                for (int x = 0; x < width; x++) { // ideally you should have for loops broken up into jobs 
                    float coordinateY = top; // (*1)

                    for (int y = 0; y < height; y++) { // ideally you should have for loops broken up into jobs
                        float workX = 0;  // (*1)
                        float workY = 0; 
                        int counter = 0; 

// should use Burst Mathermatics.Sqrt not Math.Sqrt
                        while (counter < 255 && Math.Sqrt((workX * workX) + (workY * workY)) < 2.0f) { 
                            counter++;

// recalculating workx * workx and worky multiple times in the loop and conditional test.
                            float newX = (workX * workX) - (workY * workY) + coordinateX; //(*1)

                            workY = 2 * workX * workY + coordinateY;
                            workX = newX;
                        }

                        data = workX + workY;
                        coordinateY += deltaY;
                    }

                    coordinateX += deltaX;
                }
            }

            return data;
        }
    }

Just some thoughts on how you can optimise this code.

Note you really should break down the for loops into jobs for maximum throughput.

1 Like

Yes, such changes will have an impact. My main goal was to keep the code the same with C using only plain methods and loops with standard math, to see how compiler itself can optimize it for me without spending additional energy. Burst is optimized for tight loops in general, @xoofx already tried Mandelbrot test as far as I know, would like to hear his thoughts regarding this by the way.

Just checked the source code of Unity.Mathematics and math.sqrt() is the same standard System.Math.Sqrt().

Just add more information:

https://docs.unity3d.com/Packages/com.unity.burst@1.1/manual/index.html

TBH I think event if wasn’t implemented as intrinsics would be great to show how burst deal with already existing code/libs.

Thanks for benchmarking it.

[ ]'s

1 Like

Thanks, good to know this.

After inspecting an assembly code produced by GCC, found that to make it faster than Burst in the Fibonacci test, we can use -fno-asynchronous-unwind-tables to suppress the generation of static unwind tables for exception handling. 103,578,985 ticks down to 84,983,484, but other tests remain unaffected. Will add this as a note.

Can Burst optimise Sqrt?

A → (Burst) Mandelbrot: 33,005,292 ticks
B → (Burst) Mandelbrot: 6,495,352 ticks

Version B

private float Mandelbrot(uint width, uint height, uint iterations)
        {
            float data = 0.0f;

            float
                    left = -2.1f,
                    right = 1.0f,
                    top = -1.3f,
                    bottom = 1.3f,
                    deltaX = (right - left) / width,
                    deltaY = (bottom - top) / height,
                    coordinateX = left;

            float coordinateY;  // moved variables outside of loops
            float workX = 0;
            float workY = 0;
            int counter = 0;

            int x;
            int y;

            float newX;

            float workX2 = 0; // added variable for square values that are reused.
            float workY2 = 0;

            for (int i = 0; i < iterations; i++)
            {              

                for (x = 0; x < width; x++)
                {
                    coordinateY = top;

                    for (y = 0; y < height; y++)
                    {
                        workX = 0;
                        workY = 0;
                        counter = 0;

                        workX2 = 0;
                        workY2 = 0;

                        while (counter < 255 && Math.Sqrt((workX2 + workY2)) < 2.0f)
                        {
                            counter++;

                            newX = (workX2) - (workY2) + coordinateX;

                            workY = 2 * workX * workY + coordinateY;
                            workX = newX;

                            workX2 = workX * workX;
                            workY2 = workY * workY;
                        }

                        data = workX + workY;
                        coordinateY += deltaY;
                    }

                    coordinateX += deltaX;
                }
            }

            return data;
        }
    }
2 Likes

Actually I wonder if the Mandelbrot and NBody could gain from working with float2/double3’s as it might give the Burst compiler more SIMD options than just using basic variables.

Since it’s implemented as the intrinsic, Burst utilizes built into compiler function for square root.

Interesting, so nested stack allocations makes it significantly slower, gonna check how it works on my machine and compare it to GCC.

Absolutely, this will have an impact as well.

Apparently recursive fibonacci is the slowest way to do it, you might want to try the fastest fibonacci algorithm Fast Fibonacci algorithms

Also there is a SIMD optimised version of the mandelbrot set algorithm here → Performance in Mandelbrot Set Computation | Martin Ueding

The developer reports a doubling of performance between the C++ implementation and the SIMD one, in theory Burst should match or beat the SIMD version.

Doing fair benchmarks is hard. The first thing to be clear about is what does the benchmark try to benchmark?

  1. Taking completely unoptimised scalar code and see how well the auto-vectoriser and code optimisations work out.
  2. You want to write general purpose cross platform performant code. Are willing to spend the time to force the compiler to generate good code. As much as possible you are using explicit float4 SOA vectorised code. You want control.
  3. You are a SIMD performance expert, you understand the target platform and it’s assembly instructions and want to manually lay out exactly the exact instructions you want to run.

#1 burst is competitive in against C. There is still lots we can and will do. As you found there is benchmarks slower & faster. Generally it’s not a very predictable model. Very hard to find out why something is slow or fast. This is also the domain that most optimisation research in the C/C++ compiler space goes into. The biggest issue in this space for games is that a tiny change to the code can result in massive differences to performance due to non-predictable nature of this approach.

#2 is what burst is best at. Thats where we currently focus on. To a large extent this comes down to usability of the math library. Making it easy to write SOA SIMD code.

#3 is something where we want to add architecture specific intrinsics for to make this true

It seems like the benchmark is all about #1. So essentially it’s a benchmark for code that is not written to be optimised. Don’t get me wrong, most code out there is written exactly like that, so there is value in a compiler making it as fast as it can. But if you care about performance, thats not how you write code. So a benchmark purely focused on #1 doesn’t seem right.

4 Likes

@Arowx I intentionally used expensive algorithms for tests similarly to how compiler engineers are doing that. The goal is to get bare numbers that compiler could give you without spending human’s energy on optimizations.

@Joachim_Ante_1 Yes, I absolutely agree. Since the Burst itself is not a general-purpose compiler and at the tip of the iceberg, it’s a transpiler essentially which designed for specific use-cases. I’ll add more benchmarks that will cover various cases very soon, just for experiments. Thanks.

Added Sieve of Eratosthenes, GCC is 1% faster than Burst:

(Burst) Sieve of Eratosthenes: 43,449,732 ticks
(GCC) Sieve of Eratosthenes: 42,965,656 ticks
(Mono JIT) Sieve of Eratosthenes: 55,741,659 ticks

IL2CPP versions of those benchmarks might be interesting. I guess you’d probably just need to prevent burst compilation of the job code, and build an il2cpp version of the player.

Yes, I was thinking about this. I’m going to provide IL2CPP results very soon after adding tiny Pixar Raytracer.

Attempt at vectorising the NBody simulation…

// NBody

    private struct NBody
    {
        public double3 xyz, vxyz;
        public double mass;
    }

[BurstCompile]
    private unsafe struct NBodyBurst : IJob
    {
        public uint advancements;
        public double result;

        public void Execute()
        {
            result = NBody(advancements);
        }

        private double NBody(uint advancements)
        {
            NBody* sun = stackalloc NBody[5];
            NBody* end = sun + 4;

            InitializeBodies(sun, end);
            Energy(sun, end);

            while (advancements-- > 0)
            {
                Advance(sun, end, 0.01d);
            }

            Energy(sun, end);

            return sun[0].xyz.x + sun[0].xyz.y;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private void InitializeBodies(NBody* sun, NBody* end)
        {
            const double pi = 3.141592653589793;
            const double solarmass = 4 * pi * pi;
            const double daysPerYear = 365.24;

            unchecked
            {
                sun[1] = new NBody
                { // Jupiter
                    xyz = new double3(  4.84143144246472090e+00,
                                        -1.16032004402742839e+00,
                                        -1.03622044471123109e-01 ),
                    vxyz = new double3( 1.66007664274403694e-03 * daysPerYear,
                                        7.69901118419740425e-03 * daysPerYear,
                                        -6.90460016972063023e-05 * daysPerYear),
                    mass = 9.54791938424326609e-04 * solarmass
                };

                sun[2] = new NBody
                { // Saturn
                    xyz = new double3(8.34336671824457987e+00,
                    4.12479856412430479e+00,
                    -4.03523417114321381e-01),
                    vxyz = new double3(
                    -2.76742510726862411e-03 * daysPerYear,
                    4.99852801234917238e-03 * daysPerYear,
                    2.30417297573763929e-05 * daysPerYear),
                    mass = 2.85885980666130812e-04 * solarmass
                };

                sun[3] = new NBody
                { // Uranus
                    xyz = new double3(1.28943695621391310e+01,
                    -1.51111514016986312e+01,
                    -2.23307578892655734e-01),
                    vxyz = new double3(2.96460137564761618e-03 * daysPerYear,
                    2.37847173959480950e-03 * daysPerYear,
                    -2.96589568540237556e-05 * daysPerYear),
                    mass = 4.36624404335156298e-05 * solarmass
                };

                sun[4] = new NBody
                { // Neptune
                    xyz = new double3(1.53796971148509165e+01,
                    -2.59193146099879641e+01,
                    1.79258772950371181e-01),
                    vxyz = new double3(2.68067772490389322e-03 * daysPerYear,
                    1.62824170038242295e-03 * daysPerYear,
                    -9.51592254519715870e-05 * daysPerYear),
                    mass = 5.15138902046611451e-05 * solarmass
                };

                double3 v = new double3();

                for (NBody* planet = sun + 1; planet <= end; ++planet)
                {
                    double mass = planet->mass;

                    v += planet->vxyz * mass;                                     
                }

                sun->mass = solarmass;
                sun->vxyz = v / -solarmass;
            }
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private void Energy(NBody* sun, NBody* end)
        {
            unchecked
            {
                double e = 0.0;
                double imass;
                double3 ixyz, ivxyz;
                NBody* bj;
                NBody* bi;
                double jmass;
                double3 dxyz;

                for (bi = sun; bi <= end; ++bi)
                {
                    imass = bi->mass;
                    ixyz = bi->xyz;
                    ivxyz = bi->vxyz;

                    e += 0.5 * imass * (math.length(ivxyz) * math.length(ivxyz));

                    for (bj = bi + 1; bj <= end; ++bj)
                    {
                        jmass = bj->mass;

                        dxyz = ixyz - bj->xyz;
                      
                        e -= imass * jmass / Math.Sqrt(math.dot(dxyz,dxyz));
                    }
                }
            }
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private double GetD2(double dx, double dy, double dz)
        {
            double d2 = dx * dx + dy * dy + dz * dz;

            return d2 * Math.Sqrt(d2);
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private void Advance(NBody* sun, NBody* end, double distance)
        {
            unchecked
            {
                double3 ixyz, ivxyz, dxyz;
                double jmass, imass, mag;
                NBody* bi, bj;

                for (bi = sun; bi < end; ++bi)
                {
                    ixyz = bi->xyz;
                    ivxyz = bi->vxyz;

                    imass = bi->mass;

                    for (bj = bi + 1; bj <= end; ++bj)
                    {
                        dxyz = bj->xyz - ixyz;
                        jmass = bj->mass;
                        mag = distance / math.lengthsq(dxyz);

                        bj->vxyz = bj->vxyz - dxyz * imass * mag;
                        ivxyz = ivxyz + dxyz * jmass * mag;
                    }

                    bi->vxyz = ivxyz;
                    bi->xyz = ixyz + ivxyz * distance;
                }

                end->xyz = end->xyz + end->vxyz * distance;                             
            }
        }
    }

My theory is that the use of double3 should allow Burst to vectorise more operations??