HPC# vs .NET Core 3 Benchmark

After watching one of the talks on Burst compilation from Unite Copenhagen, I became a bit suspicious of the performance gains being presented so I decided to run some simple benchmarks of my own as a sanity check (see details below).

While I did find a significant performance difference between Burst-compiled C# and standard C# in Unity (as showcased at Unite), the difference between Burst-compiled C# and standard C# compiled with .NET Core 3.0 was negligible - in fact, Burst-compiled C# was slower until safety checks were disabled.

Needless to say, these findings were pretty disappointing and have left me feeling somewhat misled by the folks at Unity. While I appreciate the coordinated push towards data-oriented design, it would appear that “performance by default” is something the language already offers out of the box - just not within the development environment provided by Unity.

In any case, I thought I’d share the results here in case my benchmarks are deeply flawed in some way that I’ve completely overlooked. Given that I’ve spent the last few weeks working on Burst implementations of performance sensitive algorithms, I sincerely hope they are.

Details

The benchmark performs 1 million axpy operations on single precision 3-dimensional vectors.
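For reference, a single axpy step here amounts to one multiply-add per vector component. A minimal sketch in plain C# (the names and exact shape are my guesses based on the discussion below, not the original benchmark code):

```csharp
// Hypothetical sketch of the benchmarked kernel: Result[i] = A[i] * X[i] + Y[i],
// where A, Y and Result hold 3-component single-precision vectors and X holds
// per-element scalars.
struct Float3
{
    public float x, y, z;
    public Float3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }
}

static Float3[] Axpy(Float3[] a, float[] x, Float3[] y)
{
    var result = new Float3[a.Length];
    for (int i = 0; i < a.Length; i++)
    {
        // Broadcast the scalar x[i] across all three components.
        result[i] = new Float3(
            a[i].x * x[i] + y[i].x,
            a[i].y * x[i] + y[i].y,
            a[i].z * x[i] + y[i].z);
    }
    return result;
}
```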

5080205--500525--axpy-bench-summary.png

Potentially(?) relevant Unity settings:

  • Project Settings
      • Burst AOT settings
          • Optimizations: enabled
          • Safety checks: disabled
          • Burst compilation: enabled
      • Player settings
          • Scripting runtime version: .NET 4.x Equivalent
          • Scripting backend: IL2CPP
          • C++ compiler configuration: Release
  • Preferences
      • External Tools
          • Editor Attaching: disabled

  • Edit: Add table
  • Edit: Update benchmarks with “Editor Attaching” disabled
2 Likes

You really need to benchmark using the same method to record performance, especially when your benchmark is very short (2ms).

Anyway, nxrighthere has already built a thorough suite of benchmarks comparing Burst to a range of different compilers (GCC, Clang, IL2CPP, Mono JIT, RyuJIT). Thread over here: Benchmarking Burst/IL2CPP against GCC/Clang machine code (Fibonacci, Mandelbrot, NBody and others)

5 Likes

Thanks, should’ve known someone had already done a deep dive on this. Looks like a much more useful/complete comparison.

That said, these don’t seem to capture the massive performance difference between Burst-compiled C# and standard C# when writing code inside Unity.

What concerns me here is that this difference is being presented as a feature but, given the performance of the language outside of Unity, I see it more as a bug in that there appears to be a significant performance penalty for writing non-Burst C# in Unity.

1 Like

The original Unity implementation is the Mono JIT one in the table, so it does show the performance difference. The Mono version used by Unity is ancient (based on version 2.7 or something like that, IIRC) and the quality of the actual CPU instructions produced by that version’s JIT compiler is somewhere between abysmal and laughable. So yes, that’s definitely slower than what you’d get when running C# on the most recent versions of Mono or RyuJIT :slight_smile:

1 Like

Additionally, while nxrighthere’s benchmark suite covers real-world use cases that do exist, it does not cover the ones most important for performance in games. The benchmark is not what Burst is currently optimized towards. Burst is optimized for writing SIMD code, making that process simpler & more performant.

None of the examples in the benchmark showcase that. It is difficult to compare to other languages because it requires a combination of math library & compiler…

I am not trying to take a piss on the benchmark. It is definitely real. There is plenty of code that is scalar & relies on auto-vectorization to get faster. And Burst is generally on par or better than C++ at it, and significantly better than RyuJIT at it, but it’s not been the primary focus.

What we primarily want to enable with burst is writing high performance SIMD code. Auto vectorization is nice to have, but not the primary goal of burst. Primarily because we want to make sure that programmers can control performance & not get it accidentally.

This is difficult to benchmark because there is no apples-to-apples comparison here. Really, it comes down to making it easy to write high-performance code with a good math library supported by a compiler. This is something that doesn’t exist anywhere else.

Do note that publishing to the new DOTS Runtime, which we will ship in preview in the next couple of months, supports .NET Core as the deployment target. Additionally, if you compare CPU performance, we generally put all effort into optimizing normal OO code in IL2CPP, not in Mono. The expectation is that for final builds our users should use IL2CPP if performance is a concern in any shape or form…

10 Likes

Dude… You are supposed to turn them off during a benchmark. They only exist for development-time purposes.

2 Likes

In the benchmark, RyuJIT is sometimes faster than or equal to IL2CPP. But IL2CPP is not very well suited for games with mod support.

3 Likes

Ah, thanks - that’s very clear.

Doesn’t this make some of the live Burst demonstrations feel even more misleading though? It’d be like if I chose to demonstrate how fast my car was by racing it against a go-kart…

1 Like

AFAICT, safety checks in this particular benchmark amount to indexer bounds checks on NativeArray*. These are always performed on managed arrays in C# but can be optimized away by the compiler in some circumstances. Seeing as how this is an attempt to compare compilers, one might argue that it’s more fair to keep the safety checks on then. In any case, I gave Burst the benefit of the doubt and turned them off in the end so not sure what the issue is.

  • Please correct me if I’m wrong here

** Edit: Add link
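To illustrate the “some circumstances” above: my example, not the benchmark code, but RyuJIT can typically drop the bounds check only when it can prove the index stays within range, i.e. when the loop bound is the array’s own Length:

```csharp
// Bounds check is usually eliminated here: the JIT can prove i < data.Length.
static float SumElided(float[] data)
{
    float sum = 0f;
    for (int i = 0; i < data.Length; i++)
        sum += data[i];
    return sum;
}

// Here 'count' is not provably within range, so each access keeps its check
// (compare the generated JIT Asm for both on SharpLab).
static float SumChecked(float[] data, int count)
{
    float sum = 0f;
    for (int i = 0; i < count; i++)
        sum += data[i];
    return sum;
}
```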

You’re welcome :slight_smile: I guess in the end it depends on who you are talking to: If you have been working with Unity for a few years now, these speed-ups are very real and very relevant. I agree that this perspective is probably rightfully different for someone coming from a different background. I’d say that some context is implied when Burst is mentioned, namely that Burst gives you much better performance while still using Unity. On that ground, that’s still very exciting :slight_smile:

I also agree with Joachim here that the story is more nuanced, really: It’s not just this performance gain, it’s also cross-platform SIMD, having a compiler that knows about Unity, can do aliasing analysis, owning the whole toolchain etc.
Unfortunately, I only know a handful of people that get as excited about cross-platform SIMD :smile: whereas more performance for Unity NOW is something most people immediately get and is probably a better marketing move :wink:

I’ve taken a quick look at your benchmark (thanks for posting! :)) and there are some caveats – maybe you already caught those since I think they are also mentioned in the benchmarking thread?

  • consider using synchronous compilation for Burst (you can specify that via an argument to the BurstCompile attribute) since you otherwise might also benchmark compilation
  • in the serial example, consider using Run() instead of Schedule().Complete() since the latter will also go through the job system (unnecessarily so)
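Concretely, the two suggestions together might look like this. A sketch only: the job name and fields are my invention, though `[BurstCompile(CompileSynchronously = true)]` and `Run()` are the actual Unity APIs being referred to:

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

// CompileSynchronously ensures the job is Burst-compiled before its first
// execution, so the benchmark doesn't accidentally time async compilation.
[BurstCompile(CompileSynchronously = true)]
struct AxpyJob : IJob
{
    public float A;
    [ReadOnly] public NativeArray<float> X;
    public NativeArray<float> Y;

    public void Execute()
    {
        for (int i = 0; i < Y.Length; i++)
            Y[i] = A * X[i] + Y[i];
    }
}

// Serial case: Run() executes immediately on the calling thread, skipping
// the job-system scheduling overhead of Schedule().Complete():
//   new AxpyJob { A = a, X = x, Y = y }.Run();
```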

I’m not 100% sure whether that will change your benchmarks, since you said you were merely eyeballing them from the profiler, so maybe those pitfalls don’t apply.

1 Like

As far as I know all safety checks are stripped away if you turn them off.

Sounds great! How limited will it be out of the box in terms of extensibility? Will we be able to use the full set of C# features, managed objects, and third-party libraries?

One of the use cases we want to enable is that you might want to have a Kestrel web server but drive world creation & loading from there. So you can be in charge of the main loop from your own C# code, which opens up DOTS simulation code to a whole category of different uses on servers etc.

7 Likes

Yeah, I can appreciate this. I’ve been using Unity for some time now but I suppose that my current context is somewhat unusual i.e. developing external libraries for use across a few different platforms (one of which being Unity). It seems like my options here are limited to (1) include Burst-compatible implementations of certain algorithms and data structures or (2) accept the performance regression that comes with using external libraries in Unity due to Mono JIT.

Thanks for taking a look. Making these changes appears to have slightly improved the serial implementation but it’s hard to say for sure without taking more rigorous measurements. In any case, these sorts of details are good to know about so thanks for the feedback.

An additional note for the Mono JIT benchmark: disabling “Editor Attaching” (as suggested in this blog post) makes a pretty significant difference so I’ve updated the summary to reflect this.

Mhm, I don’t think there really are other options if you want to stick to C#. If you are aiming for high performance, you should also keep in mind that the performance characteristics of Unity’s Mono version can be quite different from the latest .NET Core runtimes; there’s a different GC in place and Unity still lacks support for Span<T> etc., AFAIK.

This is probably not all that relevant to you anymore, but I’ve taken another close look at your benchmark and wanted to share some findings :slight_smile: First, I changed your code to use 4-wide vectors (nicer alignment). Then I put the benchmarking code into SharpLab online to see what the latest RyuJIT would do to it (select JIT Asm and Release on the right-hand side). The relevant code for the parallel version is the very last function, AxpyBenchmark+<>c__DisplayClass11_0.<AxpyParallel>b__0(System.Tuple`2<Int32,Int32>).
I notice the following things:

  • There is a redundant vzeroupper in the parallel implementation (note how all latter SSE instructions use the VEX prefix). The story here is that XMM registers extend into YMM registers and the old SSE instructions aren’t aware of that, so newer processors with AVX support (~85% of users on Steam, see Other Settings at the bottom of this page) pay a penalty for intermixing old SSE instructions and new YMM instructions, since the processor needs to ensure that the upper bits of the register are unaffected by the legacy SSE instructions. vzeroupper gets rid of that, but so does using the new VEX-prefixed instructions (like vmovss instead of movss). See this post for someone much smarter than me explaining the issue. Now, calling this vzeroupper instruction is dirt cheap for all I know, but I’m bringing it up because I want to get back to it later.
  • The output lacks any form of vectorization, but has detected that AVX is available (since it’s using SSE instructions with VEX prefix). For example, this is the load of A and the multiplication with X:
    L0032: vmovss xmm0, dword [edi]
    L0036: vmovss xmm1, dword [edi+0x4]
    L003b: vmovss xmm2, dword [edi+0x8]
    L0040: vmovss xmm3, dword [edi+0xc]
    L0045: mov edi, [ecx+0xc]
    L0048: cmp edx, [edi+0x4]
    L004b: jae L0105
    L0051: vmovss xmm4, dword [edi+edx*4+0x8]
    L0057: vmulss xmm0, xmm0, xmm4
    L005b: vmulss xmm1, xmm1, xmm4
    L005f: vmulss xmm2, xmm2, xmm4
    L0063: vmulss xmm3, xmm3, xmm4
  • The compiler did not eliminate the bounds checks on the array accesses, not even in the serial version. This is easy to spot for this compiler because there is a call plus trap-to-debugger interrupt at the very end of the compiled procedure; the cmp edx, [edi+0x4] followed by jae L0105 in the piece above is exactly one such bounds check plus a jump to the exceptional path.

So there is room for improvement here :slight_smile:

Let’s look at the Burst-compiled version. I have again modified it to use float4 instead of float3. I’m using Burst 1.2 preview-6, though I get the same result with Burst 1.1.2. You can look at the code yourself by using the Burst inspector from the Jobs/Burst toolbar item. I have disabled safety checks and enabled fast math (the latter setting does not affect code generation in this specific example but is generally helpful because it allows the compiler to reorder floating-point operations more freely). The instruction set is set to auto, since this is also what the AOT code generation will use. This seems to use SSE4.x instructions and have no support for AVX (which makes sense when you build for x64 Windows, since that is guaranteed to have some SSE support).

  • The core loop boils down to this:
.LBB0_5:
        movups  xmm0, xmmword ptr [rdx + 4*rcx]
        movss   xmm1, dword ptr [rbx + rcx]
        shufps  xmm1, xmm1, 0
        mulps   xmm1, xmm0
        movups  xmm0, xmmword ptr [rbp + 4*rcx]
        addps   xmm0, xmm1
        movups  xmmword ptr [rsi + 4*rcx], xmm0
        add     rcx, 4
        dec     rax
        jne     .LBB0_5

This is quite a bit shorter than the core part from the other compiler, though brevity does of course not always equal speed. This procedure is using vector instructions to do the actual computation: note the mulps, addps instructions (ps = packed single, ss = scalar single). The single scalar instruction (movss) is the load of X; the shufps right after that is broadcasting the value into the full XMM register.

  • Irritatingly, Burst (rather, LLVM) decided to emit movups instructions instead of movaps, even though the system should be able to ensure that those float4s are 16-byte aligned.
  • All instructions are pure SSE without any VEX prefix. This of course depends on the environment, but on my machine I might be paying for those if there is any, say, driver code that felt clever for using my AVX2-enabled processor. Not sure what the right solution to this is (since vzeroupper can only be used when you already know that you have AVX), or whether this is even a problem (I’d be curious to see whether a call to vzeroupper in there would improve things ever so slightly?)

Of course, the proof of the pudding is in the eating, and ultimately it comes down to execution times, not a literary review of the generated assembly. So why isn’t Burst faster in this case? Well, I’ve only had a cursory look at it using VTune, and it claims to be memory bound on my machine. At least on my machine, I can substitute A with math.sqrt(A) for free, at no additional runtime cost. Similarly, I get a quite dramatic speedup when I write back into A instead of Result, even though the only difference in the assembly is that the store movups xmmword ptr [rsi + 4*rcx], xmm0 uses rdx instead of rsi.

So that was a lot of fun :slight_smile: I guess the lesson here is that sometimes benchmarks don’t measure what you expect them to measure. This benchmark can probably only distinguish very bad code generation from somewhat decent code generation, but beyond that it won’t help: the difference in theoretical latency of the generated instructions will be dwarfed by memory access times.
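To put a rough number on “memory bound” (back-of-the-envelope figures for the float4 variant, not measurements):

```csharp
// Per iteration of the float4 axpy loop:
//   reads : 16 B (A) + 4 B (scalar X) + 16 B (Y)
//   writes: 16 B (Result)
//   FLOPs : 4 multiplies (mulps) + 4 adds (addps)
const int bytesPerIter = 16 + 4 + 16 + 16;  // 52 bytes of memory traffic
const int flopsPerIter = 4 + 4;             // 8 floating-point operations
double intensity = (double)flopsPerIter / bytesPerIter;  // ~0.15 FLOP/byte
// A desktop CPU needs on the order of several FLOPs per byte of DRAM traffic
// to keep its vector units busy, so at ~0.15 this loop is limited by memory
// bandwidth, which is why a cheap extra op like sqrt comes out "free".
```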

Edit: I have repeated the measurements using float3 and the same conclusions apply. It’s not enough computation per read/write.

4 Likes

So in theory our Burst code could waste compute cycles and bandwidth by not doing enough processing on the data?

Does this mean there will be a sweet spot between data size and code size that will maximise throughput for Burst code on different hardware, and that without using a tool like VTune we have no way of knowing how efficiently our code is using the CPU and available memory bandwidth?

I had assumed that with a DOTS-based approach, once you have written a small library of atomic systems that cover most games, it would just be a matter of interlinking those systems and adding some meta-game code, and you would be able to write any game.

It sounds like a flaw in going for small, atomic systems will be available memory bandwidth; are there any ways to detect this issue in the Unity Profiler?

Admittedly, I thought that as there are only about 20-50 operations that can be performed on a CPU, if you write a system for +, -, *, /, &, |, !, vector ops, etc., then you could make any super-fast program using a range of 20-50 atomic DOTS systems (meta-programming).

Just bumping this since I looked it up for another thread.

I feel like the real comparison shouldn’t be vs Burst, but rather in speeding up the main thread, which is likely the bottleneck even in many DOTS-enabled games. A lot of the core logic (and especially load-time logic) lives in the main thread. Jobifying things takes a lot of effort (engineering time/dollars), and older code/asset store stuff is likely not jobified.

An order-of-magnitude faster scripting runtime/GC on the main thread would help out everybody, and even in games with heavy use of Burst would improve editor performance and initialization times.

Burst is incredible for SIMD math and “embarrassingly parallel” problems, but there’s a lot of code that doesn’t fit that model.

Here’s a post from 2018 where a Unity engineer did it for a hackathon(?) and describes some of the challenges: Porting the Unity Engine to .NET CoreCLR | xoofx

1 Like