Why is x86 faster than x86_64 when playing a standalone build?

I finally did a build of my game but ran into a strange problem. I initially selected x86_64 as the architecture, because I was told that you get a slight performance increase using the 64bit version vs the regular 32bit one. Then I played the game and found that it was running slower in the standalone build than it was in the Unity editor.

I figured it was the large background image that was moving with a parallax factor of .02, so I disabled the background and tried it again. This time the game ran fine. However, I decided to test with a much smaller image (128 x 128) stretched across the background to see if it was the large image causing the problem. Even with such a small image, the game ran slower than it did in the editor. Then, for the sake of experimentation, I did a build using the 32bit architecture instead. It ran without any of the major fps hiccups the 64bit build was having.

After some more experimenting, I found that if any image filled the background, regardless of its resolution, the game ran slower in the 64bit build. The 32bit build did not have this problem.

This seems counterintuitive to me. From what I understand, 64bit allows you to use more memory for an application. Why would a standalone build run slower in 64bit mode as opposed to 32bit mode? Is there something I am missing here? Are there times that 32bit will run a standalone build better than 64bit (assuming the computer can run both)? If anyone can shed some light on this issue, it would be greatly appreciated. I searched Google, the forums, etc. all day and could not find an answer.

First, let me get the ‘disclaimer’ out of the way. There are myriad specifics that could be related, and since I don’t have your project on my machine, I can’t evaluate what you’re seeing specifically. There may be one or two simple things that could be changed that would completely alter your observations on any particular build. Beyond that, there are potential issues of performance in C# in the two CPU modes which I can’t completely account for, and may depend on how you build the C# portion of your application (which compiler, for example). There are also many permutations of RAM and cache configuration/speed that alter the comparison.

I have no data to prove this, but in the early releases of 64 bit AMD CPUs, there was considerable attention paid to making sure there was no performance penalty running in 32 bit mode. As both Intel and AMD have moved forward they may have paid less attention to 32 bit mode performance, so some CPUs may have superior 32 bit performance compared to others relative to their 64 bit modes.

Now, with that out of the way, I can move on to the more general topic of comparing 32 and 64 bit performance differences.

While it is obvious that the CPU registers of the 32 bit mode are 32 bits wide, and the 64 bit mode registers are twice as wide, what may not be as obvious is that some of the code generated per instruction is larger, especially where parameters (operands and addresses) are involved. The Intel (and AMD) CPUs actually switch modes internally to run 32 bit code. It isn’t merely a matter of running different code; the CPU switches ‘personality’ to run 32 bit code on a 64 bit operating system. The resulting 64 bit binary code is larger, though not double the size of its 32 bit counterpart. There is, however, a cost imposed where code must be pumped from RAM into the CPU for execution, and while the cache mitigates that a great deal, it is one means by which 64 bit is expected to impose a performance penalty. Beyond code size is the raw demand for data moved from RAM into the CPU: pointers, and anything else sized to match the 64 bit registers, take twice the volume.
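As a rough illustration of the data side of that cost, here is a minimal C++ sketch. The `Node` type and both helper functions are hypothetical, invented only for this example, and the 64 byte cache line is an assumption that holds on many desktop CPUs but not all. Compiled for a typical 32 bit target the struct is usually 16 bytes; for a typical 64 bit target it is usually 32, so half as many fit in each cache line.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scene-graph node, used only for illustration.
struct Node {
    Node* parent;        // 4 bytes in a 32 bit build, 8 bytes in a 64 bit build
    Node* first_child;
    Node* next_sibling;
    std::int32_t id;     // 4 bytes in both builds
};

// Bytes needed to store `count` nodes contiguously.
inline std::size_t footprint(std::size_t count) {
    return count * sizeof(Node);
}

// How many whole nodes fit in one cache line.
// 64 bytes is an assumed, commonly seen line size.
inline std::size_t nodes_per_cache_line(std::size_t line_bytes = 64) {
    assert(line_bytes > 0);
    return line_bytes / sizeof(Node);
}
```

Calling `footprint(100000)` from a small `main` in both a 32 bit and a 64 bit build shows the difference directly. The data itself (the `id`) did not grow, only the pointers and padding around it.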

Countering this, however, is the fact that in the 64 bit mode there are more registers. The 32 bit mode only offers EAX through EDX as general purpose registers, while EDI and ESI are used for various memory referencing. Registers devoted to the stack, stack base, instruction pointer, etc. are off limits for calculations. In the 64 bit mode, however, there are 16 general purpose registers, so the potential for code optimization is much greater. Those can’t always be used to considerable performance gain, but on occasion they can be a powerful way to recoup performance lost to the larger instruction weight described above.

When 32 bit applications run on a 64 bit machine, they use a 32 bit interface for the APIs involved, and that includes the graphics API. Where the 32 bit API packs more ‘work’ into less memory bandwidth, there can be a slight performance advantage to 32 bit code. This is minimal, but some very focused application demands may see outsized gains, and the graphics API might qualify under certain circumstances.

C# defines integers as 32 bits (longs are 64) on every target, so it is possible that under certain code patterns a 64 bit build pays a small penalty for working with values narrower than its registers. In native code (say C++), the mainstream compilers also keep int at 32 bits on 64 bit builds (it is pointers, and on some platforms long, that grow to 64 bits), but the C# runtime may impose an unusual impact on 64 bit targets that causes code to perform better in 32 bit builds. This could very easily be subject to a wide range of odd configuration boundaries, meaning that it may be true on one machine and not true on another. There can be odd intersections of machine level timing between the way RAM feeds data to the CPU and how it is chopped up when data is smaller than the native CPU architecture. I would argue, with only a few studies on the subject, that faster RAM alleviates this somewhat, so a machine with slightly older RAM technology may be more subject to a performance difference related to this point than newer ones.
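If you want to check what a native compiler actually does with integer widths on your target, a small C++ sketch like the following reports it. The helper names `bits` and `data_model_summary` are made up for this example. On the mainstream 64 bit targets I am aware of, int stays at 32 bits, pointers are 64 bits, and long is 32 bits on Windows but 64 bits on Linux and macOS; treat that as something to verify on your own toolchain rather than a guarantee.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <string>

// Width in bits of type T on the platform this is compiled for.
template <typename T>
constexpr int bits() {
    return static_cast<int>(sizeof(T) * CHAR_BIT);
}

// One-line summary of the data model, e.g. for logging at startup.
inline std::string data_model_summary() {
    return "int=" + std::to_string(bits<int>()) +
           " long=" + std::to_string(bits<long>()) +
           " long long=" + std::to_string(bits<long long>()) +
           " pointer=" + std::to_string(bits<void*>());
}

// Fixed-width types keep their size in every build, much like C#'s int and long.
static_assert(bits<std::int32_t>() == 32, "int32_t is exactly 32 bits");
static_assert(bits<std::int64_t>() == 64, "int64_t is exactly 64 bits");
```

Printing `data_model_summary()` from both a 32 bit and a 64 bit build makes it clear which types changed size and which did not.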

We do generally expect to see 64 bit applications run faster than 32 bit applications when built entirely out of native code, with rare exceptions, but not by much unless the application makes use of 64 bit registers. For example, if code used two steps to calculate and use a 64 bit integer result on a 32 bit target, it would take longer than the same code that requires only 1 step on a 64 bit build. Per-pixel image processing, for example (running through all pixels to adjust brightness is a very simple case), should run much faster on 64 bit machines when the algorithm is fashioned to take advantage. CRC calculations similarly can benefit considerably. However, very branchy code performs about the same, and code that takes no advantage of the larger registers can run more slowly overall.
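Both examples in that paragraph can be sketched in a few lines of C++. This is only an illustration under my own assumptions, not what any particular compiler or engine does: `add64_in_two_steps` spells out by hand the low-half, carry, high-half sequence that a 32 bit target needs for a 64 bit add, and `halve_brightness` is a hypothetical routine that darkens 8 bit grayscale pixels eight at a time by treating them as one 64 bit word.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// 64 bit addition done the way a 32 bit target has to do it:
// add the low halves, then add the high halves plus the carry.
inline std::uint64_t add64_in_two_steps(std::uint64_t a, std::uint64_t b) {
    std::uint32_t a_lo = static_cast<std::uint32_t>(a);
    std::uint32_t a_hi = static_cast<std::uint32_t>(a >> 32);
    std::uint32_t b_lo = static_cast<std::uint32_t>(b);
    std::uint32_t b_hi = static_cast<std::uint32_t>(b >> 32);

    std::uint32_t lo = a_lo + b_lo;               // step 1, wraps on overflow
    std::uint32_t carry = (lo < a_lo) ? 1u : 0u;  // wrapped means carry out
    std::uint32_t hi = a_hi + b_hi + carry;       // step 2

    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Halve the brightness of 8 bit pixels, eight per 64 bit word.
// The mask clears the bit that each byte's shift leaks into its neighbour.
inline void halve_brightness(std::uint8_t* pixels, std::size_t count) {
    assert(pixels != nullptr || count == 0);
    const std::uint64_t mask = 0x7F7F7F7F7F7F7F7FULL;
    std::size_t i = 0;
    for (; i + 8 <= count; i += 8) {
        std::uint64_t word;
        std::memcpy(&word, pixels + i, 8);   // unaligned-safe load
        word = (word >> 1) & mask;
        std::memcpy(pixels + i, &word, 8);   // unaligned-safe store
    }
    for (; i < count; ++i) {
        pixels[i] = static_cast<std::uint8_t>(pixels[i] >> 1);  // leftovers
    }
}
```

On a 64 bit build the first function collapses to a single add instruction if you just write `a + b`, and the second handles eight pixels per loop iteration. A 32 bit build has to split each 64 bit word operation in two, which is the kind of gap the paragraph above describes.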

C# has one particular weakness along these lines. Every reference to a class instance is a pointer-sized handle managed by the runtime’s garbage collector, so each one doubles in size in a 64 bit build, and the header on every object grows as well. (Strictly speaking this is tracing garbage collection, not the reference counted smart pointer known from C++, but the comparison is still useful.) The disadvantage for C# is that there’s no choice; every reference is built the same way. In native code, like C++, programmers make choices precisely for performance reasons, including avoiding reference counted smart pointers where they aren’t needed. Those carry a count far wider than it needs to be (it is quite rare for the usage count to exceed 6 or 8, let alone 2 billion or more), can cost an extra hit against cache to reach the count, and perform an atomic increment and decrement for each and every attachment and release. If C# code creates lots of objects and references, there can be considerable performance impact overall, and it may differ between 32 and 64 bit targets. In addition, the GC system of C# may well perform faster with the 32 bit memory layout than with the 64 bit one, since there are fewer bytes to allocate and scan. The pattern one writes in C# greatly impacts this issue. Novice and intermediate programmers may create “new” class objects frequently without realizing the performance cost in any mode, let alone noting it may have an even larger impact in 64 bit mode than in 32 bit mode.
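For the C++ side of that comparison, here is a minimal sketch of the choice a native programmer gets to make. `Sprite` and the three helper functions are hypothetical names for this example only. `std::shared_ptr` is the reference counted option and `std::unique_ptr` is the uncounted one; on the common standard library implementations I know of, the shared handle is two pointers wide and the unique handle is one, but the standard does not promise that, so the code only reports the sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical payload type.
struct Sprite {
    int x = 0;
    int y = 0;
};

// Shared ownership: every copy bumps a count kept in a separate control block.
inline long owners_after_two_copies() {
    std::shared_ptr<Sprite> a = std::make_shared<Sprite>();
    assert(a.use_count() == 1);
    std::shared_ptr<Sprite> b = a;   // count goes 1 -> 2
    std::shared_ptr<Sprite> c = a;   // count goes 2 -> 3
    return a.use_count();            // a, b and c all release on return
}

// Sole ownership: no count to maintain at all.
inline std::size_t unique_handle_bytes() {
    return sizeof(std::unique_ptr<Sprite>);
}

// Size of the reference counted handle, for comparison.
inline std::size_t shared_handle_bytes() {
    return sizeof(std::shared_ptr<Sprite>);
}
```

Comparing `shared_handle_bytes()` and `unique_handle_bytes()` in a 32 bit and a 64 bit build shows both handles growing with the pointer size, which is the part of the cost that C# code cannot opt out of.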

The ARM processor isn’t immune, but has some advantages over Intel in the comparison. ARM already has 16 registers in 32 bit mode (64 bit mode raises that to 31), and it has various instruction formats, such as Thumb, which are quite compact by comparison. Though ARM’s machine language is more RISC oriented (which means the program code is expected to be larger) than Intel’s language, the various modes of ARM’s instruction packing can overcome some of the impact. Still, there can be odd performance differences between 32 and 64 bit builds on ARM, just as you can see on Intel.

More generally, unless an application is designed to take advantage of the 64 bit processor’s larger registers, there’s no expected performance gain from merely building a 64 bit target, and potentially (not always) a performance detriment. Engineers were (and still are) puzzled by the 64 bit smart phone, in part because phones don’t usually have more RAM by the large margins you’d expect on workstation and laptop machines, and all those underutilized transistors take power.

None of this is intended to suggest it is always better to build 32 bit targets. On OS X, for example, depending on the version, it may be or become impossible to run 32 bit targets. But if an application doesn’t require the RAM, and it isn’t possible to tune for the 64 bit CPU (which C# somewhat blocks), then a 32 bit build is likely to perform a little better, and impact the GC of C# less.
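Since so much of this depends on the specific code and machine, measuring is the only way to know which build wins for you. Below is a minimal C++ timing sketch, not a rigorous benchmark: the workload `sum_buffer`, the `Timing` struct and `time_workload` are all hypothetical stand-ins for whatever you actually care about. The idea is to compile the same source twice (with GCC or Clang that is typically `-m32` and `-m64`, assuming the 32 bit libraries are installed) and compare the reported times.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical workload: sum a buffer of 32 bit values into a 64 bit total.
inline std::uint64_t sum_buffer(const std::vector<std::uint32_t>& data) {
    std::uint64_t total = 0;
    for (std::uint32_t v : data) {
        total += v;
    }
    return total;
}

struct Timing {
    std::uint64_t result;  // returned so the compiler cannot discard the work
    double milliseconds;   // wall-clock time for all repetitions
};

// Run the workload `repetitions` times over `elements` values and time it.
inline Timing time_workload(std::size_t elements, int repetitions) {
    assert(repetitions > 0);
    std::vector<std::uint32_t> data(elements, 3u);  // every element is 3
    std::uint64_t result = 0;

    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repetitions; ++r) {
        result += sum_buffer(data);
    }
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> elapsed = stop - start;
    return Timing{result, elapsed.count()};
}
```

Run it several times per build and look at the spread, not a single number. Differences of a few percent are easily within the noise of a desktop machine.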

None of us have a clue just now about the actual impact on this comparison due to the recent Spectre and related CPU vulnerability patches and microcode updates being distributed. There could be ‘bugs’ or odd and as yet un-researched differences that occasionally give 32 bit builds some peculiar advantage.

I know there’s a lot that could be explored on the subject, but that covers what I believe are the major hits.