I have been trying to speed up our Unity build pipeline for a few days now, and have noticed something I thought was peculiar.
Throughout the execution of the pipeline, many different subprocesses get started:
gmcs, for compiling the CSharp code
Asset building/downloading from the cache server, scene baking and sprite packing are done in the Unity process itself
UnusedByteCodeStripper2
il2cpp then gets executed
What I noticed is that throughout that process, we never use more than 100% of a single core in terms of CPU power. And since there are no flags (that I could find) to set the job count (think make -j, msbuild /m, etc), I essentially have no way to use the full resources of my server to speed up the build process.
In comparison, for Android and iOS projects (the ones generated by Unity) I was able not only to speed up builds by tweaking the number of jobs, but also to use distcc to increase the processing power available to the build process.
I would especially like il2cpp and scene baking to use all available cores. In the best of all worlds, being able to distribute the process, either through distcc or through a Unity tool that would sit on builder nodes, would be even better.
But back to my original question: is the Unity build tool stack really limiting everything to single core usage? Is there a flag, set of flags, or alternative build process I should use to help Unity make full use of my build machine resources?
The gmcs and UnusedByteCodeStripper2 tools are running single threaded, and probably will continue to do so. However, they usually make up a very small portion of the time for the overall toolchain, so they probably won’t be made parallel soon.
Assets building/downloading, on the other hand, can be very time consuming. We have active work going on to make it parallel. That work is still some time from shipping, as this is really complex and must be done correctly, but it is on the way.
IL2CPP does its code conversion (IL → C++) single threaded. We’ve looked at making this parallel, and may indeed do more of that in the future, but many parts of the conversion process are inherently serial, so we can’t gain too much of a win here. Again though, the code conversion is not the biggest time sink for a build. The biggest part of the build from IL2CPP is the C++ compilation step. That indeed is done in parallel (or should be) in the same way that something like distcc works, by compiling multiple translation units at the same time. Of course the link step is serial, but that should be it.
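To make the "parallel compile, serial link" shape concrete, here is a minimal Python sketch. It is not how Unity or IL2CPP actually drives the compiler: `compile_unit` and `link` are hypothetical stand-ins that only manipulate file names, and the `.cpp` names are made up. In a real build each worker would be waiting on an external compiler process, which is why plain threads are enough here.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def compile_unit(source_name):
    # Hypothetical stand-in for running a C++ compiler on one translation unit.
    return source_name.rsplit(".", 1)[0] + ".o"


def link(object_files):
    # Hypothetical stand-in for the link step, which runs once, serially.
    return "linked:" + ",".join(sorted(object_files))


def build(sources, jobs=None):
    # Default to one worker per core, similar in spirit to `make -j`.
    jobs = jobs or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Units are independent, so they can be compiled concurrently.
        objects = list(pool.map(compile_unit, sources))
    # Linking needs every object file, so it only starts after all units finish.
    return link(objects)


if __name__ == "__main__":
    print(build(["b.cpp", "a.cpp", "c.cpp"], jobs=4))
```

The point of the sketch is only the dependency structure: nothing in the compile phase waits on anything else, while the link step waits on everything.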
Agreed, compile and byte code stripping isn’t really an issue. Although upgrading to C#6 compile stack (so replacing gmcs essentially) did shave about a minute off total build time… which isn’t a whole lot overall, but was a nice surprise
il2cpp code conversion though? 10 minutes if it needs to convert from scratch (5 minutes if already built).
You seem to say that the C++ code does get compiled. I haven’t seen that happen, nor have I seen any parallel operation happen there either. Checked for compiler subprocesses, pstree, nothing.
If you mean Xcode or the Android project itself, that is what I previously mentioned as having optimised with distcc. That I can get to become reasonably fast. On top of distcc I also use ccache over a distributed file system, and even with the network hit I can easily bring a 10-minute build down to 2 (Xcode, still working on measuring for Android at the moment).
To be honest, Unity is really getting in the way of itself when it comes down to making builds that can run fast. The generated projects or the compile path to native code for a given platform is pretty much a non-issue at this point, really.
Going through il2cpp dlls it seems like there is some code to deal with running stuff in parallel: x.com (apologies for the tone of the tweet, I was a bit exasperated about the whole thing and didn’t take into consideration that this might be present for experimental reasons).
At this point, I just don’t know what to do to make things faster, because buying better hardware wouldn’t even make much of a difference (already on SSD, I could get a better CPU but then again wouldn’t end up using much of it).
[QUOTE=“stelcheck, post: 2782703, member: 791621”]Agreed, compile and byte code stripping isn’t really an issue. Although upgrading to C#6 compile stack (so replacing gmcs essentially) did shave about a minute off total build time… which isn’t a whole lot overall, but was a nice surprise
[/QUOTE]
Yes, newer C# compilers are running much faster. If you have not done so yet, check out the preview build of Unity 5.4 we shipped with the C# compiler from Mono 4.4. We’re seeing large improvements in C# compile time with this version: http://forum.unity3d.com/threads/upgraded-c-compiler-on-5-4-0p4.430197/. This will ship officially in Unity 5.5.
This seems a bit odd. IL2CPP code conversion from IL → C++ does not do anything incrementally, so it should not matter if things are already built. Ten minutes seems like a long time, although it is not unheard of on some projects. If it is possible for you to share your project in a bug report, we can have a look at it for profiling - we’re always looking for stress test cases.
I should clarify, sorry. Which platform are you working with? The way the C++ code is compiled differs per platform.
Generally though, I don’t have any good suggestions for making things compile faster. Depending on your platform, we have a few improvements in the pipeline. But we’re always looking to improve, so I appreciate your willingness to look at this.
Unfortunately, I cannot share anything about the current project I am running things against. However, the most lengthy parts of compilation seem to be coming from transpiling DLLs that all games would need (mscorlib, Unity DLLs, etc).
So far, I have been running mostly Xcode builds.
I am not much of a C# coder, but I figured now would be as good of a time as ever to explore.
So far, what I find is that the two biggest time consumers are WriteGenerics and AllAssemblyConversion (taken from TinyProfiler), with easily 7 to 8 minutes of the total time. Some parallelism does seem to be achievable, however; there is already some parallel-execution-ready code around SourceWrite.WriteGenerics, and simply rewriting this with Parallel.Invoke does not break anything and makes the conversion about a minute faster overall on the project I am testing.
I tried the same strategy around other chunks, but clearly the issue appears to be centered on NamingComponent. It’s that, or one of the other singletons defined through IoC.
As I mentioned I am not much of a C# programmer; I really just took the opportunity to learn a thing or two in the process. But I am curious; from what I can see, the major pain point for proper multithreading doesn’t seem to be logical (the preprocessing phase takes some time but not a whole lot in comparison and everything gets put in order there before any further processing is done) but architectural (LOTS of singletons where having a pool of reusable resources would be better). Am I reading this one correctly? Are there any specific reasons why the assembly conversion and/or the WriteGenerics phases have to be so sequential? Of course, I do not wish to underestimate the amount of effort that will be needed to add more parallelism to the current codebase, but I am curious as to whether current limitations are unavoidable.
Finally, I am wondering why Unity DLLs and mscorlib have to be recompiled at all? Couldn’t those files just be compiled once, and then simply added from cache to the project? That alone looks like it would be saving an incredible amount of compile time.
Unfortunately WriteGenerics needs to be sequential, at least for the way we have chosen to implement it. The key question here is whether we want to have duplicate generic code. If List, for example, is in two different assemblies, IL2CPP will use only one definition for that type and its methods. If we processed the assemblies in parallel, we would either need to have a different definition per assembly (increasing code size) or add another step to the code conversion pass to remove the duplicate definitions. We’ve found that leaving this step serial is actually a better option for performance now.
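The trade-off described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not IL2CPP's implementation: the assembly names and generic instantiations are made up, and a "definition" is just an `(assembly, type)` pair.

```python
def convert_serial(assemblies):
    # One shared "already emitted" set, so assemblies must be walked in order.
    emitted = set()
    output = []
    for name, generics in assemblies:
        for generic in generics:
            if generic not in emitted:
                emitted.add(generic)
                output.append((name, generic))
    return output


def convert_independently(assemblies):
    # What you get if every assembly is converted with no shared state,
    # which is the situation a naive parallel conversion would produce.
    return [(name, generic) for name, generics in assemblies for generic in generics]


def remove_duplicates(definitions):
    # The extra pass needed afterwards to get back to one definition per type.
    seen = set()
    merged = []
    for name, generic in definitions:
        if generic not in seen:
            seen.add(generic)
            merged.append((name, generic))
    return merged


# Hypothetical input: both assemblies instantiate List<int>.
assemblies = [
    ("Game.dll", ["List<int>", "Dictionary<string,int>"]),
    ("Plugin.dll", ["List<int>", "List<float>"]),
]

if __name__ == "__main__":
    print(len(convert_serial(assemblies)))
    print(len(convert_independently(assemblies)))
```

With this input the serial walk emits three definitions, the independent conversion emits four (the shared instantiation appears twice), and the merge pass brings it back to three. Whether the merge pass costs less than the time saved by converting in parallel is exactly the question raised above.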
NamingComponent is certainly a bottleneck now, but we’ve done some recent work to make calls to it idempotent, so that the possibility of generating names for C++ types on multiple threads is there. We should spend some more time with this and see if we can make the code which uses it more parallel.
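As a rough illustration of why idempotent naming matters for threading, compare a counter-based namer with one that derives the name purely from its input. Neither class nor function below reflects the real NamingComponent; the naming scheme (sanitized type name plus a short hash) is an assumption chosen only to make the property visible.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor


class CounterNamer:
    """Order-dependent: the name a type gets depends on who asked first."""

    def __init__(self):
        self._lock = threading.Lock()
        self._names = {}

    def name_for(self, type_name):
        with self._lock:
            if type_name not in self._names:
                self._names[type_name] = f"T_{len(self._names)}"
            return self._names[type_name]


def idempotent_name(type_name):
    """Pure function of the input: same result on any thread, in any order."""
    digest = hashlib.sha1(type_name.encode("utf-8")).hexdigest()[:8]
    safe = "".join(c if c.isalnum() else "_" for c in type_name)
    return f"{safe}_{digest}"


def name_all_in_parallel(type_names, workers=4):
    # No lock and no shared table are needed when naming is a pure function.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(idempotent_name, type_names))


if __name__ == "__main__":
    print(name_all_in_parallel(["List<int>", "Foo.Bar"]))
```

With the counter-based version, two threads racing to name different types can produce different generated names from one build to the next, so callers have to be serialized to keep output stable. With the pure version, the order of calls no longer matters.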
The main reason Unity and mscorlib assemblies are not cached is code stripping. A small change in the script code can cause large parts of these assemblies to be stripped or not, meaning they would need to be converted again. This is certainly a solvable problem though, and is one that is on our radar for optimization.
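One way to picture the caching problem is a cache keyed on the stripped contents of an assembly rather than on the assembly file itself. The sketch below is hypothetical throughout: `strip`, `ConversionCache` and the method names are invented for illustration, and the real stripper works on IL, not on lists of strings.

```python
import hashlib


def strip(assembly_methods, used_methods):
    # Hypothetical stripper: keep only the methods that script code reaches.
    return sorted(m for m in assembly_methods if m in used_methods)


def cache_key(stripped_methods):
    # The key depends on what survived stripping, not on the original assembly.
    joined = "\n".join(stripped_methods).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()


class ConversionCache:
    def __init__(self):
        self._store = {}
        self.conversions = 0  # counts how often real conversion work was done

    def convert(self, stripped_methods):
        key = cache_key(stripped_methods)
        if key not in self._store:
            self.conversions += 1
            # Stand-in for the expensive IL to C++ conversion.
            self._store[key] = [f"// C++ for {m}" for m in stripped_methods]
        return self._store[key]


if __name__ == "__main__":
    mscorlib = {"String.Concat", "Math.Sqrt", "File.Open"}
    cache = ConversionCache()
    cache.convert(strip(mscorlib, {"String.Concat"}))
    cache.convert(strip(mscorlib, {"String.Concat"}))               # cache hit
    cache.convert(strip(mscorlib, {"String.Concat", "Math.Sqrt"}))  # cache miss
    print(cache.conversions)
```

The third call shows the issue described above: a script change that starts using one more method changes the stripped result, so the cached conversion for that whole assembly no longer applies. A finer-grained key (per type or per method) would be one way around that, at the cost of more bookkeeping.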
So I guess the bottom line here is: you have identified some places we can improve code conversion performance. I think that a number of these changes would be nice, and we will consider working on them as we have time.