ECS has a lot of tiny wrapper methods that basically compile down to just a few (or zero) assembly instructions. However, when deep profiling, the sheer number of these tiny wrapper methods results in a huge amount of overhead. Don’t be fooled by the time values in the profiler view; the profiler tries hard to subtract away the overhead of the profiling itself. But the overhead can definitely be felt when the whole game runs at barely a few fps (in contrast to over 100 fps without deep profiling).
In just a few seconds, it racks up almost one million calls to IntPtr.op_Explicit. This method is basically a no-op: it reinterprets a value as a pointer type. Similarly, Burst.CompilerServices.Hint.Unlikely is just an annotation method, and should compile down to literally nothing when handled correctly.
I argue that Unity should annotate these tiny methods (and similar ones) with [IgnoredByDeepProfiler] to reduce the overhead and declutter the profiling view.
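To make that concrete, here’s a minimal sketch of what such an annotation could look like. The wrapper method itself is made up for illustration, and I’m assuming the attribute lives in Unity.Profiling:

```csharp
using Unity.Profiling; // assumed home of IgnoredByDeepProfilerAttribute

static class TinyWrappers
{
    // Hypothetical one-line wrapper of the kind discussed above. It compiles
    // down to (almost) nothing, so instrumenting it only adds overhead and
    // clutter to the deep profile.
    [IgnoredByDeepProfiler]
    public static int AssumePositive(int value)
    {
        // In real code this would carry a hint for the compiler; here it
        // simply returns its argument unchanged.
        return value;
    }
}
```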
Am I right in guessing that this is running with Burst compilation disabled? Because if Burst compilation were enabled, most of those tiny methods would be inlined or (in the case of Hint.Unlikely) removed altogether.
Burst compilation is enabled during these tests. However, you are right that all of these tiny calls come from non-burstified code.
The burstified code runs ridiculously quickly, but not everything can run with Burst. Some systems need to interact with the rest of the world, and some MonoBehaviours have to access entity data.
I think it would be good to ensure deep profiling is at least moderately performant when using non-burstified ECS, especially given that it seems like a very easy win: just add a few attributes (though IntPtr.op_Explicit might need special casing, since it’s not provided by Unity directly).
Deep profiling doesn’t really touch burstified code anyway.
Note that the green Burst jobs take roughly the same amount of time, but the non-burstified jobs slowed down by a ridiculous amount. Some slowdown is of course expected, but 100x is a bit much, I think. Wouldn’t you agree?
To a large extent, the slowdown is, I think, due to lots of tiny wrapper methods like IntPtr.op_Explicit, CollectionHelper.AssumePositive, UnsafeList.get_Length, float3..ctor, Vector3.get_zero, NativeArray.get_Length, and BurstCompiler.get_IsEnabled (since even my non-Burst jobs try to utilize Burst in some cases when it makes sense to do so).
It’s hard for me to verify that this is actually the case. But doing millions of tiny things tends to add up, regardless of how tiny they are.
Generally, it would be great if the deep profiler excluded auto-properties, trivial constructors, and trivial functions (e.g. those that simply return their argument), since this covers almost all of the worst offenders, and they are almost never interesting to profile. But I understand if this is handled by different teams and might be trickier to do.
Pitching in from a Profiler perspective here (and these are just my thoughts, so take them with a pinch of salt): turning on Deep Profiling is useful for understanding what is running. It’s not so useful for understanding how long it takes. I would recommend using it sparingly: turn it on, see what’s running, add your own markers if you need to, and turn it off again.
(I assume you all probably know what’s happening, but for anyone who’s interested: deep profiling automatically adds profiling markers to all managed functions. Each marker records a little bit of data when you enter the function and the length of time when you leave it. Recording this information makes code that calls a lot of functions look more expensive in the profiler, because it does more recording work. Sometimes it can also look worse when running managed code across lots of threads due to synchronization, though we do a lot to avoid this.)
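For anyone who wants to follow the “add your own markers” suggestion, here is a minimal sketch using Unity.Profiling.ProfilerMarker (the class and marker names are made up):

```csharp
using Unity.Profiling;
using UnityEngine;

public class PathUpdateBehaviour : MonoBehaviour
{
    // One explicit marker around the region of interest, instead of letting
    // deep profiling instrument every managed call in the project.
    static readonly ProfilerMarker s_TickMarker = new ProfilerMarker("PathUpdate.Tick");

    void Update()
    {
        using (s_TickMarker.Auto())
        {
            // ... the work you actually want to measure ...
        }
    }
}
```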
We do have ideas about how we might improve this in the future, but in the short term I’ll stick with my recommendation above: try to keep use of deep profiling to a minimum.
@peter_hall Thanks for your input. Sure, but it’s also useful if the game is playable.
I’m the author of a popular asset store package, and some of my users are complaining that this overhead is so great that they simply cannot use deep profiling in their game.
I would at least argue that auto-properties should be excluded from deep profiling. They are basically equivalent to field accesses, and it’s not like the deep profiler profiles every single field access, right?
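To illustrate why (this type is hypothetical): an auto-property is just a hidden backing field plus compiler-generated get_/set_ methods, and it is those generated methods that deep profiling instruments, while plain field reads go unmeasured.

```csharp
public struct AgentState
{
    // Plain field: reading it is a single load and never shows up in a deep profile.
    public float Speed;

    // Auto-property: the compiler emits get_RemainingDistance()/set_RemainingDistance()
    // methods over a hidden backing field, and deep profiling adds a marker to each,
    // even though the work done is the same as a field access.
    public float RemainingDistance { get; set; }
}
```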
Just to add to what @half_voxel has mentioned: here is the scene that gets 1-2 fps when using the deep profiler. It’s a really basic scene with barely any code running, and this same scene gets 100+ fps without deep profiling.
Note that nothing is running other than the A* AI system, and even that is doing nothing (all the paths and destinations are null) and all the entities are idle.
When not using ECS, the deep profiler can be used in this scene at 60+ fps.
but turning on Deep Profiling is useful to understand what is running
Agreed, and that’s what we have been using deep profiling for, but it is not very useful when the fps is in the single digits, since we can’t get the game into the state we want to profile.
That is a more extreme difference than I’d have expected. I’m interested in finding a way to recreate this locally so that I can try to improve the situation. As suggested above, maybe we can exclude some things from the deep profile (or even just optimize it). If anyone here has a project I can use, or tips on how to recreate this locally, it would be much appreciated!
Yeah, this can be recreated with just a simple scene with 400 units with the FollowerEntity component placed on them. This scene runs at ~300 fps without profiling, ~250 fps with normal profiling, and ~2.5 fps with deep profiling on.
The performance degrades linearly down to about 1-2 fps with the deep profiler on (so 200 units gets double the fps of 400, 100 units gets 4x the fps of 400, etc.). However, with normal profiling the performance is pretty stable (just a slight decrease) even when increasing from 400 to 1000, 2000, 3000 units, etc.
Thanks @xjjon .
And thanks @peter_hall for taking a look at it
If you want another example, you can download the A* Pathfinding Project Pro from the Asset Store, install the included samples, and search for the RecastExample scene. That should give you a test scene with a few hundred FollowerEntity agents. That’s the one I took my screenshots in (though I reduced the number of agents to 200).
@xjjon’s scene might be cleaner, though, and should show mostly the same issue (except for (1) below).
There are two places which are the most interesting to investigate, I think.
1. MineBotAnimation.OnUpdate()
This method reads some data from entities, and in the process has a lot of overhead due to tons of auto-property wrappers. Each individual one is not that expensive, but when there are 200 of them, it adds up. The code for this is not that complex (most of the complexity comes from trying to access entities from outside systems in the most efficient way).
2. JobRepairPath. This usually runs twice per frame as a parallel job. It gets about a 100x slowdown with deep profiling. To be fair, this job does a lot of stuff, and the code is a lot more complex than (1).