Improve CPU performance in build. I don't know what's the next step.

Arnold_2013 · August 3, 2022, 11:24am

I am using DOTS to make a PC VR ‘RTS/Tower defense’ game. A massive unit count is more fun, and the number of units will also be the simple knob to turn down should performance be insufficient at some point.
My own machine is the target platform: Ryzen 3700X CPU, 5700 XT GPU, Windows 10, SteamVR 120fps Valve Index.

I am using URP, forward renderer, Hybrid Renderer V2, Entities 0.51, Unity 2021.3.5f, Havok Physics.

The following image was made with an IL2CPP development build, with the profiler opened in the Unity editor (I can’t get the profiler to work without the editor being open, but I hope this is good enough for now). There are ~300 units in the level at this time (so not massive, but could be enough for a game). Currently no animation.

I wrote a few comments in magenta to indicate my current thinking at this moment in time.

From the WaitForGfxCommandsFromMainThread I am assuming I am CPU bound.
I understand VR cannot run at constantly changing framerates since the user would become seasick. So this part will go away by itself once the performance is good enough to run 120fps.
I disabled the camera occlusion culling. I have an outdoor scene [image below], so the CPU overhead would not be worth it. I was hoping this would remove the “0.5 ms of CullScriptable”, but since it did not, I assume this is the frustum culling.
With a 256m x 256m terrain and a 20m high player the realtime shadows don’t fit the entire view distance, so I baked the lighting. But now light probes are needed to have any form of shadow… and more light probes impact the CPU performance.

Audio/VFX graph companion GameObjects are currently not pooled (unless component objects have this automatically somehow), maybe this would help?

I don’t use LODs so the vertex/triangle count is far higher than it needs to be. But I don’t see the CPU impact of this in the profiler. Could this help?
I see a lot of Idle threads, so I think I need to schedule better. But everything that uses a Translation waits for the physics step to finish. Luckily these can be scheduled while the physics is running on the worker threads.
Physics takes up a lot of the time, but the game leans heavily on the physics engine for movement, collisions and triggers. I saw a post on the forum about replacing physics projectiles with ray casting. This is on my mind, but there are not a lot of projectiles in this profiler frame.

How should I proceed to improve performance in the current state?
Are there mistakes in how I currently think about this profiler result?
Is the amount of time used by physics/rendering/… normal?
Reducing individual system times would take the most effort with the least amount of improvement, I think?

Any insight is appreciated.

This is the screenshot around the time of the profiler frame (there are more units that are not rendered due to fog of war effects). The image brightness is increased to better show the scene.

xVergilx · August 3, 2022, 1:41pm

That’s too many light probes. In reality - you don’t need that many. Place them only in places where lighting change is obvious. Less is better, because objects that are affected by different light probes are not batched together.

Check the Frame Debugger for more details (such as Draw Call count).

No idea about Havok Phys engine though, haven’t used it personally. But DebugStream looks like some sort of debug output. Gizmos / other debug info perhaps?

Yes, always pool objects on the MonoBehaviour (MB) side that get Destroyed / Instantiated often.

Arnold_2013 · August 3, 2022, 2:53pm

Thank you for the info. I’m going to reply point by point, but don’t feel pressured to give all the answers. It’s also for me to just get my issues organized (it’s a lot of text).

I’ll reduce the probe count

The frame debugger looks OK. When the terrain was still shadow casting the “number” was around 1500. Now the number is 374. Is there anything special to look for? They don’t all cost the same amount of resources, right? Half of these values is still the terrain.

Yeah, the debugstream itself is only 0.02ms, it’s just the system that forces all jobs to be completed which makes it look bad. Maybe this thing is not in a normal build, but I can only profile a development build.

These are the physics settings that I have (Enable sleeping did not work with my triggers, they would stop sending a trigger per frame while asleep) :

During conversion I use
dstManager.AddComponentObject(entity, visualEffectCompanion);
This automatically creates a ‘normal’ GameObject with a MonoBehaviour VFX… It creates a “…(Clone)” in the hierarchy, which is used as the prefab for the “…(Clone)(Clone)” when instances are needed.
I don’t know how to insert the pool into this conversion… Should I create an ECS converted pool in the scene and fill it with ComponentObjects, and manually update their location to the entity that wants to use them? Or is there a way to tell the added ComponentObject to use a pool?

Thanks.

xVergilx · August 3, 2022, 3:18pm

Less == better. Look into cases where calls are not batched together. See draw call budget for the target platform.
Stuff like post FX tend to cost much more than simple draws.

Perhaps disable the system via Enabled? Could help if nothing else requires it.

I don’t use default authoring / conversion. So as a suggestion for the “fire & forget” type of FX:

Make a data component that you’re going to react to in a system (or a query);
In this FX spawn system, do a ForEach on that query with .WithStructuralChanges().WithoutBurst().Run()
Spawn required pooled VFX / SFX.
Position / Rotation etc can be taken from the entity directly via ForEach or by other means.

Prefab can be taken from some arbitrary ScriptableObject. This SO can be inserted to the system during authoring phase from MonoBehaviour, or by any other means.

So this will not require any companion objects at all, and you’re free to use whatever pooling backend you want.
Hybrid things can sometimes be done more simply with a MonoBehaviour approach. Unless you’d want to “sync” position / rotation to the transform, then it’s a bit trickier. Personally I use systems that sync position and rotation from the entity to the transforms via an IJobParallelForTransform job at the end of the simulation.

TL;DR: Sometimes a companion object is not required at all.
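To make the “whatever pooling backend you want” part concrete, here is a minimal, engine-agnostic sketch of a pool in plain C#. The class name `SimplePool` and its hooks are made up for illustration and are not from the thread; in a Unity project the `create` delegate would typically wrap `Object.Instantiate(prefab)` and `onRelease` would deactivate the object. Unity 2021.1+ also ships `UnityEngine.Pool.ObjectPool<T>`, which may be the simpler choice.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal pool: hands out a previously released instance when
// one is available, otherwise asks the factory for a new one.
public sealed class SimplePool<T> where T : class
{
    private readonly Stack<T> _free = new Stack<T>();
    private readonly Func<T> _create;
    private readonly Action<T> _onGet;
    private readonly Action<T> _onRelease;

    public SimplePool(Func<T> create, Action<T> onGet = null, Action<T> onRelease = null)
    {
        _create = create ?? throw new ArgumentNullException(nameof(create));
        _onGet = onGet;         // e.g. enable the object and reset its state
        _onRelease = onRelease; // e.g. disable the object instead of destroying it
    }

    // Number of instances currently waiting to be reused.
    public int CountInactive => _free.Count;

    public T Get()
    {
        T item = _free.Count > 0 ? _free.Pop() : _create();
        _onGet?.Invoke(item);
        return item;
    }

    public void Release(T item)
    {
        if (item == null) throw new ArgumentNullException(nameof(item));
        _onRelease?.Invoke(item);
        _free.Push(item);
    }
}
```

The FX spawn system would call `Get()` when it sees the trigger data and `Release()` once the effect has finished, so nothing is instantiated or destroyed in steady state.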

DreamingImLatios · August 3, 2022, 4:20pm

I don’t think companion objects are actually your issue here. It seems that system is syncing on the transform system, of which you have one or two chunks with very deep hierarchies. You gotta get rid of that debug stream sync point in order to get any useful data, because a lot of your performance problems right now seem to be scheduling issues. I’m not even convinced reducing light probe count would even be necessary once your scheduling problems are resolved.

Arnold_2013 · August 9, 2022, 8:19pm

Thanks for the comments. The changes helped improve performance (it’s hard to compare since it’s always a random frame, but it feels better). I still don’t know how to further optimize the scheduling.

@VergilUa , I created a pooling system with ‘normal’ gameobjects to replace the componentobjects. It seems to work, and gives a lot more grip on the ‘normal’ gameobjects.

@DreamingImLatios , I disabled the PhysicsDebugStream System, now the EndFixedFrameEntityCommandBufferSystem causes the sync point. I don’t think there is a way around this, since the buffer will always create a sync point. I guess the xxSystemGroup ending will otherwise create a sync point… not sure.

Is the physics job being done on the main thread a big problem? The only way to reduce this is to schedule more systems, but I don’t have 1ms of main thread system scheduling left. Or put more systems that need to run on the main thread in this system group, but using a .Run() would also force all jobs to complete. So that would not work.

Or is the main scheduling problem that I use physics information (mostly the Translation, but also the physics velocity) to do things, and these systems will always be waiting until the physics step is done? Or if they run before the physics build, the physics build will be waiting for them… Should I create a copy of all transforms + velocities so some systems could depend only on the (1 or more frames old) data and not be held back waiting for PhysicsExport? Or try to create a system that does a lot of work, but is not related to the physics, like pathfinding or something, to fill up the Idle worker threads (this would not increase the performance of the current game)?

DreamingImLatios · August 10, 2022, 2:27am

If you don’t create an ECB using a given ECB system, that ECB system will not generate a sync point. If you need the sync point, there’s no getting around that. With that said, there’s still some single-threaded jobs whose names I can’t read. It would be interesting to explore those next.

Arnold_2013 · August 15, 2022, 1:30pm

I did a big refactor to schedule(parallel) more and where possible switched the execution order to be less restrictive. The other threads are definitely more filled (looking at the small green bars everywhere). The overall improvement is in the 0.1 - 0.2 ms range compared to the previous version… at least it did not get worse :-).

Is there any way to see what is preventing tighter scheduling?
Currently I look at the profiler timeline, and if a system starts at the same time a different system finishes… I look in the code to see if they are dependencies… But sometimes the dependency is not direct… it could be via the data, or an unrelated system. Like “systemB” could have an UpdateAfter(SystemA), UpdateBefore(SystemC). Which could make me wonder why SystemC is not running at the same time as systemA… in code you would have to look at the references of systemC and see it has a connection to systemA via systemB…
If I look at the Window → DOTS → Systems … I can see the relationships that are also directly in the code, but not the runtime “Dependency” including the indirect scheduling constraints.
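The indirect constraint described above (SystemC ending up after SystemA only because of SystemB) can be found mechanically with a small graph search. This is a sketch in plain C#, not a Unity tool: `OrderingGraph` is a hypothetical helper, the edges would have to be entered by hand from the `[UpdateBefore]` / `[UpdateAfter]` attributes in the code, and it only models update order. Job dependencies that come from which components a job reads or writes are not covered here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: stores "first must update before second" edges and
// finds the chain of systems that forces one system to come before another.
public sealed class OrderingGraph
{
    private readonly Dictionary<string, HashSet<string>> _before =
        new Dictionary<string, HashSet<string>>();

    // Record that 'first' is ordered before 'second'
    // (UpdateBefore on first, or UpdateAfter on second).
    public void AddBefore(string first, string second)
    {
        if (!_before.TryGetValue(first, out var successors))
        {
            successors = new HashSet<string>();
            _before[first] = successors;
        }
        successors.Add(second);
    }

    // Breadth-first search. Returns the shortest chain first -> ... -> second,
    // or an empty list when no ordering constraint links the two.
    public List<string> FindChain(string first, string second)
    {
        var parent = new Dictionary<string, string>();
        var visited = new HashSet<string> { first };
        var queue = new Queue<string>();
        queue.Enqueue(first);

        while (queue.Count > 0)
        {
            string current = queue.Dequeue();
            if (current == second)
            {
                // Walk back through the parents to rebuild the chain.
                var chain = new List<string> { current };
                while (parent.TryGetValue(current, out var p))
                {
                    chain.Add(p);
                    current = p;
                }
                chain.Reverse();
                return chain;
            }
            if (!_before.TryGetValue(current, out var successors)) continue;
            foreach (string next in successors)
            {
                if (visited.Add(next))
                {
                    parent[next] = current;
                    queue.Enqueue(next);
                }
            }
        }
        return new List<string>();
    }
}
```

With `AddBefore("SystemA", "SystemB")` and `AddBefore("SystemB", "SystemC")` registered, `FindChain("SystemA", "SystemC")` returns the three systems in order, which is the connection that otherwise has to be traced by reading references in the code.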

I also tried to get the ITriggerEventsJob to run in parallel, but only got it to work in a hacky way, and asked for advice in a separate thread. ( How do I detect collisions? )

Thanks in advance

This is a zoomed in version of the current profiling data, the right side of the frame timing did not change.

Arnold_2013 · August 26, 2022, 8:21am

TLDR : Help, how do I know what my GPU (AMD) is doing with its time?

Deeper into the rabbit hole… It appears I am GPU bound now, and it looks like figuring out what the GPU is doing is much more difficult than the CPU. I might need to make a new post for this, but since this is a continuation of my struggle I’ll add a small update here.

Any advice is appreciated.

After some CPU improvements I thought SteamVR was harsh in switching back to 50% of FPS… So I went to the VR forum for help.

“WaitOnSwapChain” was actually indicating GPU could not keep up, and CPU was having more than 2 frames ready for the GPU, but the GPU was still lagging behind.
I was really expecting a “Gfx.WaitForPresentOnGfxThread” or “GfxDeviceD3D11.WaitForLastPresent”… but the “Gfx.WaitForRenderThread” that can be seen in the screenshots above I attributed to its parent “XRBeginFrame”, taking that to be the “VSync of VR”. In VR, having a stable FPS (even if it’s lower) prevents you from feeling motion sickness.

Applying a RenderScale of 0.77 (with FSR 1.0 upscaling) did improve overall FPS. So it was time to look at GPU profiling. FrameDebugger gives some info on draw calls, but nothing on real performance. The GPU Module on the profiler only works when disabling graphics jobs in the settings.
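For a rough sense of the numbers involved, the arithmetic behind the frame budget and the render scale can be written out as below. This is only a back-of-the-envelope sketch in plain C#: `FrameMath` is a made-up helper, it assumes the render scale is applied to both width and height, and it ignores the cost of the upscaling pass itself.

```csharp
using System;

// Hypothetical helper for quick frame-budget estimates.
public static class FrameMath
{
    // Milliseconds available per frame at a given refresh rate.
    public static double FrameBudgetMs(double fps) => 1000.0 / fps;

    // Fraction of the full-resolution pixel count that is rendered when
    // both axes are multiplied by renderScale.
    public static double PixelFraction(double renderScale) => renderScale * renderScale;
}
```

At 120fps the budget is 1000 / 120 ≈ 8.33 ms per frame, and when the runtime falls back to 50% of the frame rate it becomes 1000 / 60 ≈ 16.67 ms. A render scale of 0.77 renders 0.77 × 0.77 ≈ 59% of the pixels, and a render scale of 0.5 renders 25%, which is why pixel-bound GPU work tends to respond strongly to this setting.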

But now I am still not getting the information I need… This is while running in editor :

I get a 3 ms difference on the GPU in the “Ini_ScriptableRenderContext.Submit”, but the children of this line don’t add up to this value… So I guess it’s ‘self’.
I am guessing it has to do with the Physics step being in the fast frame and updating all the transforms which need to be sent to the GPU… But most units are not visible (their render mesh has not been instantiated)… the terrain is a lot of vertices but it’s static so there is no need to update its transforms.

I would love to know for sure what the GPU is doing in this extra 3ms, but also what is costing the other 3ms. Because terrain popping in VR is very noticeable I have to have the pixel error set to 1. So just seeing the terrain could be a big cost.

I’m on an AMD GPU, they have a profiler but it only works with DirectX12 (Radeon™ GPU Profiler - AMD GPUOpen)… which for Unity is still experimental… I also saw RenderDoc being mentioned…

Frame N :

Frame N+1 :

Vert/Tris count etc. :

Screenshot for idea. Looking at bottom of terrain with some visible towers on top (RenderScale = 0.5, so it is really blocky):
