GPU Dynamic Upload Culling

Hello everyone! I hope you are all having a good week!

We recently had the Graphics Office Hours here, and during that time, some things were revealed about the future of ECS rendering. While most people I talked to seemed to be in favor, a lot of the reasons proposed contradicted my own experiences. I, myself, am very concerned that Unity might be making a big misstep.

The purpose of this post first-and-foremost is to educate. I in no way want to discredit the Unity engineers or decision-makers. I suspect very few people in the graphics industry know what I will be sharing. In fact, I discovered this by accident in an attempt to solve a different problem. While I will try to explain this in understandable language, I suspect this will likely go over many people’s heads. @VincentBreysse I saw you were answering many questions related to the future of ECS Rendering, and I suspect you may find this interesting. I hope you don’t mind the ping.

Also, as a disclaimer, it is very possible that there are things I am wrong about. I am eager to learn!

Background

I’ll first explain some of the concepts behind Unity’s high-performance rendering technologies. If you are already familiar with them, you may skip this section.

For a more complete background, I strongly suggest watching this video: https://www.youtube.com/watch?v=6LzcXPIWUbc

Unity has two high-performance rendering technologies, those being the Entities Graphics package, and the GPU Resident Drawer. These two technologies can coexist, but do not interact with each other in any way. Entities Graphics handles entities. GPU Resident Drawer handles Game Objects. Both of these technologies are backed by a low-level API known as BatchRendererGroup (BRG for short).

BRG combines both instancing and batching. The idea is that you can store many objects’ rendering data persistently on the GPU across multiple frames. However, unlike traditional instancing in Unity, you don’t have to rearrange and compact arrays down to just what is supposed to be rendered each frame. Instead, you specify which instances you want to draw, as well as their configurations, and BRG takes care of the rest. This allows other features like LODs and culling to participate in choosing which instances get drawn. And all of this gets jobified for maximum performance.
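To make the flow a bit more concrete, here is a rough, hedged sketch of the BRG API shape (not the actual Entities Graphics or GPU Resident Drawer code): resources are registered once, and a per-camera culling callback decides each frame which instances the draws include.

```csharp
using System;
using Unity.Jobs;
using UnityEngine;
using UnityEngine.Rendering;

// A rough sketch of the BRG flow, not the actual Entities Graphics code.
public class BrgSketch : IDisposable
{
    BatchRendererGroup m_Brg;
    BatchMeshID        m_MeshId;
    BatchMaterialID    m_MaterialId;

    public BrgSketch(Mesh mesh, Material material)
    {
        m_Brg        = new BatchRendererGroup(OnPerformCulling, IntPtr.Zero);
        m_MeshId     = m_Brg.RegisterMesh(mesh);          // registered once, reused across frames
        m_MaterialId = m_Brg.RegisterMaterial(material);
        // A real setup would also create a GraphicsBuffer holding per-instance data
        // (transforms, material properties) and register it with m_Brg.AddBatch(...).
    }

    // Unity invokes this for every culling pass (each camera, shadow cascade, etc.).
    // Jobs scheduled here fill cullingOutput.drawCommands with the visible instance
    // indices, which is where frustum culling and LOD selection get to participate.
    JobHandle OnPerformCulling(BatchRendererGroup rendererGroup,
                               BatchCullingContext cullingContext,
                               BatchCullingOutput cullingOutput,
                               IntPtr userContext)
    {
        // Omitted: allocate BatchCullingOutputDrawCommands, test instances against
        // cullingContext.cullingPlanes, and append the survivors as draw commands.
        return default;
    }

    public void Dispose() => m_Brg.Dispose();
}
```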

BRG was first rolled out for Entities Graphics, but it was then discovered that it could be used to make Game Object rendering faster too. And that’s how the GPU Resident Drawer became a thing. However, these two higher-level technologies have not evolved with the same features. Entities Graphics has a powerful per-instance material override system. And GPU Resident Drawer has LOD Crossfade and GPU Occlusion Culling. Naturally, I think everyone wants to see a little more convergence of these features. However, I encourage you all to not just consider the benefits, but also the costs.

A Warm-Up on GPU Occlusion Culling

This isn’t really what I want to talk about, but I worry that if I don’t, people will try to counter-argue using misinformation.

Not all occlusion culling is created equal!

Unity is a shader-heavy engine. Unity shaders generate many different variants. Besides the ones Unity provides out-of-the-box, there are tools like Shader Graph which generate even more shaders and variants, plus all the third-party assets that also ship with their own shaders. Juggling all these shaders and their properties and bindings incurs costs, and those costs are paid during the setup of draw calls. And while the SRP Batcher tries really hard to reduce this, there are still significant costs in a production project.

The way GPU Occlusion Culling works is that all candidate objects for rendering have their bounding boxes sorted and organized. The meshes with the largest and closest bounding boxes relative to the camera are rendered into an occlusion depth buffer. Then, a compute shader tests the bounding boxes of the remaining instances and marks them as visible or invisible in a GPU buffer.

What is important here is that while a GPU buffer contains info about which instances are visible, the CPU is still responsible for commanding the draw call. It can say “draw whichever instances in this buffer are visible once the compute shader is done”. However, GPU Occlusion Culling does NOT reduce draw calls! It does not eliminate the shader setup and all the variable bindings. All those costs still exist. Instead, what it is reducing are the number of triangles that are to be rasterized.
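As a hedged illustration of that split (a sketch, not Unity’s internal code): even with an indirect draw, the CPU still binds the material and submits the call every frame; only the instance count and the per-instance visibility live on the GPU. The cullShader field below stands in for an assumed visibility compute shader.

```csharp
using UnityEngine;

// A hedged sketch, not Unity's internal GPU Resident Drawer code. The CPU still binds the
// material and issues the draw each frame; the GPU-side culling only decides how many
// instances that draw rasterizes. "cullShader" is an assumed visibility compute shader.
public class IndirectDrawSketch : MonoBehaviour
{
    public Mesh mesh;
    public Material material;        // must read its per-instance data from buffers itself
    public ComputeShader cullShader; // assumed: writes instanceCount into the args buffer

    GraphicsBuffer m_Args;

    void OnEnable()
    {
        m_Args = new GraphicsBuffer(GraphicsBuffer.Target.IndirectArguments, 1,
                                    GraphicsBuffer.IndirectDrawIndexedArgs.size);
        var args = new GraphicsBuffer.IndirectDrawIndexedArgs[1];
        args[0].indexCountPerInstance = mesh.GetIndexCount(0);
        args[0].instanceCount = 0; // the culling dispatch bumps this on the GPU
        m_Args.SetData(args);
    }

    void Update()
    {
        // Dispatch the assumed culling shader here; it fills in the visible instance count.
        // cullShader.Dispatch(...);

        // All of the CPU-side setup below (shader, bindings, draw submission) happens
        // regardless of how many instances end up visible on the GPU.
        var rp = new RenderParams(material)
        {
            worldBounds = new Bounds(Vector3.zero, Vector3.one * 1000f)
        };
        Graphics.RenderMeshIndirect(rp, mesh, m_Args);
    }

    void OnDisable()
    {
        m_Args?.Dispose();
        m_Args = null;
    }
}
```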

When rendering a scene, most of the cost falls into texturing, lighting, and post-processing. Vertex processing and rasterization, while not completely free, usually are not as heavy (this assumes your LODs are set up correctly to avoid the tiny-triangle overdraw problem). If you do these calculations more than once per pixel (overdraw), that can add up. This is why many rendering engines opt to use a “depth pre-pass”. What this does is rasterize all the geometry once, without any of the complex texturing and lighting calculations, to set up the depth buffer, and then re-rasterize all the geometry again, this time only shading the pixels on objects that end up at the same depth as the depth buffer. These are the pixels known to actually be visible, where texturing and lighting will matter.

The hardware that handles the indirect drawing used by occlusion culling well also tends to be the hardware that benefits most from a depth pre-pass. And if it isn’t obvious, the depth pre-pass is typically the stronger optimization of the two.

When both techniques are used together, GPU Occlusion Culling ends up reducing the number of vertices and triangles that get rasterized but would have been caught and eliminated by the depth pre-pass anyway. This is why, in many projects, GPU Occlusion Culling combined with a depth pre-pass usually only offers a few percentage points of improvement. It is worth using if you have it, as it is a very safe optimization, but it is not optimizing the slowest part of rendering. It tends to be especially good when you have lots of small instances, such as vegetation, or when your vertex shader is more complex, such as water or GPU animation textures. It would also be great for skinned meshes if it weren’t for the fact that Unity opts to use compute shaders for skinning. There are definitely some projects out there that will see massive gains with GPU Occlusion Culling, but this isn’t the norm.

Compare all this to CPU Occlusion Culling, which while having many of its own problems, is also capable of completely eliminating draw calls and shader setups.

The takeaway here is that while having occlusion culling would be nice for ECS rendering, it is not as big of an optimization as people seem to believe it would be. And one should be wary about what other sacrifices may be made to have it.

GPU Dynamic Upload Culling

Alright. It is time to talk about a big optimization that Unity is missing. Some of you may remember a few months back I shared this video. It is time I explain one of the biggest pieces behind it.

https://youtu.be/AgcRePkWoFc

I want to focus on that very first scenario, the 1 million spinning cubes. I remember when I first shared this, it baffled some people. Some thought that this had to do with my custom transform system.

I’ve updated both test projects to Unity 6 LTS. Let’s look at the profiler for each.

Vanilla Unity ECS

Latios Framework

In both versions, you’ll notice a marker on the main thread titled Gfx.WaitForPresentOnGfxThread. That means we are GPU-bound. But why is vanilla Unity’s so much worse?

The scenes are identical. The same camera, the same number of cubes visible, the same shadow settings, the same shaders on the cubes, it is all the same.

The transform systems being used are different. Unity has LocalTransform and LocalToWorld, the latter being 64 bytes in size, whereas the Latios Framework has a single QVVS WorldTransform that is only 48 bytes. Perhaps the smaller size means less data sent to the GPU?

Not at all. Entities Graphics chops off the constant bottom row of each matrix, sending each one as a float3x4, which is also 48 bytes. The amount uploaded per transform is the same.
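For illustration, that packing amounts to something like this (a sketch using Unity.Mathematics, not the package’s actual code). The dropped bottom row of an affine transform is always (0, 0, 0, 1), so nothing is lost:

```csharp
using Unity.Mathematics;

public static class TransformPacking
{
    // Sketch of the upload packing described above: the bottom row of an affine
    // LocalToWorld is always (0, 0, 0, 1), so only the top 3x4 block is sent.
    public static float3x4 PackLocalToWorld(in float4x4 localToWorld)
    {
        return new float3x4(localToWorld.c0.xyz,
                            localToWorld.c1.xyz,
                            localToWorld.c2.xyz,
                            localToWorld.c3.xyz);
    }
}
```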

But float3x4 is a different form than a QVVS. How does that get resolved?

Both Entities Graphics and the Latios Framework upload transforms and other material properties via a compute shader, known as the Sparse Uploader. This compute shader is responsible for receiving buffers full of new values, and scattering them into the persistent storage buffer. Hence the “sparse”. It will also compute inverses of matrices to avoid having to upload both the LocalToWorld and the WorldToLocal. The Latios Framework uses a modified version that can convert QVVS Transforms into float3x4 prior to the inverting and storage. So really, the Latios Framework should be slightly more expensive on the GPU.
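Here is a hypothetical sketch of what that QVVS-to-float3x4 conversion looks like. The field names and layout below are assumptions for illustration, not the actual Latios Framework types, and the real conversion runs inside the modified uploader compute shader rather than in C#:

```csharp
using Unity.Mathematics;

// Hypothetical QVVS layout for illustration only; the real Latios type differs.
public struct QvvsSketch
{
    public quaternion rotation;
    public float3     position;
    public float3     stretch;  // per-axis stretch
    public float      scale;    // uniform scale
}

public static class QvvsPacking
{
    // Expand rotation * (stretch * scale) into a 3x3, then append translation
    // as the fourth column to get the same float3x4 form used for matrices.
    public static float3x4 ToFloat3x4(in QvvsSketch t)
    {
        float3x3 rs = math.mul(new float3x3(t.rotation), float3x3.Scale(t.stretch * t.scale));
        return new float3x4(rs.c0, rs.c1, rs.c2, t.position);
    }
}
```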

While the transform systems may be different and impact CPU performance, they aren’t the reason for the GPU performance discrepancy. For that, we need to look at the jobs responsible for copying the transforms into the Sparse Uploader buffers.

Vanilla Unity ECS

Latios Framework

Notice that in the Latios Framework, the job runs later in the frame, and has a much smaller duration. The smaller duration comes from copying less data to the buffers. And if there is less data going into the buffers for upload, that means the GPU is receiving less data. Less data means less work, and that’s where the GPU is saving time.

The algorithm Entities Graphics uses for choosing what to upload is simple. If a material property has changed since last frame (LocalToWorld is considered a material property), then Entities Graphics uploads it. Because all 1 million cubes are rotating every frame, all 1 million transforms get uploaded every frame.

But most of the cubes are outside of the camera frustum, so most of this work is wasted; by the next frame, those transforms are outdated anyway.

The Latios Framework keeps track of not only when material properties have changed, but also whether or not it has uploaded those material properties since they last changed. With this, and by only considering uploading chunks with entities that passed frustum culling in a given frame, it is able to reduce the amount of data sent. And yes, I do this at chunk granularity currently, though I think I could make this work at entity granularity in the future.
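For the curious, the per-chunk decision boils down to something like the sketch below (hypothetical code, not the actual Latios implementation; chunkIsVisible and lastUploadedVersion are assumed bookkeeping arrays, and the scatter destinations are omitted). Entities Graphics’ rule is effectively just the DidChange test, without the visibility check or the per-chunk upload version:

```csharp
using Unity.Burst;
using Unity.Burst.Intrinsics;
using Unity.Collections;
using Unity.Entities;
using Unity.Mathematics;
using Unity.Transforms;

// Hypothetical sketch, not the actual Latios code. A chunk's transforms are staged
// for upload only if the chunk passed frustum culling this frame AND its data changed
// since the last time it was uploaded.
[BurstCompile]
struct GatherDirtyVisibleTransformsJob : IJobChunk
{
    [ReadOnly] public ComponentTypeHandle<LocalToWorld> localToWorldHandle;

    // Assumed bookkeeping: per-chunk visibility from this frame's culling pass, and the
    // change version captured the last time each chunk was uploaded.
    [ReadOnly] public NativeArray<bool> chunkIsVisible;
    [NativeDisableParallelForRestriction] public NativeArray<uint> lastUploadedVersion;

    // Staging values for the sparse uploader (scatter destinations omitted for brevity).
    public NativeList<float3x4>.ParallelWriter uploadBuffer;

    public void Execute(in ArchetypeChunk chunk, int unfilteredChunkIndex,
                        bool useEnabledMask, in v128 chunkEnabledMask)
    {
        // Skip chunks that are off screen this frame.
        if (!chunkIsVisible[unfilteredChunkIndex])
            return;

        // Skip chunks whose transforms have not changed since we last uploaded them.
        if (!chunk.DidChange(ref localToWorldHandle, lastUploadedVersion[unfilteredChunkIndex]))
            return;

        var transforms = chunk.GetNativeArray(ref localToWorldHandle);
        for (int i = 0; i < chunk.Count; i++)
        {
            var m = transforms[i].Value;
            uploadBuffer.AddNoResize(new float3x4(m.c0.xyz, m.c1.xyz, m.c2.xyz, m.c3.xyz));
        }

        // Remember that this chunk's current data has now been uploaded.
        lastUploadedVersion[unfilteredChunkIndex] = chunk.GetChangeVersion(ref localToWorldHandle);
    }
}
```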

In practice, the algorithm Entities Graphics uses is great when most entities are static. However, in worlds densely populated with dynamic entities, this “keep the whole world in sync” approach can be quite expensive. This just so happens to be my use case. That’s why I use ECS.

Other game engines chase after the buzzword term “GPU-driven” in which the GPU starts to become responsible for choosing what to draw, including basic things like culling, LODs, and filtering. Ever since discovering the impact syncing the whole world has, I’ve become wary of these techniques. My concern is that Unity may be headed down this same GPU-driven path.

From what I can gather, future changes might prevent the ability to do this kind of granular change tracking, due to isolating the data to another thread. And because Unity already does the “sync-the-world” thing in both Entities Graphics and GPU Resident Drawer, anyone using those technologies wouldn’t know they are missing out. But I know, and hopefully by writing this post, you know too.

2X performance is a big deal!

Thanks for reading!

17 Likes

Thank you for raising concerns on this issue. It’s early days for U7, and I hope Unity really considers the opinions of other people, especially low-key experts like you.

2 Likes

Thanks for the writeup and the ping! There is a lot to answer here.

However, these two higher-level technologies have not evolved with the same features. Entities Graphics has a powerful per-instance material override system. And GPU Resident Drawer has LOD Crossfade and GPU Occlusion Culling. Naturally, I think everyone wants to see a little more convergence of these features.

You are absolutely right. And I answered a few questions about that on Discord yesterday. I’m actively working on unifying GPU Resident Drawer & Entities Graphics for Unity Next. This is part of a general initiative to make Unity simpler and less fragmented, as advertised in the last keynote.

More details will come later when it’s all landed. But the basic idea is to have only one built-in system driving a BatchRendererGroup and handling both GameObject & Entities.

FWIW, it’s an issue that has been known since before GPU Resident Drawer was even released. But things take time because… big companies are complicated.

The way GPU Occlusion Culling works is that all candidate objects for rendering have their bounding boxes sorted and organized. The meshes with the largest and closest bounding boxes relative to the camera are rendered into an occlusion depth buffer. Then, a compute shader tests the bounding boxes of the remaining instances and marks them as visible or invisible in a GPU buffer.

Not sure if you’re simplifying a lot or being incorrect here. Just in case, GPU Occlusion Culling with GPU Resident Drawer in Unity is a double depth pass occlusion culling method.

The first pass takes all the instances that passed CPU culling and compares their bounding spheres against the previous frame’s depth buffer. If an instance is occluded, it’s put aside; if it’s visible, we use it to build a depth pyramid.

Then we do a second depth pass: we take the instances that were put aside, compare them against the depth pyramid, and render the ones that pass the test.

This takes advantage of the previous frame’s depth buffer and hardware Hi-Z culling. It is a very efficient method, especially when you have dense meshes and good occluders. And FWIW, this is a common technique in AAA GPU-driven renderers.
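For readers following along, the core of that depth-pyramid test looks roughly like the CPU-side sketch below. This is an illustration only, not Unity’s actual compute shader; SampleDepthPyramid stands in for the GPU texture fetch, and details like reversed-Z and the exact radius projection are glossed over:

```csharp
using Unity.Mathematics;

// CPU-side illustration of a Hi-Z bounding-sphere test. The real version runs in a
// compute shader; this only shows the shape of the logic.
public static class HiZCullingSketch
{
    // Assumed stand-in for sampling the depth pyramid (returns the farthest depth in the texel).
    public delegate float SampleDepthPyramid(float2 uv, int mip);

    public static bool IsPotentiallyVisible(float3 center, float radius, float4x4 viewProj,
                                            int pyramidBaseSize, SampleDepthPyramid sample)
    {
        // Project the sphere center to clip space.
        float4 clip = math.mul(viewProj, new float4(center, 1f));
        if (clip.w <= radius)
            return true; // sphere touches or crosses the camera plane: treat as visible

        float2 uv = (clip.xy / clip.w) * 0.5f + 0.5f;

        // Pick a mip level where a single texel roughly covers the projected sphere,
        // so one fetch gives a conservative occluder depth.
        float radiusInTexels = (radius / clip.w) * pyramidBaseSize;
        int   mip            = (int)math.ceil(math.log2(math.max(radiusInTexels, 1f)));

        // Compare the instance's depth with the farthest occluder depth stored in that
        // texel; if the instance is farther, it is occluded. (A real implementation would
        // offset by the projected radius and handle reversed-Z conventions.)
        float instanceDepth = clip.z / clip.w;
        float occluderDepth = sample(uv, mip);
        return instanceDepth <= occluderDepth;
    }
}
```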

You’re right that it doesn’t impact CPU perf, though. Same number of draw calls, except they are indirect instead. This is indeed aimed at improving GPU perf in certain scenarios, but not CPU perf.

When both techniques are used together, GPU Occlusion Culling ends up reducing the number of vertices and triangles that get rasterized but would have been caught and eliminated by the depth pre-pass anyway. This is why, in many projects, GPU Occlusion Culling combined with a depth pre-pass usually only offers a few percentage points of improvement.

There might be some inefficiencies here that can be fixed in a later version. I’ve also heard many users report that they didn’t get much improvement with GPU Occlusion Culling, even in very favorable scenarios.

I think it’s likely some performance bugs or issues that can be fixed in general. It’s just a bit slow to prioritize because resources are scarce currently. But I don’t think the approach itself is bad.

To give you some context, the person who originally wrote the GPU Occlusion Culling system has left the company. And in general, many people left or got laid off.

The current situation is that I am basically alone owning GPU Resident Drawer, as well as the Culling area. Those two things used to have dedicated teams in the past. And I’m also the main person working on unifying GPU Resident Drawer & Entities Graphics.

So there is a lot to do, and not a lot of resources. We’re trying to get new people onboarded, but that’s a slow process. Although I do hope and think GPU Occlusion Culling will get better when Unity 6 matures.

From what I can gather, future changes might prevent the ability to do this kind of granular change tracking, due to isolating the data to another thread. And because Unity already does the “sync-the-world” thing in both Entities Graphics and GPU Resident Drawer, anyone using those technologies wouldn’t know they are missing out.

GPU Driven rendering does not mean rendering without doing anything on the CPU. You can be GPU Driven while still having CPU frustum & occlusion culling for example.

I think the kind of optimization you mentioned, like not uploading data to the GPU if instances are not visible is interesting. It’s a bit more complicated than that though, because GPU instance data like transforms can be needed even if not visible by the main camera. E.g. for shadow maps, ray tracing or custom vertex shaders for which it can be hard to have proper bounding boxes.

But I don’t see why it couldn’t be integrated into Entities Graphics or GPU Resident Drawer. Rendering does not have to be all CPU or all GPU; it can (and IMO should) be a mix of both.

9 Likes

I was oversimplifying, or rather providing a basic overview of a technique for occlusion culling, just enough to establish the point about draw calls. I do appreciate you providing a much more accurate explanation though! I am familiar with the technique you describe.

I 100% agree with you here. My point was that it is beneficial, and it is very safely beneficial (it rarely hurts performance, in contrast to CPU occlusion culling techniques). But the maximum potential benefit is quite limited on average. I’m oversimplifying here again, but if your frame times broke down to 1 ms pre-pass, 6 ms forward+, and 4 ms post-process, then I believe GPU Occlusion Culling would only be able to save up to 2 ms.

I’m sorry to hear this. But thank you for being honest. That’s a lot of pressure, and if there are ways some of us in the community can be helpful, know that we are eager.

In any case, I wish you the best!

Indeed, I left out some details in my implementation, but I do account for shadow maps and vertex shaders. Though I will admit I haven’t thought much about raytracing, largely because I don’t think BRG supports it.

And yes, I think this could be integrated into GPU Resident Drawer or whatever the future merged solution you are working on will be called. But the sticky point is that for this optimization to work, you need direct access to the ECS data after culling. That door could get shut if you moved culling to a separate thread to do n-1. But if you pushed culling and sync as early in the process as possible on the main thread, and then offloaded to the render thread right after, then I think you get the best of everything. And I would be in full support of such a solution!

But again, all I was trying to do was bring awareness that this optimization exists and has a major impact. So thank you so much for hearing me out and even replying! That means a lot!

2 Likes

But the maximum potential benefit is quite limited on average. I’m oversimplifying here again, but if your frame times broke down to 1 ms pre-pass, 6 ms forward+, and 4 ms post-process, then I believe GPU Occlusion Culling would only be able to save up to 2 ms.

I would say it’s very content dependent. Some scenes should really like it, but some definitely not that much.

I’m sorry to hear this. But thank you for being honest. That’s a lot of pressure, and if there are ways some of us in the community can be helpful, know that we are eager.

In any case, I wish you the best!

Oh no worries about the pressure! Honestly it’s been completely fine, we don’t do crazy hours. We just do a bit less until we get new people onboard.

But the sticky point is that for this optimization to work, you need direct access to the ECS data after culling. That door could get shut if you moved culling to a separate thread to do n-1

Oh right, you were referring to that. I would say don’t worry too much about it for now. Those things are far in the future, probably not even on the Unity Next timeline. Just internal discussions and intent for now. In any case, we will ask for feedback from the community when the time comes. Whatever we end up with should definitely allow for such optimizations one way or another, I think. Thanks for bringing it up!

3 Likes

Please, it would be really nice to get some of this information about the GPU culler added to the docs!

I was wondering if there’s any way for us to hook our own custom systems (e.g. terrain or grass systems) into this GPU instancing rendering path?

I have this Terrain system I’m working on which generates all its geometry on the GPU and draws it with Graphics.RenderInstanced:

  1. The chunks of the terrain are generated in compute shaders at the maximum possible size.
  2. I allocate a geometry buffer (for both quads and vertices) that will contain all possible terrain geometry. I divide this buffer into sections, called meshlets, of about 512 quads each.
  3. Using the number of quads generated, I allocate N meshlets on the GPU and copy my generated geometry into them. I also generate an AABB for each meshlet during the copy.

This way I can have variable size meshes generated at runtime on the GPU, but now I have to draw all of them.
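For reference, here is a hypothetical sketch of the kind of per-meshlet record that setup implies (names are assumptions, not an existing Unity or engine type): each meshlet carries its range within the shared geometry buffer plus the AABB generated during the copy, which is what any culling pass would test before drawing it.

```csharp
using Unity.Mathematics;

// Hypothetical per-meshlet record implied by the setup above (names are assumptions):
// the range of quads inside the shared geometry buffer plus the AABB generated during
// the GPU copy, which a culling pass (CPU or GPU, frustum or Hi-Z) would test.
public struct MeshletInfo
{
    public uint   firstQuad; // offset into the shared quad/index buffer
    public uint   quadCount; // up to ~512 quads per meshlet
    public float3 aabbMin;
    public float3 aabbMax;
}
```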

I’m not sure how I’m supposed to implement HiZ culling on my meshlets. I would like certain meshlets to be able to cull others. I would also want my meshlets to cull other GameObjects, or my grass system.

Here are my questions:

  1. Why a bounding sphere and not an AABB when comparing against the previous frame’s depth buffer?
  2. Are you comparing against the depth buffer directly, or against the previous frame’s depth pyramid?
  3. Why not have a visibility bit buffer instead of comparing against the previous frame?
  4. Is there any way to hook my system into yours and have my meshlets rendered into the depth pyramid?
  5. What’s stopping Unity from following the GPU-driven rendering approaches we already had 10 years ago, using geometry clusters? This seems like it would allow large meshes to also benefit from GPU culling instead of just the small instances. Of course you’d replace Ubisoft’s culling solution with the far superior two-pass culling method.