Hello everyone! I hope you are all having a good week!
We recently had the Graphics Office Hours here, and during that time, some things were revealed about the future of ECS rendering. While most people I talked to seemed to be in favor, a lot of the reasons proposed contradicted my own experiences. I myself am very concerned that Unity might be making a big misstep.
The purpose of this post first-and-foremost is to educate. I in no way want to discredit the Unity engineers or decision-makers. I suspect very few people in the graphics industry know what I will be sharing. In fact, I discovered this by accident in an attempt to solve a different problem. While I will try to explain this in understandable language, I suspect this will likely go over many people’s heads. @VincentBreysse I saw you were answering many questions related to the future of ECS Rendering, and I suspect you may find this interesting. I hope you don’t mind the ping.
Also, as a disclaimer, it is very possible that there are things I am wrong about. I am eager to learn!
Background
I’ll first explain some of the concepts behind Unity’s high-performance rendering technologies. If you are already familiar with them, you may skip this section.
For a more complete background, I strongly suggest watching this video: https://www.youtube.com/watch?v=6LzcXPIWUbc
Unity has two high-performance rendering technologies, those being the Entities Graphics package and the GPU Resident Drawer. These two technologies can coexist, but do not interact with each other in any way. Entities Graphics handles entities. GPU Resident Drawer handles Game Objects. Both of these technologies are backed by a low-level API known as BatchRendererGroup (BRG for short).
BRG combines both instancing and batching. The idea is that you can store many objects’ rendering data persistently on the GPU across multiple frames. However, unlike traditional instancing in Unity, you don’t have to rearrange and compact arrays down to just what is supposed to be rendered each frame. Instead, you specify which instances you want to draw, as well as their configurations, and BRG will take care of the rest. This allows other features like LODs and culling to participate in choosing which instances get drawn. And all of this gets jobified for maximum performance.
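To make that a little more concrete, here is a heavily trimmed sketch of what driving BRG directly looks like: register the resources once, then answer a culling callback every frame. This is not a complete renderer (the per-instance data buffer and the draw command output are omitted), just the shape of the API, and the field names are my own.

```csharp
using System;
using Unity.Jobs;
using UnityEngine;
using UnityEngine.Rendering;

public class MinimalBrgSketch : MonoBehaviour
{
    BatchRendererGroup m_brg;
    BatchMeshID        m_meshID;
    BatchMaterialID    m_materialID;

    [SerializeField] Mesh     m_mesh;
    [SerializeField] Material m_material;

    void OnEnable()
    {
        // BRG calls OnPerformCulling whenever a camera needs draw commands.
        m_brg        = new BatchRendererGroup(OnPerformCulling, IntPtr.Zero);
        m_meshID     = m_brg.RegisterMesh(m_mesh);
        m_materialID = m_brg.RegisterMaterial(m_material);
        // A real setup would also allocate a GraphicsBuffer holding per-instance
        // data (transforms, material overrides) and register it as a batch.
    }

    void OnDisable() => m_brg.Dispose();

    JobHandle OnPerformCulling(BatchRendererGroup rendererGroup,
                               BatchCullingContext cullingContext,
                               BatchCullingOutput cullingOutput,
                               IntPtr userContext)
    {
        // This is where culling, LOD selection, and filtering decide which of the
        // persistent instances get drawn this frame, typically by scheduling jobs
        // that write visible instance indices into cullingOutput.drawCommands.
        return default;
    }
}
```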
BRG was first rolled out for Entities Graphics, but it was then discovered that it could be used to make Game Object rendering faster too. And that’s how the GPU Resident Drawer became a thing. However, these two higher-level technologies have not evolved with the same features. Entities Graphics has a powerful per-instance material override system. And GPU Resident Drawer has LOD Crossfade and GPU Occlusion Culling. Naturally, I think everyone wants to see a little more convergence of these features. However, I encourage you all to not just consider the benefits, but also the costs.
A Warm-Up on GPU Occlusion Culling
This isn’t really what I want to talk about, but I worry that if I don’t, people will try to counter-argue using misinformation.
Not all occlusion culling is created equal!
Unity is a shader-heavy engine. Unity shaders generate many different variants. Besides the ones Unity provides out-of-the-box, tools like Shader Graph generate even more shaders and variants, plus all the third-party assets that ship with their own shaders. Juggling all these shaders and their properties and bindings incurs costs, and these costs are paid during the setup of draw calls. And while the SRP Batcher tries really hard to reduce them, there are still significant costs here in a production project.
The way GPU Occlusion Culling works is that all candidate objects for rendering have their bounding boxes sorted and organized, and the meshes with the largest and closest bounding boxes relative to the camera are rendered into an occlusion depth buffer. Then, a compute shader tests the bounding boxes of the remaining instances and marks them as visible or invisible in a GPU buffer.
What is important here is that while a GPU buffer contains info about which instances are visible, the CPU is still responsible for commanding the draw call. It can say “draw whichever instances in this buffer are visible once the compute shader is done”. However, GPU Occlusion Culling does NOT reduce draw calls! It does not eliminate the shader setup and all the variable bindings. All those costs still exist. What it reduces is the number of triangles that get rasterized.
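To illustrate the kind of per-instance test that compute shader performs, here is a plain C# sketch of the idea. This is not Unity’s actual implementation; the depth-pyramid sampler is a hypothetical stand-in, and I’m assuming a conventional 0 = near, 1 = far depth range for simplicity.

```csharp
using Unity.Mathematics;

public static class OcclusionTestSketch
{
    // Hypothetical: returns the farthest (most conservative) occluder depth stored
    // in the occlusion depth pyramid over the given screen-space rectangle (in NDC).
    public delegate float SampleMaxDepth(float2 rectMin, float2 rectMax);

    // Marks an instance visible if any part of its bounds could be in front of the
    // already-rendered occluders. Note: this only decides a flag in a GPU buffer;
    // the CPU still issues the draw call either way.
    public static bool IsPotentiallyVisible(in float3 aabbMin, in float3 aabbMax,
                                            in float4x4 viewProj,
                                            SampleMaxDepth sampleMaxDepth)
    {
        float2 rectMin      = new float2(float.MaxValue);
        float2 rectMax      = new float2(float.MinValue);
        float  nearestDepth = 1f;  // assuming 0 = near, 1 = far

        // Project all 8 corners of the bounding box.
        for (int i = 0; i < 8; i++)
        {
            float3 corner = new float3((i & 1) != 0 ? aabbMax.x : aabbMin.x,
                                       (i & 2) != 0 ? aabbMax.y : aabbMin.y,
                                       (i & 4) != 0 ? aabbMax.z : aabbMin.z);
            float4 clip = math.mul(viewProj, new float4(corner, 1f));
            if (clip.w <= 0f)
                return true;  // behind/crossing the near plane, conservatively visible

            float3 ndc   = clip.xyz / clip.w;
            rectMin      = math.min(rectMin, ndc.xy);
            rectMax      = math.max(rectMax, ndc.xy);
            nearestDepth = math.min(nearestDepth, ndc.z);
        }

        // Visible if the closest point of the box is nearer than the farthest
        // occluder depth covering its screen rectangle.
        return nearestDepth <= sampleMaxDepth(rectMin, rectMax);
    }
}
```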
When rendering a scene, most of the cost falls to texturing, lighting, and post-processing. Vertex processing and rasterization, while not completely free, usually are not as heavy (this assumes your LODs are set up correctly to avoid the tiny-triangle overdraw problem). If you do these calculations more than once per pixel (overdraw), that can add up. This is why many rendering engines opt to use a “depth pre-pass”. What this does is rasterize all the geometry once, without any of the complex texturing and lighting calculations, to set up the depth buffer, and then rasterize all the geometry again, this time only shading the pixels that end up at the same depth as what is already in the depth buffer. These are the pixels known to actually be visible, where texturing and lighting will matter.
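As a back-of-envelope illustration of why that pays off, here is a toy cost model with entirely made-up numbers. The only point is that when overdraw is high and fragment shading dominates, paying for an extra cheap depth-only pass is cheaper than shading every covered pixel.

```csharp
using UnityEngine;

// Toy cost model with made-up numbers, only to illustrate the shape of the
// depth pre-pass tradeoff. None of these values come from real measurements.
public static class DepthPrepassToyModel
{
    public static void Compare()
    {
        const float overdraw      = 3f;  // assumed avg times each pixel is covered
        const float shadeCost     = 10f; // assumed relative cost of full texturing/lighting per pixel
        const float rasterizeCost = 1f;  // assumed relative cost of depth-only rasterization per pixel

        // Without pre-pass: every covered pixel is fully shaded.
        float withoutPrepass = overdraw * (rasterizeCost + shadeCost);

        // With pre-pass: rasterize everything twice, but shade each pixel only once.
        float withPrepass = overdraw * rasterizeCost            // depth-only pass
                          + overdraw * rasterizeCost            // main pass rasterization
                          + 1f * shadeCost;                     // shading of visible pixels only

        Debug.Log($"Relative cost without pre-pass: {withoutPrepass}, with pre-pass: {withPrepass}");
        // Prints roughly 33 vs 16 with these made-up numbers.
    }
}
```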
The same hardware that handles the indirect drawing used by occlusion culling well also tends to be the hardware that benefits most from a depth pre-pass. And if it isn’t obvious, depth pre-pass is typically the stronger optimization of the two.
When both techniques are used together, GPU Occlusion Culling is reducing vertices and triangles that would have been caught and eliminated by the depth pre-pass anyway. This is why in many projects, GPU Occlusion Culling combined with a depth pre-pass usually only offers a few percent of improvement. It is worth using if you have it, as it is a very safe optimization. But it is not optimizing the slowest part of rendering. It tends to be especially good when you have lots of small instances, such as vegetation, or when your vertex shader is more complex, such as water or GPU animation textures. It would also be great for skinned meshes if it weren’t for the fact that Unity opts to use compute shaders for skinning. There are definitely some projects out there that will see massive gains from GPU Occlusion Culling. But this isn’t the norm.
Compare all this to CPU Occlusion Culling, which, while it has many of its own problems, is capable of completely eliminating draw calls and shader setup.
The takeaway here is that while having occlusion culling would be nice for ECS rendering, it is not as big of an optimization as people seem to believe it would be. And one should be wary about what other sacrifices may be made to have it.
GPU Dynamic Upload Culling
Alright. It is time to talk about a big optimization that Unity is missing. Some of you may remember a few months back I shared this video. It is time I explain one of the biggest pieces behind it.
I want to focus on that very first scenario, the 1 million spinning cubes. I remember when I first shared this, it baffled some people. Some thought that this had to do with my custom transform system.
I’ve updated both test projects to Unity 6 LTS. Let’s look at the profiler for each.
Vanilla Unity ECS
Latios Framework
In both versions, you’ll notice there is a marker on the main thread titled Gfx.WaitForPresentOnGfxThread. That means we are GPU-bound. But why is vanilla Unity so much worse?
The scenes are identical. The same camera, the same number of cubes visible, the same shadow settings, the same shaders on the cubes, it is all the same.
The transform systems being used are different. Unity has LocalTransform and LocalToWorld, the latter being 64 bytes in size, whereas the Latios Framework has a single QVVS WorldTransform that is only 48 bytes in size. Perhaps the smaller size means less data sent to the GPU?
Not at all. Entities Graphics chops off the bottom row of each matrix, sending each as a float3x4, which is also 48 bytes. The same amount of data is uploaded per transform either way.
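For reference, that trick works because the bottom row of an affine TRS matrix is always (0, 0, 0, 1), so dropping it loses nothing. A minimal sketch of the packing using Unity.Mathematics:

```csharp
using Unity.Mathematics;

public static class TransformPackingSketch
{
    // An affine LocalToWorld's bottom row is always (0, 0, 0, 1), so only the
    // top 3 rows need to travel to the GPU: 4 columns x 12 bytes = 48 bytes
    // instead of 64.
    public static float3x4 PackAffine(in float4x4 localToWorld)
    {
        return new float3x4(localToWorld.c0.xyz,
                            localToWorld.c1.xyz,
                            localToWorld.c2.xyz,
                            localToWorld.c3.xyz);
    }
}
```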
But float3x4 is a different form than a QVVS. How does that get resolved?
Both Entities Graphics and the Latios Framework upload transforms and other material properties via a compute shader known as the Sparse Uploader. This compute shader is responsible for receiving buffers full of new values and scattering them into the persistent storage buffer. Hence the “sparse”. It will also compute inverses of matrices to avoid having to upload both the LocalToWorld and the WorldToLocal. The Latios Framework uses a modified version that can convert QVVS transforms into float3x4 prior to the inverting and storage. So really, the Latios Framework should be slightly more expensive on the GPU.
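For the curious, the conversion is conceptually along these lines. This is a CPU-side C# sketch only; the QVVS layout shown (rotation, position, per-axis stretch, uniform scale) is my paraphrase, not the framework’s actual struct or uploader code.

```csharp
using Unity.Mathematics;

// Paraphrased QVVS layout: quaternion rotation, position, per-axis stretch,
// and a uniform scale. Not the Latios Framework's actual struct.
public struct QvvsSketch
{
    public quaternion rotation;
    public float3     position;
    public float3     stretch;
    public float      scale;
}

public static class QvvsConversionSketch
{
    // Builds the same kind of float3x4 an uploaded LocalToWorld would produce,
    // so downstream GPU code doesn't care which representation it came from.
    public static float3x4 ToFloat3x4(in QvvsSketch t)
    {
        float3x3 r = new float3x3(t.rotation);
        float3   s = t.stretch * t.scale;

        // Scale each rotation column, then append translation as the 4th column.
        return new float3x4(r.c0 * s.x,
                            r.c1 * s.y,
                            r.c2 * s.z,
                            t.position);
    }
}
```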
While the transform systems may be different and impact CPU performance, they aren’t the reason for the GPU performance discrepancy. For that, we need to look at the jobs responsible for copying the transforms into the Sparse Uploader buffers.
Vanilla Unity ECS
Latios Framework
Notice that in the Latios Framework, the job runs later in the frame, and has a much smaller duration. The smaller duration comes from copying less data to the buffers. And if there is less data going into the buffers for upload, that means the GPU is receiving less data. Less data means less work, and that’s where the GPU is saving time.
The algorithm Entities Graphics uses for choosing what to upload is simple. If a material property has changed since last frame (LocalToWorld is considered a material property), then Entities Graphics uploads it. Because all 1 million cubes are rotating every frame, all 1 million transforms get uploaded every frame.
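In ECS terms, that policy boils down to a per-chunk change-version test, roughly like this simplified sketch (not the actual Entities Graphics job):

```csharp
using Unity.Burst;
using Unity.Burst.Intrinsics;
using Unity.Collections;
using Unity.Entities;
using Unity.Transforms;

// Simplified sketch of the "upload if changed since last frame" policy.
// Not the real Entities Graphics code; it only shows the decision being made.
[BurstCompile]
struct GatherChangedTransformsJob : IJobChunk
{
    [ReadOnly] public ComponentTypeHandle<LocalToWorld> localToWorldHandle;
    public uint lastSystemVersion;
    public NativeList<LocalToWorld>.ParallelWriter uploadQueue;

    public void Execute(in ArchetypeChunk chunk, int unfilteredChunkIndex,
                        bool useEnabledMask, in v128 chunkEnabledMask)
    {
        // Skip chunks whose LocalToWorld hasn't been written since last frame.
        if (!chunk.DidChange(ref localToWorldHandle, lastSystemVersion))
            return;

        // Every entity in a changed chunk gets queued for upload, visible or not.
        // With 1 million spinning cubes, that is everything, every frame.
        var transforms = chunk.GetNativeArray(ref localToWorldHandle);
        for (int i = 0; i < transforms.Length; i++)
            uploadQueue.AddNoResize(transforms[i]);
    }
}
```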
But most of the cubes are outside the camera frustum, so most of this work is wasted: by the time those cubes become visible, their transforms will be outdated anyway.
The Latios Framework keeps track not only of when material properties have changed, but also of whether or not it has uploaded those material properties since they last changed. With this, and by only considering chunks with entities that passed frustum culling in a given frame for upload, it is able to reduce the amount of data sent. And yes, I do this at chunk granularity currently, though I think I could make this work at entity granularity in the future.
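A rough sketch of that combined filter might look like the following. The per-chunk bookkeeping (the visibility flag and the “dirty but not yet uploaded” flag) is my own illustrative stand-in rather than the framework’s actual data layout, and I’m using LocalToWorld here just to keep the sketch vanilla-flavored.

```csharp
using Unity.Burst;
using Unity.Burst.Intrinsics;
using Unity.Collections;
using Unity.Entities;
using Unity.Transforms;

// Illustrative per-chunk bookkeeping, not the Latios Framework's actual layout.
struct ChunkUploadState
{
    public bool dirtySinceLastUpload;  // set whenever the transforms change
    public bool passedFrustumCulling;  // set by this frame's culling pass
}

[BurstCompile]
struct GatherVisibleChangedTransformsJob : IJobChunk
{
    [ReadOnly] public ComponentTypeHandle<LocalToWorld> localToWorldHandle;
    public uint lastSystemVersion;

    // Indexed by unfilteredChunkIndex and persisted across frames.
    // Each chunk only ever touches its own slot, hence the attribute.
    [NativeDisableParallelForRestriction]
    public NativeArray<ChunkUploadState> chunkStates;
    public NativeList<LocalToWorld>.ParallelWriter uploadQueue;

    public void Execute(in ArchetypeChunk chunk, int unfilteredChunkIndex,
                        bool useEnabledMask, in v128 chunkEnabledMask)
    {
        var state = chunkStates[unfilteredChunkIndex];

        // Remember that the chunk is dirty even if we skip uploading it this frame...
        if (chunk.DidChange(ref localToWorldHandle, lastSystemVersion))
            state.dirtySinceLastUpload = true;

        // ...but only actually upload once it survives frustum culling.
        if (state.dirtySinceLastUpload && state.passedFrustumCulling)
        {
            var transforms = chunk.GetNativeArray(ref localToWorldHandle);
            for (int i = 0; i < transforms.Length; i++)
                uploadQueue.AddNoResize(transforms[i]);
            state.dirtySinceLastUpload = false;
        }

        chunkStates[unfilteredChunkIndex] = state;
    }
}
```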
In practice, the algorithm Entities Graphics uses is great when most entities are static. However, in worlds densely populated with dynamic entities, this “keep the whole world in sync” approach can be quite expensive. This just so happens to be my use case. That’s why I use ECS.
Other game engines chase after the buzzword term “GPU-driven” in which the GPU starts to become responsible for choosing what to draw, including basic things like culling, LODs, and filtering. Ever since discovering the impact syncing the whole world has, I’ve become wary of these techniques. My concern is that Unity may be headed down this same GPU-driven path.
From what I can gather, future changes might prevent the ability to do this kind of granular change tracking, due to isolating the data to another thread. And because Unity already does the “sync-the-world” thing in both Entities Graphics and GPU Resident Drawer, anyone using those technologies wouldn’t know they are missing out. But I know, and hopefully by writing this post, you know too.
2X performance is a big deal!
Thanks for reading!