GPU-Driven Rendering

Hey,

I've recently been working on implementing a 'GPU-Driven Rendering Pipeline', aiming to maximize GPU utilization and minimize draw calls through indirect rendering. However, I'm lacking some knowledge and have a few questions about setting up certain aspects.

I don't know how to paste a video, so here's a GIF of the current setup: Screen capture - 97706ec1ad44bdaf85a22d47489b0bbe - Gyazo

Current Setup:

  • All drawing data (positions, visibility indexes, etc.) is stored in a few large buffers created at startup.
  • Compaction is performed once for all objects. We compact all visible instances, sort them by batch ID, count each batch's size, and then consolidate these sizes into a large indirect command buffer. This buffer is used with Graphics.RenderMeshIndirect() with the appropriate command offset (a sketch of this args-fill step follows this list). This approach is similar to what's described here: GPU-Driven Engines.
  • LOD picking and frustum culling are done per prototype object. I initially used Unity's append buffers with counters but switched to prefix-sum compaction for better stability (as in this DX12 implementation: GPUPrefixSums).
  • The system works with dynamic placement similar to the one presented in Horizon Zero Dawn, and also with a renderer component attached to a GameObject.
  • Everything is rendered using Graphics.RenderMeshIndirect(), as only this seems to properly support the velocity buffer in HDRP.
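For concreteness, here's a minimal HLSL sketch of the args-fill step mentioned above. It assumes a prior compaction/prefix-sum pass already produced per-batch counts and offsets; all buffer names here are mine. The 5-uint layout matches GraphicsBuffer.IndirectDrawIndexedArgs as consumed by Graphics.RenderMeshIndirect():

#pragma kernel FillIndirectArgs

StructuredBuffer<uint> _BatchCounts;    // visible instance count per batch (from compaction)
StructuredBuffer<uint> _BatchOffsets;   // exclusive prefix sum of _BatchCounts
StructuredBuffer<uint> _IndexCounts;    // index count of each batch's mesh/submesh
RWStructuredBuffer<uint> _IndirectArgs; // 5 uints per batch
uint _BatchCount;

[numthreads(64, 1, 1)]
void FillIndirectArgs(uint id : SV_DispatchThreadID)
{
    if (id >= _BatchCount)
        return;
    uint base = id * 5;
    _IndirectArgs[base + 0] = _IndexCounts[id];  // indexCountPerInstance
    _IndirectArgs[base + 1] = _BatchCounts[id];  // instanceCount
    _IndirectArgs[base + 2] = 0;                 // startIndex
    _IndirectArgs[base + 3] = 0;                 // baseVertexIndex
    // startInstance: note that whether this offsets SV_InstanceID is
    // API-dependent, so you may also need to pass it as a per-draw constant.
    _IndirectArgs[base + 4] = _BatchOffsets[id];
}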

Questions:

  • Handling dynamic position/visibility buffers: How should large buffers for positions/visibility be managed when they might change each frame? I'm currently allocating one large buffer sized for a maximum count of, say, 5 million objects. However, this is not always efficient: it either over-allocates or forces a resize of the mesh-pass buffer when adding just a few objects.

  • Buffer organization: Is it better to use large shared buffers or to divide data/computation by mesh pass/prototype/batch? While processing many objects at once seems faster, it requires additional steps. For instance, I currently perform frustum culling per prototype/object, which scales easily in the compute shader. A more 'global' approach would need extra buffers for object references and bounding-box data, potentially leading to frequent updates and larger buffer sizes, which seems inefficient.

  • Using prefix sum with bit masks: I'm exploring the use of a 32-bit uint per object as a visibility index, which is overkill since only 1 bit is needed to mark visibility. In theory we could store the visibility of 32 objects in one uint, significantly shrinking the 'visibility' buffers. However, integrating this with prefix-sum compaction appears challenging (see the sketch after this list for one option).

  • DX12 :frowning: : I found many algorithms that speed up processing by using wave intrinsics, which are only available in DX12 in Unity. However, the performance of DX12, especially with HDRP, remains a mystery, and from what I read it can often be much worse than DX11.
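On the bit-mask question specifically, one pattern that keeps things DX11-friendly (a sketch only, untested against this setup): let the culling pass set bits with InterlockedOr, run the prefix sum at word granularity (32x fewer elements), and recover the intra-word offset with countbits():

#pragma kernel CompactVisible

StructuredBuffer<uint> _VisibilityBits;  // 1 bit per object, written by the culling pass
StructuredBuffer<uint> _WordOffsets;     // exclusive prefix sum of countbits() per word
RWStructuredBuffer<uint> _Compacted;     // compacted visible object indices
uint _ObjectCount;

[numthreads(64, 1, 1)]
void CompactVisible(uint id : SV_DispatchThreadID) // one thread per object
{
    if (id >= _ObjectCount)
        return;
    uint word = id / 32;
    uint bits = _VisibilityBits[word];
    uint bit  = id & 31;
    if ((bits >> bit) & 1)
    {
        // Rank of this bit within its word + sum of set bits in earlier words.
        uint rank = countbits(bits & ((1u << bit) - 1u));
        _Compacted[_WordOffsets[word] + rank] = id;
    }
}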

Any suggestions, ideas, or architecture references are greatly appreciated. Thanks!


Been rolling my own version of this stuff, here’s what I’m doing:

For any given object set, a list of Matrix4x4 is passed to a compute shader.

The output of this compute shader is 8 structured buffers of type uint, with an index into the Matrix4x4 array packed into 24 bits and the remaining 8 bits used to pack the -1 to 1 LOD crossfade value. These structured buffers represent up to 4 LOD meshes, each split into objects which need to be drawn for shadows and objects that need to be drawn for visibility.
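For illustration, the 24/8 packing could look like this in HLSL (function names are mine, and the exact crossfade encoding is my guess at a simple [-1,1] to [0,255] mapping):

uint PackVisible(uint index, float crossFade) // crossFade in [-1, 1]
{
    uint fade = (uint)round(saturate(crossFade * 0.5 + 0.5) * 255.0);
    return (index & 0x00FFFFFF) | (fade << 24); // 24-bit index, 8-bit fade
}

void UnpackVisible(uint packed, out uint index, out float crossFade)
{
    index = packed & 0x00FFFFFF;
    crossFade = (packed >> 24) / 255.0 * 2.0 - 1.0;
}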

The compute shader does frustum and Hi-Z culling, for both visibility and potential shadow visibility; the latter is computed by projecting the bounds along the main light direction.
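The swept-bounds shadow test can be as simple as extruding the instance AABB along the light direction before the usual frustum test. A sketch under my own naming assumptions (planes assumed to have inward-facing normals):

float4 _FrustumPlanes[6]; // xyz = inward plane normal, w = distance
float3 _LightDir;         // normalized main light direction
float  _ShadowSweep;      // how far to extrude toward receivers

bool AabbInFrustum(float3 bmin, float3 bmax)
{
    [unroll]
    for (int i = 0; i < 6; i++)
    {
        float3 n = _FrustumPlanes[i].xyz;
        // p-vertex: the AABB corner farthest along the plane normal.
        float3 p = float3(n.x >= 0 ? bmax.x : bmin.x,
                          n.y >= 0 ? bmax.y : bmin.y,
                          n.z >= 0 ? bmax.z : bmin.z);
        if (dot(n, p) + _FrustumPlanes[i].w < 0)
            return false;
    }
    return true;
}

bool ShadowCasterVisible(float3 bmin, float3 bmax)
{
    float3 sweep = _LightDir * _ShadowSweep;
    return AabbInFrustum(min(bmin, bmin + sweep), max(bmax, bmax + sweep));
}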

The list of Matrix4x4 is converted into a list of packed data:

struct PackedTransform
{
    float3 position;
    uint3 quatScale; // quaternion and scale packed into 16 bits each
};

The quat is packed taking advantage of the -1 to 1 range of quaternion components, and the scale uses half precision.

This is then drawn for each submesh/material combination in each LOD, with and without shadows. In the shader, the packed index is used to look up the right packed transform and reconstruct the matrix.
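The reconstruction step might look roughly like this; note the layout inside quatScale is my assumption (snorm16 quaternion in .xy, a half uniform scale in .z), so adapt to the actual packing:

StructuredBuffer<PackedTransform> _Transforms;

float4 UnpackQuatSnorm16(uint2 p)
{
    uint4 u = uint4(p.x & 0xFFFF, p.x >> 16, p.y & 0xFFFF, p.y >> 16);
    return (float4)u / 65535.0 * 2.0 - 1.0; // [0, 65535] -> [-1, 1]
}

float4x4 LoadObjectToWorld(uint packedId)
{
    PackedTransform t = _Transforms[packedId & 0x00FFFFFF]; // low 24 bits = index
    float4 q = normalize(UnpackQuatSnorm16(t.quatScale.xy));
    float s = f16tof32(t.quatScale.z & 0xFFFF); // assumed uniform scale as half

    // Standard quaternion -> rotation basis, scaled.
    float3 c0 = float3(1 - 2 * (q.y * q.y + q.z * q.z), 2 * (q.x * q.y + q.z * q.w), 2 * (q.x * q.z - q.y * q.w)) * s;
    float3 c1 = float3(2 * (q.x * q.y - q.z * q.w), 1 - 2 * (q.x * q.x + q.z * q.z), 2 * (q.y * q.z + q.x * q.w)) * s;
    float3 c2 = float3(2 * (q.x * q.z + q.y * q.w), 2 * (q.y * q.z - q.x * q.w), 1 - 2 * (q.x * q.x + q.y * q.y)) * s;

    return float4x4(float4(c0.x, c1.x, c2.x, t.position.x),
                    float4(c0.y, c1.y, c2.y, t.position.y),
                    float4(c0.z, c1.z, c2.z, t.position.z),
                    float4(0, 0, 0, 1));
}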

Draws get preculled on the CPU in a job, depending on what they are. For instance, terrain details are already broken into patches, so I stream them in and out and do a frustum cull on the patches in a job. This means I only need to draw about 20 patches per terrain (with a detail distance of 100 meters) even though a terrain has 1024 patches by default.

When I started, I used the old Vegetation Studio format for this stuff, but that required ~125 megs of buffers for 100k instances. With this new format, I'm down to 2.5MB instead, with each instance being 24 bytes of data plus 4 bytes for the indirection buffer. Note that I currently don't do any kind of combining of buffers, so those 20 patches rendering grass would be 40 draw calls (one for shadows, one for visibility, assuming no LODs). Trees and such are drawn once per terrain, which doesn't make sense to optimize because I want per-terrain data to be available for them.

What I want to explore going forward:

  • Combining patches to reduce total draw calls on the grass
  • Is there any advantage to combining mesh data from multiple objects/LODs? It seems like you still need to submit one call per mesh/material combination regardless of whether they are in the same buffer or not.
  • Is it worth it to only convert the Matrix4x4 → packed transforms on things that pass the visibility buffer? I could pass in all the visibility buffers, zip through them, and only convert the data on visible instances. However, currently things can appear in buffers more than once (LOD crossfade), so some things would be transformed more than once. And since buffers are not combined, this ends up adding a dispatch for each list per frame, so I'm not sure if this is the way or not.

This is impressive. I’ve already used 3x4 matrices but never considered compressing them this much. This changes a lot. Thanks!

Perhaps my attempt to draw all similar meshes in a single draw call is misguided and not worth it. I need to test this, but a combination of precomputed CPU culling and well-organized patches/chunks might suffice. This also eliminates the problem of sharing one large buffer for all instances.
One thing that I like from Horizon Zero Dawn is the dynamic scattering of vegetation as the camera moves, keeping the buffer size for all these instances constant and allowing a single draw call per submesh. In the GIF, I've highlighted the 'chunks' responsible for scattering new vegetation positions. This is also quite efficient, since I only refresh new chunks, and only if their positions change. However, this requires additional work on density and weight maps.

I haven't measured this precisely, but I've noticed that it can significantly reduce the number of SetPass calls. Each submesh still requires a separate draw call, but from what I've seen in RenderDoc, the data required for drawing (such as the mesh, texture array, etc.) is only sent during the first call.


The bet is that bandwidth < ALU on most GPUs these days. Note that you can pack a Matrix3x4 into 30 bytes using half precision for the rotation/scale, which needs less ALU for the unpacking; but since that uses traditional half precision you'd get more quantization in the rotation - likely fine for a lot of stuff though. The nice thing about the quaternion packing is that quaternions are -1 to 1, so you can pack taking advantage of that and get 65k values between -1 and 1, which is very high precision, and it's slightly smaller at 24 bytes.
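For reference, the pack side of the snorm16 trick is just the inverse mapping (a sketch; the same math works CPU-side), and the 30-byte half variant would use f32tof16 per component instead:

uint2 PackQuatSnorm16(float4 q) // all components in [-1, 1]
{
    uint4 u = (uint4)round(saturate(q * 0.5 + 0.5) * 65535.0); // [-1,1] -> [0,65535]
    return uint2(u.x | (u.y << 16), u.z | (u.w << 16));
}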

The main reason I'm writing this system is flexibility in rendering I can't get from Unity: basically being able to pass per-terrain data to detail meshes, and to render details on meshes and such as well. But since some users will want an exact match with Unity terrain, I read the patch data from it instead of doing my own scatter. (I might add my own option eventually, as the Unity terrain API is terrible and allocates a new array for this data instead of just returning a pointer to its internal version as a NativeArray, or even taking a preallocated list.) So that basically forces me into using chunks, as I need to load them in and out from the Unity terrain as you move around; loading all of a terrain is about 20MB of allocation.

But I am considering combining the chunks on the rendering end. Since the indirection buffer is just 4 bytes per instance, it should be very fast to append them all together. If you're rendering, say, 5 detail objects in an area and 20 chunks need to be loaded, that reduces draw calls from 100 to 20. And you only need to do this when chunks are being loaded in or out, not every frame. Another thing worth experimenting with is culling chunks with Hi-Z culling in a pass before culling each instance, as this would eliminate the need to do culling on lots of individual objects; but this might be overkill, as it only helps best-case scenarios where you have lots of occlusion (a sketch of such a chunk test follows).
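The chunk-level Hi-Z test itself is cheap. A sketch, assuming a reversed-Z depth pyramid where each downsample keeps the min (farthest) depth, and that the chunk bounds were already projected to a screen-space UV rect plus a conservative nearest depth (all names here are mine):

Texture2D<float> _HiZ;            // depth pyramid, reversed-Z, min-filtered mips
SamplerState sampler_point_clamp;
float2 _HiZResolution;            // width/height of mip 0

bool ChunkVisibleHiZ(float2 uvMin, float2 uvMax, float nearestDepth)
{
    // Pick the mip where the rect is covered by roughly 2x2 texels.
    float2 texels = (uvMax - uvMin) * _HiZResolution;
    float mip = max(0, ceil(log2(max(texels.x, texels.y) * 0.5)));

    float4 d;
    d.x = _HiZ.SampleLevel(sampler_point_clamp, uvMin, mip);
    d.y = _HiZ.SampleLevel(sampler_point_clamp, float2(uvMax.x, uvMin.y), mip);
    d.z = _HiZ.SampleLevel(sampler_point_clamp, float2(uvMin.x, uvMax.y), mip);
    d.w = _HiZ.SampleLevel(sampler_point_clamp, uvMax, mip);

    // Visible if the chunk's nearest point is in front of the farthest occluder.
    float farthestOccluder = min(min(d.x, d.y), min(d.z, d.w));
    return nearestDepth >= farthestOccluder;
}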

So yeah, if I was generating everything at draw time, I’d likely not use chunks because you can just generate everything you need within view and be done with it.

Ah, so you're combining materials by combining the textures into an array and requiring that they use the same shader/variant? Is that providing an advantage, given the content is usually 1 mesh → 1 material anyway and you have to draw for each mesh?