Hey,
I’ve recently been working on implementing a GPU-driven rendering pipeline, aiming to maximize GPU utilization and minimize draw calls through indirect rendering. However, I’m missing some knowledge and have a few questions about how to set up certain parts.
I don’t know how to embed a video, so here’s a GIF of the current setup: Screen capture - 97706ec1ad44bdaf85a22d47489b0bbe - Gyazo
Current Setup:
- All drawing data (positions, visibility indexes, etc.) is stored in a few large buffers created at startup.
- Compaction is performed once for all objects: we compact all visible instances, sort them by batch ID, count each batch’s size, and consolidate these sizes into one large indirect command buffer, which is then passed to Graphics.RenderMeshIndirect() with the appropriate command offset (see the command-buffer sketch after this list). This approach is similar to what’s described here: GPU-Driven Engines.
- LOD picking and frustum culling are done per prototype object. I initially used Unity append buffers with counters but switched to prefix-sum compaction for better stability (as in this DX12 implementation: GPUPrefixSums).
- The system works with dynamic placement similar to what was presented for Horizon Zero Dawn, and also with a renderer component attached to a GameObject.
- Everything is rendered with Graphics.RenderMeshIndirect(), as it seems to be the only path that properly supports the velocity buffer in HDRP.
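For context, here’s roughly what the command-buffer setup and draw call look like on the C# side. This is a simplified sketch rather than my actual code: the class and field names are placeholders, and in the real setup instanceCount/startInstance are written per frame by the compaction compute pass instead of on the CPU.

```csharp
using UnityEngine;

// Rough sketch of the indirect command-buffer setup for one prototype.
public class IndirectBatchDrawer : MonoBehaviour
{
    public Mesh mesh;
    public Material material;
    public int batchCount = 4; // e.g. one command per LOD/batch of this prototype

    GraphicsBuffer commandBuffer;

    void OnEnable()
    {
        commandBuffer = new GraphicsBuffer(
            GraphicsBuffer.Target.IndirectArguments,
            batchCount, GraphicsBuffer.IndirectDrawIndexedArgs.size);

        var commands = new GraphicsBuffer.IndirectDrawIndexedArgs[batchCount];
        for (int i = 0; i < batchCount; i++)
        {
            commands[i].indexCountPerInstance = mesh.GetIndexCount(0);
            commands[i].startIndex = mesh.GetIndexStart(0);
            commands[i].baseVertexIndex = mesh.GetBaseVertex(0);
            // In the real setup these two are overwritten each frame by the
            // GPU compaction pass (per-batch visible count + instance offset).
            commands[i].instanceCount = 0;
            commands[i].startInstance = 0;
        }
        commandBuffer.SetData(commands);
    }

    void Update()
    {
        var rp = new RenderParams(material)
        {
            // Coarse bounds so the renderer is never culled by mistake.
            worldBounds = new Bounds(Vector3.zero, 10000f * Vector3.one)
        };
        // Issue all batches of this prototype in one call, starting at command 0.
        Graphics.RenderMeshIndirect(rp, mesh, commandBuffer, batchCount, 0);
    }

    void OnDisable()
    {
        commandBuffer?.Release();
        commandBuffer = null;
    }
}
```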
Questions:
- Handling Dynamic Position/Visibility Buffers: How should large position/visibility buffers be managed when they might change each frame? I’m currently allocating one large buffer for a fixed maximum count of, say, 5 million objects. However, this isn’t always efficient: it either over-allocates or forces a resize of the mesh-pass buffer when a few objects are added (see the buffer-growth sketch after this list).
- Buffer Organization: Is it better to use large shared buffers, or to split data/computation by mesh pass/prototype/batch? Processing many objects at once seems faster, but it requires additional steps. For instance, I currently perform frustum culling per prototype/object, which scales easily in the compute shader. A more ‘global’ approach would need extra buffers for object references and bounding-box data, potentially leading to frequent updates and much larger buffers, which seems inefficient (see the per-instance record sketch after this list).
- Using Prefix Sums with Bit Masks: I’m exploring the use of a 32-bit uint per object as a visibility index, which may be overkill since only 1 bit is needed to mark visibility. In theory, the visibility of 32 objects could be packed into one uint, significantly shrinking the visibility buffers. However, integrating this with prefix-sum compaction appears challenging (see the bit-mask indexing sketch after this list).
- DX12 and Wave Intrinsics: I’ve found many algorithms that speed up processing with wave intrinsics, which are only available under DX12 in Unity. However, DX12 performance, especially with HDRP, remains a mystery to me, and from what I’ve read it can often be considerably worse than DX11.
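Regarding question 1, this is the kind of capacity-with-headroom scheme I’m considering. It’s only a sketch: GrowableGraphicsBuffer, the 1.5x growth factor, and the no-shrink policy are placeholders, not something I’ve settled on.

```csharp
using UnityEngine;

// Sketch: reallocate only when the needed count exceeds capacity,
// and grow with headroom so adding a few objects doesn't force a resize.
public class GrowableGraphicsBuffer
{
    public GraphicsBuffer Buffer { get; private set; }
    public int Capacity { get; private set; }

    readonly int stride;
    readonly GraphicsBuffer.Target target;

    public GrowableGraphicsBuffer(GraphicsBuffer.Target target, int initialCapacity, int stride)
    {
        this.target = target;
        this.stride = stride;
        Capacity = Mathf.Max(1, initialCapacity);
        Buffer = new GraphicsBuffer(target, Capacity, stride);
    }

    // Returns true if the buffer was reallocated (callers must re-bind and re-upload).
    public bool EnsureCapacity(int neededCount)
    {
        if (neededCount <= Capacity)
            return false;

        // Grow by 1.5x, or straight to neededCount if that's larger.
        int newCapacity = Mathf.Max(neededCount, (int)(Capacity * 1.5f));
        Buffer.Release();
        Buffer = new GraphicsBuffer(target, newCapacity, stride);
        Capacity = newCapacity;
        return true;
    }

    public void Release()
    {
        Buffer?.Release();
        Buffer = null;
    }
}
```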
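Regarding question 2, by ‘extra buffers for object references and bounding-box data’ I mean something like the per-instance record below that a single global culling dispatch would read. The layout is purely illustrative (the field set and sizes are assumptions), but it shows where the memory concern comes from.

```csharp
using System.Runtime.InteropServices;
using UnityEngine;

// Illustration only: per-instance record a single "global" cull dispatch would read.
// 48 bytes per instance; at 5M instances that's ~240 MB just for culling input,
// plus it has to be kept up to date whenever objects move or are added/removed.
[StructLayout(LayoutKind.Sequential)]
struct CullInstance
{
    public Vector3 boundsCenter;   // world-space bounding-sphere center
    public float   boundsRadius;   // or AABB extents, depending on the cull test
    public Vector3 position;       // could also reference a transform elsewhere
    public uint    prototypeIndex; // which mesh/material/LOD set this belongs to
    public uint    batchBase;      // first indirect-command slot for this prototype
    public uint    lodCount;
    public uint    flags;          // static/dynamic, shadow caster, etc.
    public uint    padding;
}
```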
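Regarding question 3, this is the indexing math I have in mind, written out in C# for clarity (on the GPU it would map to countbits() plus the existing prefix-sum pass, now over per-word visible counts instead of per-object flags). What I’m less sure about is how cleanly this slots into the compaction pass.

```csharp
// Sketch of compaction indexing over a packed visibility bit mask.
// visibilityWords[w] holds the visibility bits of objects [w*32, w*32 + 31];
// wordOffsets[w] is the exclusive prefix sum of PopCount(visibilityWords[w]).
static class BitMaskCompaction
{
    static int PopCount(uint v) // countbits() on the GPU
    {
        v = v - ((v >> 1) & 0x55555555u);
        v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
        return (int)((((v + (v >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24);
    }

    // Destination slot of a *visible* object in the compacted buffer.
    static int CompactedIndex(uint[] visibilityWords, int[] wordOffsets, int objectIndex)
    {
        int word = objectIndex >> 5;   // objectIndex / 32
        int bit  = objectIndex & 31;   // objectIndex % 32
        // Visible objects in this word that come before this bit.
        uint bitsBelow = visibilityWords[word] & ((1u << bit) - 1u);
        // Global slot = visible count in all earlier words + rank within this word.
        return wordOffsets[word] + PopCount(bitsBelow);
    }
}
```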
Any suggestions, ideas, or architecture references are greatly appreciated. Thanks!