I’m developing mesh cluster culling on the CPU side for mobile (mobile GPUs can’t afford SSBOs). Right now I split the mesh into many small parts using submeshes. After the culling is done, I need to render the list of visible submeshes.
If I use Graphics.DrawMesh to draw them one by one, the cost of the Graphics.DrawMesh calls is too high. So I need to combine contiguous visible submeshes into a single submesh or draw call.
I found that DrawMeshInstancedIndirect can set the index buffer start position and index count, which would achieve what I need. But it doesn’t benefit from the SRP Batcher.
So my question is whether there is an API like DrawMesh(Mesh mesh, int subMeshIndexStart, int subMeshIndexEnd)?
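To make the merging step concrete, here is a minimal sketch of combining contiguous visible submeshes into (index start, index count) ranges. It is plain Python rather than Unity code, and the names `merge_visible_ranges`, `submeshes` and `visible` are made up for illustration. It assumes the submeshes are listed in index-buffer order.

```python
from typing import List, Tuple

# (index_start, index_count) of one submesh, listed in index-buffer order.
SubMesh = Tuple[int, int]

def merge_visible_ranges(submeshes: List[SubMesh], visible: List[bool]) -> List[SubMesh]:
    """Merge runs of visible submeshes that are adjacent in the index buffer."""
    merged: List[SubMesh] = []
    for (start, count), is_visible in zip(submeshes, visible):
        if not is_visible or count == 0:
            continue
        if merged and merged[-1][0] + merged[-1][1] == start:
            # This submesh starts exactly where the previous range ends: extend it.
            prev_start, prev_count = merged[-1]
            merged[-1] = (prev_start, prev_count + count)
        else:
            merged.append((start, count))
    return merged

# Hypothetical data: five submeshes, the third one was culled.
submeshes = [(0, 300), (300, 300), (600, 150), (750, 450), (1200, 300)]
visible = [True, True, False, True, True]
print(merge_visible_ranges(submeshes, visible))  # [(0, 600), (750, 750)]
```

In this example four visible submeshes collapse into two ranges, so two draws instead of four. How much this helps depends on how spatially coherent the submesh order is.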
The problem with mesh cluster culling on the CPU side is exactly what you are pointing out: you need some way to send the index data to the GPU every frame. You can certainly use something like Graphics.DrawProcedural (the second option, with the index buffer parameter) and just copy your indices to that GraphicsBuffer each frame, but I think that might be too slow for your purposes (as it can potentially be a lot of index copying). Are you sure SSBOs are too slow for your use? Maybe in your case the cluster culling will offset the cost.
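For a rough idea of what "copying your indices each frame" involves, here is a small CPU-side sketch in plain Python (not Unity code). The helper `build_visible_indices` and the cluster layout are hypothetical. It just concatenates the index slices of the visible clusters into one compact array, which is the data you would then upload to the index buffer.

```python
from array import array

def build_visible_indices(indices, clusters, visible):
    """Concatenate the index slices of all visible clusters into one array."""
    out = array("I")
    for (start, count), is_visible in zip(clusters, visible):
        if is_visible:
            out.extend(indices[start:start + count])
    return out

# Hypothetical mesh: 12 indices split into four clusters of one triangle each.
indices = array("I", range(12))
clusters = [(0, 3), (3, 3), (6, 3), (9, 3)]
visible = [True, False, False, True]

compact = build_visible_indices(indices, clusters, visible)
print(list(compact))  # [0, 1, 2, 9, 10, 11]
print(len(compact))   # number of indices that would be uploaded this frame
```

The copy grows with the number of visible indices, so for a dense scene this upload happens every frame on top of the culling itself.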
EDIT: I’m also not sure how you would do this even with an index buffer, since you still need to index into a vertex array, so you would still have to use an SSBO, unless you just have one massive array of all your world-position vertices. But that would be a huge memory cost unless you have very small maps.
I’m very sure SSBOs are much slower than VBOs on Snapdragon GPUs. The reason is that an SSBO data fetch is converted into a texture fetch, and on GLES 3.x each fetch only reads 4 bytes (one integer), so fetching a float4 takes 4 texture fetches. Too many vertex texture fetches significantly slow down the vertex shader compared to normal VBO or UBO usage.
I also ran the Snapdragon Profiler to check the reason and saw a much higher Texture Fetch Stall value, and I confirmed it in the official documentation about SSBO performance.
In my test case, I use DrawMeshIndirect to draw a large amount of grass (20k+). Traditional GPU instancing is significantly faster than DrawMeshIndirect with a float4x4 fetch in the vertex shader, so I think fetching the vertex data (pos, normal, uv, color…) will be even slower than fetching a single float4x4.
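As a quick back-of-the-envelope check of the fetch counts, assuming (as described above) that each fetch returns one 4-byte value, the arithmetic looks like this. The helper name `fetches_for` is made up.

```python
FETCH_BYTES = 4  # assumption from the post: one 4-byte value per texture fetch
FLOAT_BYTES = 4

def fetches_for(num_floats: int) -> int:
    """Number of 4-byte fetches needed to read num_floats floats."""
    total_bytes = num_floats * FLOAT_BYTES
    return (total_bytes + FETCH_BYTES - 1) // FETCH_BYTES

print(fetches_for(4))   # float4   -> 4
print(fetches_for(16))  # float4x4 -> 16
```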
After digging into this for a few days, I found I can still use indirect draws, using Graphics.RenderMeshIndirect (available since 2021.2) instead of DrawMeshIndirect, because the new API is much faster for a large number of draw calls (10k+, only tested in the Editor so far). Combining the indirect draw args buffer in a compute shader could merge multiple submesh draw calls into one draw call. I’m working on it now and hope it will work.
Although it’s under the Vulkan section, I tested it and it also applies to GLES 3.x. It says “Reading data from SSBOs or Image buffers effectively become texture fetches so performance/latency would be similar.” and the compiled code in Unity confirms that it fetches one integer per texture fetch.
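To illustrate what the args buffer would contain, here is a CPU-side sketch in plain Python (not Unity code, and not the compute shader version). It assumes the usual five-field indexed indirect layout of 32-bit values: index count per instance, instance count, start index, base vertex, start instance. As far as I know this matches Unity’s GraphicsBuffer.IndirectDrawIndexedArgs, but treat that as an assumption. The helper `pack_draw_args` and the example ranges are made up.

```python
import struct

# Assumed layout of one indexed indirect draw command (five little-endian uint32):
# indexCountPerInstance, instanceCount, startIndex, baseVertexIndex, startInstance
ARGS_FORMAT = "<5I"
ARGS_STRIDE = struct.calcsize(ARGS_FORMAT)  # 20 bytes per command

def pack_draw_args(ranges, instance_count=1, base_vertex=0):
    """Pack one draw command per (start_index, index_count) range."""
    buf = bytearray()
    for start_index, index_count in ranges:
        buf += struct.pack(ARGS_FORMAT, index_count, instance_count,
                           start_index, base_vertex, 0)
    return bytes(buf)

# Hypothetical merged ranges of visible submeshes: (start_index, index_count).
ranges = [(0, 600), (750, 750)]
args = pack_draw_args(ranges)

print(len(args) // ARGS_STRIDE)                         # 2 commands
print(struct.unpack_from(ARGS_FORMAT, args, ARGS_STRIDE))  # (750, 1, 750, 0, 0)
```

Each visible range becomes one command in the same buffer, so the submission can be a single indirect call covering several commands. Whether the driver treats that as one draw on GLES 3.x is something this sketch doesn’t show.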
Interesting. It mentions binning which is considered an older technique for tiled rendering. Wonder how up to date that is. I know at least modern Apple mobile GPUs haven’t used binning in a while.
At least the Snapdragon 855 and 865 are still using binning, according to the Snapdragon Profiler’s render state capture. Maybe the 8+ Gen 1 will be different, but we can’t only target the latest GPUs, and what’s more, many people are still using the Snapdragon 660, which is very old.
You can refer to this plugin, which culls a large number of grass instances using a compute shader and writes the results into a texture, then uses that texture for rendering in the vertex shader. This way, you don’t need to use an SSBO.