Hello. I’m working on a compute shader that operates over a mesh with 1025x1025 vertices. I would be grateful if someone could help me understand performance issues that I’m having and possibly suggest an optimization or a workaround for this issue. (I’m working in Unity version 2022.3.13f1 and URP)
First I set all the buffers and dispatch the shader with a 32,32,1 group setting. (I will refer to this as frame 0.)
    private void CreateTerrain()
    {
        // heights is a float[] with one height per vertex.
        TerrainHeightMapBuffer.SetData(heights);

        // Update vertex positions.
        // _gpuVertices is _mesh.GetVertexBuffer(0). The mesh uses IndexFormat.UInt32.
        ComputeTerrainMesh.SetBuffer(0, kShaderVertBuffer, _gpuVertices);
        // _gpuPhysicsVertices is a new GraphicsBuffer(GraphicsBuffer.Target.Raw, kChunkVertCount * kChunkCount, sizeof(float) * 6).
        // It holds the data for all chunks of the "master" mesh. Some verts are repeated so that all vertices
        // of each chunk are contiguous in memory, which makes it easier to assign them to the chunk meshes later on.
        ComputeTerrainMesh.SetBuffer(0, kShaderPhysicsVertBuffer, _gpuPhysicsVertices);
        // GPU buffer holding the heights uploaded above.
        ComputeTerrainMesh.SetBuffer(0, kShaderHeightMapBuffer, TerrainHeightMapBuffer);
        // Backed by a uint[] that holds the number of changed vertices in each chunk mesh.
        ComputeTerrainMesh.SetBuffer(0, kShaderChunkChangedFlagsBuffer, ChunkChangedFlagsBuffer);
        // Dispatch vertex and chunk calculations.
        ComputeTerrainMesh.Dispatch(0, kNumberOfGroups, kNumberOfGroups, 1);

        // Calculate normals (same mesh buffer as before).
        ComputeTerrainMesh.SetBuffer(1, kShaderVertBuffer, _gpuVertices);
        ComputeTerrainMesh.Dispatch(1, kNumberOfGroups, kNumberOfGroups, 1);

        AsyncGPUReadback.Request(_gpuVertices, OnCompleteReadBack);
        AsyncGPUReadback.Request(_gpuPhysicsVertices, OnPhysicsComplete);
    }
Calling this method takes ~1.7ms on my machine.
In the shader I use RWByteAddressBuffers and [numthreads(32, 32, 1)] for both kernels. (I believe the content of the shader is not relevant to the issue, so I will not be posting it here.)
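For reference, here is a minimal sketch of how the group count relates to the thread-group size. It assumes kNumberOfGroups is meant to cover all 1025 vertices per axis; the helper name GroupsFor is made up for illustration and is not part of the project. With [numthreads(32, 32, 1)], 32 groups per axis cover only 1024 vertices, so covering 1025 would take 33 groups (plus a bounds check in the kernel for the threads that fall outside the grid).

```csharp
static class DispatchMath
{
    // Hypothetical helper: smallest number of thread groups such that
    // groups * threadsPerGroup >= itemCount (integer ceiling division).
    public static int GroupsFor(int itemCount, int threadsPerGroup)
    {
        return (itemCount + threadsPerGroup - 1) / threadsPerGroup;
    }
}
```

Under that assumption, DispatchMath.GroupsFor(1025, 32) evaluates to 33, while DispatchMath.GroupsFor(1024, 32) evaluates to 32.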
Then, 3 frames after calling the CreateTerrain method (on frame 3), Gfx.UpdateAsyncReadbackData is called. It takes ~20 ms on the render thread, which stalls the CPU for ~17 ms on frame 4.
Other than that nothing else happens in frame 4.
In frame 5, during EarlyUpdate, the AsyncGPUReadback callbacks are executed.
    private void OnCompleteReadBack(AsyncGPUReadbackRequest request)
    {
        if (!request.done)
        {
            return;
        }
        Profiler.BeginSample("GetData and Set");
        vertData = request.GetData<VertexData>();
        // Marking the mesh dynamic as it will be changed often under some circumstances.
        _terrainVisualMesh.MarkDynamic();
        // Set the entire 1025x1025 buffer to the mesh.
        _terrainVisualMesh.SetVertexBufferData(vertData, 0, 0, vertData.Length);
        Profiler.EndSample();
    }

    private void OnPhysicsComplete(AsyncGPUReadbackRequest request)
    {
        if (!request.done)
        {
            return;
        }
        ChunkChangedFlagsBuffer.GetData(_chunkChangedFlags);
        Profiler.BeginSample("GetPhysicsData and Set");
        vertData = request.GetData<VertexData>();
        for (int i = 0; i < _chunkMeshes.Length; i++)
        {
            // Marking the mesh dynamic as it will be changed often under some circumstances.
            _chunkMeshes[i].MarkDynamic();
            // Set the part of the buffer that contains the ordered vertices for the current chunk.
            _chunkMeshes[i].SetVertexBufferData(vertData, i * kChunkVertCount, 0, kChunkVertCount);
        }
        // Later in the code I use Physics.BakeMesh to apply these meshes to MeshColliders. This also takes a long time.
        // I'm using C# Tasks for async baking right now; if baking still causes trouble with the Unity job system,
        // I will make a separate post about it.
        Profiler.EndSample();
    }
The “GetData and Set” sample takes ~3.5 ms and the “GetPhysicsData and Set” sample takes ~3 ms.
Frame 5 is the beginning of the “Mount Doom” of Semaphore.WaitForSignal, which takes ~56 ms. InitializeBuffer is taking all of this time on the render thread. I presume the larger InitializeBuffer is for the large buffer (_gpuPhysicsVertices) and the other is for the smaller one (_gpuVertices), even though the difference between the two should not be that large. What’s more confusing to me are the little chunks of InitializeBuffer at the end of the timeline. I don’t know what those are.
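To put the buffer sizes in perspective, here is a rough estimate for the visual mesh buffer. It assumes VertexData is 6 floats (24 bytes, for example position plus normal), which matches the stride given for _gpuPhysicsVertices; the actual vertex layout of the visual mesh is not shown here, and the class and method names are made up for illustration. The physics buffer is left out because kChunkVertCount and kChunkCount are not given.

```csharp
static class ReadbackSize
{
    // Assumption: one vertex is 6 floats, i.e. 24 bytes.
    public const int BytesPerVertex = sizeof(float) * 6;

    // Total size in bytes of a square grid of vertices.
    public static long GridBytes(int vertsPerAxis)
    {
        return (long)vertsPerAxis * vertsPerAxis * BytesPerVertex;
    }

    // Convert a byte count to mebibytes.
    public static double ToMiB(long bytes)
    {
        return bytes / (1024.0 * 1024.0);
    }
}
```

Under that assumption, ReadbackSize.GridBytes(1025) is 25,215,000 bytes, roughly 24 MiB, which is read back from the GPU and then uploaded again by SetVertexBufferData.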
Frame 6 is just the large spike, where nothing much happens other than waiting for the render thread and “LargeAllocation.Free”.
In frames 7 and 8 there is nothing going on (as I would expect, since all the hard work is done already) other than editor work, which doesn’t concern me.
Then in frames 9, 11-15 and 17-22 there is a Semaphore wait again, this time under Gfx.WaitForPresentOnGfxThread.
Some questions I have:
- Is there a way to improve the performance of this?
- Am I doing something strictly wrong?
- Can and should I split the workload?
- I would expect everything to be done after frame 5, where “GetData” and “SetVertexBufferData” are called.
Any tips to improve performance are appreciated.