Compute Shader. Lag spikes after AsyncGPUReadbackRequest GetData

Hello. I’m working on a compute shader that operates on a mesh with 1025x1025 vertices. I would be grateful if someone could help me understand the performance issues I’m having and possibly suggest an optimization or workaround. (I’m working in Unity 2022.3.13f1 with URP.)

First I set all buffers and dispatch the shader with a 32,32,1 group setting. (I will refer to this as frame 0.)

private void CreateTerrain()
    {
        TerrainHeightMapBuffer.SetData(heights); //heights is a float[] with a height for each vertex

        // update vertex positions
        ComputeTerrainMesh.SetBuffer(0, kShaderVertBuffer, _gpuVertices); //_gpuVertices is _mesh.GetVertexBuffer(0); the mesh uses IndexFormat.UInt32
        ComputeTerrainMesh.SetBuffer(0, kShaderPhysicsVertBuffer, _gpuPhysicsVertices); //a new GraphicsBuffer(GraphicsBuffer.Target.Raw, kChunkVertCount * kChunkCount, sizeof(float) * 6) holding data for all chunks of the "master" mesh (some verts are repeated so all vertices for each chunk are contiguous, which makes it easier to assign to the chunk meshes later on)
        ComputeTerrainMesh.SetBuffer(0, kShaderHeightMapBuffer, TerrainHeightMapBuffer); //the height map buffer set above
        ComputeTerrainMesh.SetBuffer(0, kShaderChunkChangedFlagsBuffer, ChunkChangedFlagsBuffer); //a uint[] holding the number of changed vertices in each chunk mesh
        ComputeTerrainMesh.Dispatch(0, kNumberOfGroups, kNumberOfGroups, 1); //Dispatch vertex and chunk calculations

        // calculate normals
        ComputeTerrainMesh.SetBuffer(1, kShaderVertBuffer, _gpuVertices); //Same mesh buffer as before
        ComputeTerrainMesh.Dispatch(1, kNumberOfGroups, kNumberOfGroups, 1); //Dispatch normal calculations

        AsyncGPUReadback.Request(_gpuVertices, OnCompleteReadBack);
        AsyncGPUReadback.Request(_gpuPhysicsVertices, OnPhysicsComplete);
    }

Calling this method takes ~1.7ms on my machine.

In the shader I use RWByteAddressBuffers and [numthreads(32, 32, 1)] for both kernels. (I believe the content of the shader is not relevant to the issue, so I won’t post it here.)

Then, 3 frames after calling the CreateTerrain method (on frame 3), Gfx.UpdateAsyncReadbackData is called, which takes ~20ms on the render thread and stalls the CPU for ~17ms on frame 4.


Other than that, nothing else happens in frame 4.
In frame 5, during EarlyUpdate, the AsyncGPUReadback callbacks are executed.

    private void OnCompleteReadBack(AsyncGPUReadbackRequest request)
    {
        if(!request.done)
        {
            return;
        }

        Profiler.BeginSample("GetData and Set");

        vertData = request.GetData<VertexData>();

        _terrainVisualMesh.MarkDynamic(); //Marking the mesh dynamic as it will be changed often under some circumstances
        _terrainVisualMesh.SetVertexBufferData(vertData, 0, 0, vertData.Length); //Set the entire 1025x1025 buffer to the mesh

        Profiler.EndSample();
    }

    private void OnPhysicsComplete(AsyncGPUReadbackRequest request)
    {
        if (!request.done)
        {
            return;
        }

        ChunkChangedFlagsBuffer.GetData(_chunkChangedFlags);

        Profiler.BeginSample("GetPhysicsData and Set");
        vertData = request.GetData<VertexData>();

        for (int i = 0; i < _chunkMeshes.Length; i++)
        {
            _chunkMeshes[i].MarkDynamic(); //Marking the mesh dynamic as it will be changed often under some circumstances
            _chunkMeshes[i].SetVertexBufferData(vertData, i * kChunkVertCount, 0, kChunkVertCount); //Setting a part of the buffer which contains ordered vertices for current chunk
        }
//Later in the code I use Physics.BakeMesh to apply these meshes to MeshColliders. That also takes a long time; I'm using C# Tasks for async baking right now, and if baking still causes trouble under the Unity job system I will make a separate post about it.
        Profiler.EndSample();

    }

The “GetData and Set” sample takes ~3.5ms and “GetPhysicsData and Set” takes ~3ms.
Frame 5 is the beginning of the “mount doom” of Semaphore.WaitForSignal, which takes ~56ms. InitializeBuffer takes all this time on the render thread. I presume the larger init is for the large buffer (_gpuPhysicsVertices) and the other is for the smaller one (_gpuVertices), even though the difference between the two should not be this large. What’s more confusing to me are the little chunks of InitializeBuffer at the end of the timeline; I don’t know what those are.

Frame 6 is just a large spike where nothing much happens other than waiting for the render thread and “LargeAllocation.Free”.


In frames 7 and 8 there is nothing going on (as I would expect, since all the hard work is already done) other than editor work, which doesn’t concern me.

Then in frames 9, 11-15, and 17-22 there is again a semaphore wait, this time under Gfx.WaitForPresentOnGfxThread.

Some questions I have:

  1. Is there a way to improve the performance of this?
  2. Am I doing something strictly wrong?
  3. Can and Should I split the workload?
  4. I would have expected everything to be done after frame 5, where “GetData” and “SetVertexBufferData” are called — why do the later frames still stall?

Any tips to improve performance are appreciated.

This is what frames 9, 11-15, and 17-22 roughly look like:

The issue is that by default, Unity ComputeBuffers live in device (GPU) RAM only. Unity calls this ComputeBufferMode.Immutable. This means that when you issue your AsyncGPUReadbackRequest, the CPU reads the data across the PCIe bus from GPU to CPU memory, which is extremely slow. As shown in this blog post, reading straight from VRAM is ~6.67x slower than reading from device-mapped CPU RAM, and ~390x slower than reading from cached device-mapped CPU RAM.

In your case, there are two possible solutions:

  1. If you are simply using the vertex data from your compute shader to draw objects on screen, and you don’t need it for any CPU-side operations, then there’s actually no need to read the data back. Instead, make a custom shader for the object you are drawing and bind the vertex data to that shader. Although it’s not exactly what you are doing, you can get a general idea of how to do this from Catlike Coding’s post on compute shaders (specifically section 2.2, retrieving positions).

  2. If you do need the data in CPU memory, the best you can do is create your output buffer with ComputeBufferMode.Dynamic, which keeps the buffer in device-mapped CPU RAM at the cost of making your GPU accesses more expensive (the GPU now has to read across PCIe). I don’t know of any way to force the CPU RAM caching, as I don’t believe Unity exposes that functionality.
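As a rough sketch of option 2 (names and sizes mirror the buffers in your post; the exact constants are your own, so treat this as illustrative, not a drop-in fix), the only change is the ComputeBuffer constructor overload that takes a ComputeBufferMode:

```csharp
// Hypothetical setup using the sizes from the original post.
// ComputeBufferMode.Dynamic keeps the buffer in CPU-visible (device-mapped)
// memory, so the async readback no longer copies across PCIe --
// at the cost of slower GPU-side access to this buffer.
var physicsVerts = new ComputeBuffer(
    kChunkVertCount * kChunkCount,   // element count, as in the question
    sizeof(float) * 6,               // stride: position (3 floats) + normal (3 floats)
    ComputeBufferType.Raw,
    ComputeBufferMode.Dynamic);

ComputeTerrainMesh.SetBuffer(0, kShaderPhysicsVertBuffer, physicsVerts);
```

Note this applies to ComputeBuffer; since your _gpuVertices comes from Mesh.GetVertexBuffer(0), you don’t control its allocation mode the same way, so the Dynamic trick is mainly useful for the separate physics buffer.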

Good luck!


You can also avoid the GC allocation by using AsyncGPUReadback.RequestIntoNativeArray.
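A minimal sketch of that approach (field and method names are placeholders; _gpuVertices, VertexData, and _terrainVisualMesh are from the original post): allocate one persistent NativeArray and reuse it as the readback target every time, so there is no per-request managed allocation.

```csharp
// Sketch: readback into a reusable, persistently allocated NativeArray.
NativeArray<VertexData> _readbackTarget;

void AllocateReadbackTarget(int vertexCount)
{
    _readbackTarget = new NativeArray<VertexData>(
        vertexCount, Allocator.Persistent, NativeArrayOptions.UninitializedMemory);
}

void RequestReadback()
{
    // Writes the GPU data directly into _readbackTarget when the request completes.
    AsyncGPUReadback.RequestIntoNativeArray(ref _readbackTarget, _gpuVertices, OnCompleteReadBack);
}

void OnCompleteReadBack(AsyncGPUReadbackRequest request)
{
    if (request.hasError) return;
    // _readbackTarget now holds the vertex data; feed it straight to the mesh.
    _terrainVisualMesh.SetVertexBufferData(_readbackTarget, 0, 0, _readbackTarget.Length);
}

void OnDestroy() => _readbackTarget.Dispose();
```

One caveat: the array must not be disposed (or written to from the CPU) while a request into it is still in flight, so keep its lifetime tied to the requests.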