3 questions about compute shader performance

  1. I’ve read (as recently as 2020) that raw byte buffers are better than structured buffers for performance. Is this true?

  2. One article I read (from NVIDIA) suggested not splitting data across cache lines, while another (from AMD) said you should use as little data as possible. Would it be better to write to a buffer of float3s from multiple threads, or to a buffer of float4s where I ignore the W?

  3. And now for the big one… WTF is up with all the state changes and API calls when dispatching a compute shader? Dispatching the same kernel in URP using a CommandBuffer generates 13 state changes per dispatch in RenderDoc.

After the first dispatch, the commands recorded between dispatches look like this:

_cmd.SetComputeMatrixParam(_csExtrude, ID_OBJECT_TO_WORLD, transform);
_cmd.SetComputeIntParam(_csExtrude, ID_VERTEX_STRIDE, vertices.stride);
_cmd.SetComputeBufferParam(_csExtrude, kEdges, ID_EDGE_ADJACENCY, edges);
_cmd.SetComputeBufferParam(_csExtrude, kEdges, ID_VERTICES_IN, vertices);
_cmd.DispatchCompute(_csExtrude, kEdges, threadGroupsX, batchSize, 1);

None of that touches hull shaders, yet hull shader state still gets set. Isn’t that bad?

https://github.com/sebbbi/perftest


Thanks! If I’m reading that correctly, it answers #1: structured buffer loads are as fast as or faster than raw loads on the PC/PS4/Xbone GPUs that support them (some Intel integrated GPUs excepted), because the compiler/driver can prove alignment. I assume stores aren’t too different. That’s probably a reasonable answer for #2 as well: just use 16-byte alignment for everything where possible.

The massive number of pointless draw calls on the CPU side still concerns me, though.
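To make the float3 vs float4 trade-off from #2 concrete, here is a minimal sketch in plain C# (no Unity APIs) that counts how many tightly packed elements cross a cache-line boundary. The 64-byte line size and the `CacheLineDemo` / `CountStraddling` names are assumptions for illustration only; real GPU cache-line sizes and write-coalescing behavior vary by vendor and architecture.

```csharp
using System;

public static class CacheLineDemo
{
    // Counts how many of the first `count` tightly packed elements have a
    // byte range that crosses a cache-line boundary.
    // elementSize:   size (and stride) of one element in bytes, e.g. 12 for float3
    // cacheLineSize: assumed cache-line size in bytes (hypothetical: 64)
    public static int CountStraddling(int elementSize, int cacheLineSize, int count)
    {
        int straddling = 0;
        for (int i = 0; i < count; i++)
        {
            int firstByte = i * elementSize;
            int lastByte = firstByte + elementSize - 1;

            // The element straddles if its first and last byte fall in different lines.
            if (firstByte / cacheLineSize != lastByte / cacheLineSize)
                straddling++;
        }
        return straddling;
    }
}
```

Under these assumptions, `CacheLineDemo.CountStraddling(12, 64, 16)` returns 2 (elements 5 and 10 of a float3 buffer cross a line), while `CacheLineDemo.CountStraddling(16, 64, 16)` returns 0 for float4, at the cost of 4 extra bytes per element.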

@burningmime these are not draw calls; they are commands that set up the data for the dispatch.
Looking at the code, the extra calls there are intentional.


Even the stuff like VSSetShaderResources, etc? It seems like at least those 5 (VS/PS/GS/HS/DS) could be skipped if dispatching multiple compute shaders in a row, right? And the map/unmap of the same constant buffer?


Perhaps. I don’t think that’s expensive, though.

Are you updating uniform data between dispatches, using those SetComputeXXXParam calls?

Yes, I am updating uniforms, so I guess it needs to be reuploaded. It also seems to be setting the same buffers even if I do not change them.

If the API calls are nothing to worry about, then cool. Thanks for your help.

I don’t have hard data to back this up, but I suppose that’s the case.