3 questions about compute shader performance

burningmime · October 8, 2021, 3:08am

1. I’ve read (as late as 2020) that raw byte buffers are better than structured buffers for performance. Is this true?
2. One article I read (from NVIDIA) suggested not splitting data across cache lines, while another (from AMD) said you should use as little data as possible. Would it be better to write to a buffer of float3s from multiple threads, or a buffer of float4s where I ignore the W?
3. And now for the big one… WTF is up with all the state changes and API calls when dispatching a compute shader? Each dispatch of the same kernel in URP through a CommandBuffer shows up as 13 state changes in RenderDoc.

After the first one, the command buffer between them looks like this:

_cmd.SetComputeMatrixParam(_csExtrude, ID_OBJECT_TO_WORLD, transform);
_cmd.SetComputeIntParam(_csExtrude, ID_VERTEX_STRIDE, vertices.stride);
_cmd.SetComputeBufferParam(_csExtrude, kEdges, ID_EDGE_ADJACENCY, edges);
_cmd.SetComputeBufferParam(_csExtrude, kEdges, ID_VERTICES_IN, vertices);
_cmd.DispatchCompute(_csExtrude, kEdges, threadGroupsX, batchSize, 1);

None of that has anything to do with hull shaders. Is that bad?

Shane_Michael · October 8, 2021, 4:01am

https://github.com/sebbbi/perftest

burningmime · October 8, 2021, 8:47am

Thanks! If I’m reading that correctly, it answers #1: structured buffer loads are as fast as or faster than raw loads on the PC/PS4/Xbone GPUs that support them (some Intel integrated ones excepted), because the compiler/driver can prove alignment. I assume stores aren’t too different. That also suggests a reasonable answer for #2: just use 16-byte-aligned elements wherever possible.
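For a rough sense of the trade-off in #2, here is a small sketch (Python, purely arithmetic, not taken from the perftest repo). It counts how many tightly packed elements cross a cache-line boundary for a given stride. The 64-byte line size and the function name `count_straddling` are assumptions for illustration; actual line sizes differ between GPUs.

```python
FLOAT_BYTES = 4
LINE_BYTES = 64  # assumed cache-line size; varies by GPU


def count_straddling(stride_bytes: int, line_bytes: int, count: int) -> int:
    """Count tightly packed elements whose bytes span two cache lines."""
    straddling = 0
    for i in range(count):
        first = i * stride_bytes          # first byte of element i
        last = first + stride_bytes - 1   # last byte of element i
        if first // line_bytes != last // line_bytes:
            straddling += 1
    return straddling


if __name__ == "__main__":
    # float3 elements (12-byte stride) vs. float4 elements (16-byte stride)
    print(count_straddling(3 * FLOAT_BYTES, LINE_BYTES, 16))  # 2
    print(count_straddling(4 * FLOAT_BYTES, LINE_BYTES, 16))  # 0
```

Under these assumed numbers, 2 of every 16 float3 elements span two lines, while float4 elements never do, at the cost of a third more memory (256 vs. 192 bytes for 16 elements).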

The massive number of pointless draw calls on the CPU side still concerns me, though.

aleksandrk · October 8, 2021, 8:52am

@burningmime these are not draw calls; they are commands that set up the data for the dispatch.
Looking at the code, the extra calls there are intentional.

burningmime · October 8, 2021, 9:48am

Even the stuff like VSSetShaderResources, etc? It seems like at least those 5 (VS/PS/GS/HS/DS) could be skipped if dispatching multiple compute shaders in a row, right? And the map/unmap of the same constant buffer?

aleksandrk · October 8, 2021, 11:19am

Perhaps. I don’t think that’s expensive, though.

Are you updating uniform data between dispatches with these SetComputeXXXParam calls?

burningmime · October 8, 2021, 11:37am

Yes, I am updating uniforms, so I guess it needs to be reuploaded. It also seems to be setting the same buffers even if I do not change them.

If the API calls are nothing to worry about, then cool. Thanks for your help.

aleksandrk · October 8, 2021, 12:10pm

I don’t have hard data to back this up, but I suppose that’s the case.
