I’ve read (as recently as 2020) that raw byte buffers perform better than structured buffers. Is this true?
One article I read (from NVIDIA) suggested not splitting data across cache lines, while another (from AMD) said you should use as little data as possible. Would it be better to write to a buffer of float3s from multiple threads, or to a buffer of float4s where I ignore the W?
And now for the big one… WTF is up with all the state changes and API calls when dispatching a compute shader? Each dispatch of the same kernel in URP using a CommandBuffer generates 13 state changes in RenderDoc.
Thanks! So if I’m reading that correctly, that answers #1: structured buffer loads are the same as or better than raw loads on PC/PS4/Xbone GPUs that support them, because the compiler/driver can prove alignment. The exception is some Intel integrated GPUs. I assume stores aren’t too different. And a reasonable answer for #2 is probably to just use 16-byte-aligned elements wherever possible.
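To put a rough number on the float3 vs. float4 question, here is a minimal sketch that counts how many tightly packed elements end up straddling a cache-line boundary. The 64-byte line size is an assumption (some GPUs use 128 bytes), and `straddling_fraction` is just a helper name I made up for illustration; it doesn't model any real driver or hardware behavior beyond simple address arithmetic.

```python
def straddling_fraction(stride_bytes, element_bytes, line_bytes=64, count=1024):
    """Fraction of elements whose bytes span two cache lines.

    stride_bytes  : distance between consecutive elements in the buffer
    element_bytes : bytes actually read/written per element
    line_bytes    : ASSUMED cache-line size (64 here; 128 on some GPUs)
    """
    straddlers = 0
    for i in range(count):
        first = i * stride_bytes          # first byte of element i
        last = first + element_bytes - 1  # last byte of element i
        if first // line_bytes != last // line_bytes:
            straddlers += 1
    return straddlers / count


# float3 packed at a 12-byte stride vs. float4 at a 16-byte stride (W ignored)
packed_float3 = straddling_fraction(stride_bytes=12, element_bytes=12)
padded_float4 = straddling_fraction(stride_bytes=16, element_bytes=12)
print(packed_float3, padded_float4)
```

Under those assumptions, 2 out of every 16 packed float3s cross a line boundary, while the 16-byte stride never does. The cost is that the float4 buffer is a third larger (16 bytes instead of 12 per element), which is exactly the tension between the NVIDIA and AMD advice.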
The massive number of pointless API calls on the CPU side still concerns me, though.
Even stuff like VSSetShaderResources, etc.? It seems like at least those 5 (VS/PS/GS/HS/DS) could be skipped when dispatching multiple compute shaders in a row, right? And the map/unmap of the same constant buffer?