GraphicsBuffer.SetData renders BRGs faster than LockBufferForWrite

I’m trying to run some proof-of-concept/performance tests on BatchRendererGroup for batches of repetitive data (vegetation/trees/etc.) in place of Graphics.RenderMeshInstanced, since Graphics.RenderMeshIndirect has been broken on specific machines.

I’m essentially running ~100 BRGs with a few hundred instances each (not ideal, but an easier migration to test). Each BRG basically uses the code from the BatchRendererGroup documentation. The only real difference is that my matrix data starts as a NativeArray<float4x4>, which I convert with jobs into two NativeArray<PackedMatrix> arrays for the objectToWorld/worldToObject data, and then upload with GraphicsBuffer.SetData(...) like the example.
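Roughly the shape of the packing job (just a sketch: the job and field names here are illustrative, and the Pack helper assumes the 48-byte, column-major 3x4 PackedMatrix struct from the BRG docs sample):

using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

[BurstCompile]
struct PackMatricesJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<float4x4> LocalToWorld;       // source matrices
    [WriteOnly] public NativeArray<PackedMatrix> ObjectToWorld; // -> matricesNativeArray
    [WriteOnly] public NativeArray<PackedMatrix> WorldToObject; // -> inverseMatricesNativeArray

    public void Execute(int i)
    {
        float4x4 m = LocalToWorld[i];
        ObjectToWorld[i] = Pack(m);
        WorldToObject[i] = Pack(math.inverse(m));
    }

    // Drops the last row (0,0,0,1 for affine transforms) to get the 3x4 packed
    // layout BRG expects. Field names assume the PackedMatrix from the docs sample.
    internal static PackedMatrix Pack(in float4x4 m) => new PackedMatrix
    {
        c0x = m.c0.x, c0y = m.c0.y, c0z = m.c0.z,
        c1x = m.c1.x, c1y = m.c1.y, c1z = m.c1.z,
        c2x = m.c2.x, c2y = m.c2.y, c2z = m.c2.z,
        c3x = m.c3.x, c3y = m.c3.y, c3z = m.c3.z,
    };
}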

Since this requires two extra NativeArrays, I converted this block to use GraphicsBuffer.LockBufferForWrite(...) instead, slicing the returned NativeArray into two parts and having my job write directly into it.

Essentially:

// Run the packing job... then:
graphicsBuffer.SetData(zero, 0, 0, 1); // zero block at the start of the buffer, as in the docs example
graphicsBuffer.SetData(matricesNativeArray, 0, (int)(byteAddressObjectToWorld / PackedMatrix.size), Count);
graphicsBuffer.SetData(inverseMatricesNativeArray, 0, (int)(byteAddressWorldToObject / PackedMatrix.size), Count);

changed to:

NativeArray<PackedMatrix> matrices = graphicsBuffer.LockBufferForWrite<PackedMatrix>((int)(byteAddressObjectToWorld / PackedMatrix.size), Count * 2);
NativeSlice<PackedMatrix> forwardMatrices = matrices.Slice(0, Count);     // objectToWorld half
NativeSlice<PackedMatrix> inverseMatrices = matrices.Slice(Count, Count); // worldToObject half
// Run job on forwardMatrices & inverseMatrices...
graphicsBuffer.UnlockBufferAfterWrite<PackedMatrix>(Count * 2);
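The job behind that comment is essentially the same packing job writing straight into the locked memory. A sketch of one way to wire it up (same usings as the sketch above; this version takes the whole locked array rather than the two slices, localToWorld stands in for my source matrices, and the important bit is that the job must Complete() before UnlockBufferAfterWrite):

[BurstCompile]
struct PackMatricesInPlaceJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<float4x4> LocalToWorld;
    public int Count;

    // The locked buffer memory: [0, Count) = objectToWorld, [Count, 2*Count) = worldToObject.
    // The attribute lifts the "write only at index i" restriction so each Execute
    // can write to both halves.
    [NativeDisableParallelForRestriction]
    public NativeArray<PackedMatrix> Locked;

    public void Execute(int i)
    {
        float4x4 m = LocalToWorld[i];
        Locked[i] = PackMatricesJob.Pack(m);                 // objectToWorld half
        Locked[Count + i] = PackMatricesJob.Pack(math.inverse(m)); // worldToObject half
    }
}

// Between LockBufferForWrite and UnlockBufferAfterWrite:
new PackMatricesInPlaceJob { LocalToWorld = localToWorld, Count = Count, Locked = matrices }
    .Schedule(Count, 64)
    .Complete(); // the write must be finished before UnlockBufferAfterWrite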

It all works, but unexpectedly this results in roughly a 10% increase in GPU frame time in the Editor and in Development builds, and about the same increase in release builds.

I’m not sure why this should matter. I expected differences in how long my code takes to assemble the batch data and write it to the GraphicsBuffer, and maybe differences in job runtime due to caching or something… but once that setup code completes it never runs again: the data should be exactly the same, and I’m submitting the exact same draw commands to BatchCullingOutput, so I don’t see why the upload path would affect the actual render timings.

Is LockBufferForWrite structuring the memory differently somehow? Is there something I’m missing with how this executes? Or is this just a bug?

FWIW, this is Unity 2023.2 and HDRP 16.0.5. I know the BRG APIs have been updated since, but I can’t tell whether this is something that’s been addressed.

You’re most likely writing to a temporary buffer each time you call LockBufferForWrite, since the buffer is likely currently being used by the GPU for rendering. This means extra copying is required before the next frame.

You can try making 3 buffers (one per swap-chain image) and cycling between them each frame, e.g. currentFrameBuffer = graphicsBuffers[frameIndex % 3]; (roughly the sketch below).

Unity doesn’t have any good APIs for querying this kind of thing unfortunately.
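The cycling could look something like this (a sketch: the class name, the hard-coded count of three, and the buffer creation parameters are assumptions; the usage flag is only needed if you keep using LockBufferForWrite):

using UnityEngine;

// Round-robin over a few persistent GraphicsBuffers so the CPU never locks a
// buffer the GPU might still be reading from.
class GraphicsBufferRing : System.IDisposable
{
    const int kFramesInFlight = 3; // assumption; match however many frames your target keeps in flight

    readonly GraphicsBuffer[] m_Buffers = new GraphicsBuffer[kFramesInFlight];
    int m_FrameIndex;

    public GraphicsBufferRing(int count, int stride)
    {
        for (int i = 0; i < kFramesInFlight; i++)
        {
            m_Buffers[i] = new GraphicsBuffer(
                GraphicsBuffer.Target.Raw,
                GraphicsBuffer.UsageFlags.LockBufferForWrite, // required to call LockBufferForWrite
                count, stride);
        }
    }

    // The buffer that is safe to write this frame (the currentFrameBuffer in the one-liner above).
    public GraphicsBuffer Current => m_Buffers[m_FrameIndex % kFramesInFlight];

    // Advance once per frame, after submitting the draws that read Current.
    public void NextFrame() => m_FrameIndex++;

    public void Dispose()
    {
        foreach (var buffer in m_Buffers)
            buffer?.Dispose();
    }
}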

> You’re most likely writing to a temporary buffer each time you call LockBufferForWrite, since the buffer is likely currently being used by the GPU for rendering. This means extra copying is required before the next frame.

I’d understand a slowdown like that, and I was prepared to do a circular buffer like you mentioned, but that’s not what I’m seeing.

It’s not just the next frame that’s slower: I write the buffer once in a setup step and then just pass the BRG along with no modifications in subsequent culling callbacks. I never modify the data again, yet every single frame that uses this data renders ~10% slower.

I’m not resubmitting changes, I’m not modifying the GPU data, and I’m not rendering different data at all. But for some reason, if the buffer was filled with LockBufferForWrite instead of SetData, the actual GPU render timings are worse.

It makes me think buffers filled with SetData end up allocated differently (more optimally for the GPU) than buffers written through LockBufferForWrite. I’d rather use LockBufferForWrite to save the extra allocations, and the docs push it as the path with fewer copies… but hurting runtime render performance is a much more significant cost…