I’m trying to run some proof-of-concept/performance tests on BatchRendererGroup for batches of repetitive data (vegetation, trees, etc.) in place of Graphics.RenderMeshInstanced, as Graphics.RenderMeshIndirect has been broken on specific machines.
I’m essentially running ~100 BRGs with a few hundred instances each (not ideal, but an easier migration to test). Each BRG basically uses the code from the BatchRendererGroup documentation. The only real difference is that my matrix data is a NativeArray<float4x4>, which I convert with jobs into two NativeArray<PackedMatrix>s for the object-to-world/world-to-object data, and then call GraphicsBuffer.SetData(...) as in the example.
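
For context, the conversion step looks roughly like the sketch below. PackedMatrix is assumed to be the 48-byte, column-major struct from the BatchRendererGroup docs sample, and the job/helper names are illustrative placeholders rather than my actual code:

using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

// Assumes PackedMatrix is the 3x4 (48-byte, column-major) struct from the
// BatchRendererGroup docs sample, with public fields c0x..c3z.
static class MatrixPacking
{
    // Drops the constant (0, 0, 0, 1) bottom row of a float4x4.
    public static PackedMatrix Pack(in float4x4 m) => new PackedMatrix
    {
        c0x = m.c0.x, c0y = m.c0.y, c0z = m.c0.z,
        c1x = m.c1.x, c1y = m.c1.y, c1z = m.c1.z,
        c2x = m.c2.x, c2y = m.c2.y, c2z = m.c2.z,
        c3x = m.c3.x, c3y = m.c3.y, c3z = m.c3.z,
    };
}

[BurstCompile]
struct PackMatricesJob : IJobParallelFor
{
    [ReadOnly]  public NativeArray<float4x4> Source;
    [WriteOnly] public NativeArray<PackedMatrix> ObjectToWorld;
    [WriteOnly] public NativeArray<PackedMatrix> WorldToObject;

    public void Execute(int i)
    {
        float4x4 m = Source[i];
        ObjectToWorld[i] = MatrixPacking.Pack(m);
        WorldToObject[i] = MatrixPacking.Pack(math.inverse(m));
    }
}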
Since this results in needing two extra NativeArrays, I converted this block to instead use GraphicsBuffer.LockBufferForWrite(...), slice the returned NativeArray into two parts, and have my job write directly into that NativeArray.
Essentially:
// Run job to fill matricesNativeArray & inverseMatricesNativeArray... Then:
graphicsBuffer.SetData(zero, 0, 0, 1);
graphicsBuffer.SetData(matricesNativeArray, 0, (int)(byteAddressObjectToWorld / PackedMatrix.size), Count);
graphicsBuffer.SetData(inverseMatricesNativeArray, 0, (int)(byteAddressWorldToObject / PackedMatrix.size), Count);
changed to:
NativeArray<PackedMatrix> matrices = graphicsBuffer.LockBufferForWrite<PackedMatrix>((int)(byteAddressObjectToWorld / PackedMatrix.size), Count * 2);
NativeSlice<PackedMatrix> forwardMatrices = matrices.Slice(0, Count);
NativeSlice<PackedMatrix> inverseMatrices = matrices.Slice(Count, Count);
// Run job on forwardMatrices & inverseMatrices...
graphicsBuffer.UnlockBufferAfterWrite<PackedMatrix>(Count * 2);
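
The job that runs on forwardMatrices and inverseMatrices is the same conversion as before, just writing through the slices into the locked memory. A minimal single-threaded sketch, reusing the hypothetical MatrixPacking.Pack helper from the earlier sketch:

// Same conversion as the SetData path, but writing straight into the locked
// buffer memory through the two NativeSlices (single IJob for simplicity).
[BurstCompile]
struct PackMatricesToLockedBufferJob : IJob
{
    [ReadOnly] public NativeArray<float4x4> Source;
    public NativeSlice<PackedMatrix> ObjectToWorld; // forwardMatrices
    public NativeSlice<PackedMatrix> WorldToObject; // inverseMatrices

    public void Execute()
    {
        for (int i = 0; i < Source.Length; i++)
        {
            float4x4 m = Source[i];
            ObjectToWorld[i] = MatrixPacking.Pack(m);
            WorldToObject[i] = MatrixPacking.Pack(math.inverse(m));
        }
    }
}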
It all works, but unexpectedly this results in roughly a 10% increase in GPU frame time in the Editor and in Development builds, and approximately the same increase in Release builds.

I’m not sure why this would matter. I expected differences in how long it takes my code to assemble the batch data and write it to the GraphicsBuffer, and maybe differences in the job runtime due to caching or something, but once that code completes it doesn’t run again: the data should be exactly the same, and I’m submitting the exact same draw commands to BatchCullingOutput, so I don’t see why this would impact the actual render timings.
Is LockBufferForWrite structuring the memory differently somehow? Is there something I’m missing with how this executes? Or is this just a bug?
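
For reference, the one setup difference I’m aware of between the two paths is at buffer creation, since LockBufferForWrite requires the buffer to be created with GraphicsBuffer.UsageFlags.LockBufferForWrite. Roughly (bufferCountInInts is just a placeholder; Target.Raw as in the docs example):

// SetData path: default usage flags.
var buffer = new GraphicsBuffer(GraphicsBuffer.Target.Raw, bufferCountInInts, sizeof(int));

// LockBufferForWrite path: buffer must be created with the matching usage flag.
var lockableBuffer = new GraphicsBuffer(
    GraphicsBuffer.Target.Raw,
    GraphicsBuffer.UsageFlags.LockBufferForWrite,
    bufferCountInInts,
    sizeof(int));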
FWIW, this is Unity 2023.2 and HDRP 16.0.5. I know the BRG APIs have been updated since, but I can’t tell whether this is something that’s been addressed.