Performant way to add & remove Lists of structs from buffer

My world is split into many (tens of thousands) of Chunks that each have a list of structs containing matrices for objects that should be GPU instanced when that Chunk is visible to the player (determined by mesh renderers on each Chunk). In order to save on draw calls, I have one universal GraphicsBuffer of these structs that I render with Graphics.RenderMeshIndirect.

I would like to efficiently add/remove each Chunk’s structs from the buffer possibly many times per second (as Chunks are loaded, unloaded, become visible, become invisible, etc.). What would be an efficient/performant way of doing this?

Here is some pseudocode:

public struct GPUInstancingStruct {
    public float4x4 matrix;
}

// Many Chunks in world
public class Chunk {
    public List<GPUInstancingStruct> structs;
}

// One GPUInstancer
public class GPUInstancer : MonoBehaviour {
    GraphicsBuffer buffer;

    private void Start() {
        UpdateBuffer();
    }

    // Very inefficient
    private void UpdateBuffer() {
        // combine structs from all Chunks in world into one list
        List<GPUInstancingStruct> allStructs = new();
        foreach (Chunk chunk in World.Chunks) {
            allStructs.AddRange(chunk.structs);
        }

        // is there a way to reuse the same buffer for a different amount of structs?
        buffer?.Dispose();
        buffer = new(GraphicsBuffer.Target.Structured, allStructs.Count, sizeof(float) * 16);
        buffer.SetData(allStructs);
    }

    // Can be called many times per second
    public void OnChunkAdded(Chunk chunk) {
        // TODO: need some performant way of adding chunk's structs to buffer
        // instead of recombining ALL chunks into one list again with UpdateBuffer
        UpdateBuffer();
    }

    // Can be called many times per second
    public void OnChunkRemoved(Chunk chunk) {
        // TODO: need some performant way of removing chunk's structs from buffer
        // instead of recombining ALL chunks into one list again with UpdateBuffer
        UpdateBuffer();
    }

    private void Update() {
        RenderMeshesUsingBuffer();
    }
}

I thought of saving all structs in a list in GPUInstancer and having each Chunk store its index in the list and struct count, so adding and removing Chunks would look like this:

public void OnChunkAdded(Chunk chunk) {
    chunk.index = allStructs.Count;
    chunk.count = chunk.structs.Count;
    allStructs.AddRange(chunk.structs);
    CreateBuffer();
}

public void OnChunkRemoved(Chunk chunk) {
    allStructs.RemoveRange(chunk.index, chunk.count);
    // need to update chunk.index for every Chunk after this one...
    UpdateChunkIndicesForChunksAfterThisOne(chunk);
    CreateBuffer();
}

private void UpdateChunkIndicesForChunksAfterThisOne(Chunk chunkIn) {
    foreach (Chunk chunk in World.Chunks) {
        if (chunk.index > chunkIn.index) {
            chunk.index -= chunkIn.count;
        }
    }
}

private void CreateBuffer() {
    buffer?.Dispose();
    buffer = new(GraphicsBuffer.Target.Structured, allStructs.Count, sizeof(float) * 16);
    buffer.SetData(allStructs);
}

This still recreates the buffer every time a Chunk is added or removed, and iterates over every Chunk to update Chunk.index whenever one is removed, so I don’t know if it is the most performant option. allStructs.RemoveRange also has to copy every struct that comes after the removed range, so it can be up to O(n).

Assuming a buffer that is larger than what you need isn’t a problem (RenderMeshIndirect lets you specify how many elements of the buffer to actually use), the fastest approach is first to avoid recreating the buffer all the time, and second to avoid extra intermediate copies:

using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using UnityEngine;

public class Chunk
{
    // adjust to the minimum number of elements you expect here
    private const int defaultCapacity = 4;
    // NativeList instead of List so copying to a NativeArray is easier/faster
    public NativeList<GPUInstancingStruct> structs = new(defaultCapacity, Allocator.Persistent);
}

public class GPUInstancer : MonoBehaviour
{
    GraphicsBuffer buffer;

    void Update()
    {
        int structCount = 0;

        // find total count of structs
        for (int i = World.Chunks.Count; i-- > 0;)
            structCount += World.Chunks[i].structs.Length;

        // if buffer isn't created, or isn't large enough
        if (buffer == null || buffer.count < structCount)
        {
            // dispose of the old buffer if it exists
            buffer?.Dispose();
            // create a new buffer with a capacity 1.5x larger than needed
            buffer = new GraphicsBuffer(GraphicsBuffer.Target.Structured, GraphicsBuffer.UsageFlags.LockBufferForWrite, (int)(structCount * 1.5f), UnsafeUtility.SizeOf(typeof(GPUInstancingStruct)));
        }

        // get direct access to the GPU memory (if supported; if not, it's still faster than using SetData)
        NativeArray<GPUInstancingStruct> bufferData = buffer.LockBufferForWrite<GPUInstancingStruct>(0, structCount);

        // copy the structs of all chunks in the GPU memory
        int bufferIndex = 0;
        for (int i = World.Chunks.Count; i-- > 0;)
        {
            NativeArray<GPUInstancingStruct> structs = World.Chunks[i].structs.AsArray();
            NativeArray<GPUInstancingStruct>.Copy(structs, 0, bufferData, bufferIndex, structs.Length);
            bufferIndex += structs.Length;
        }

        // ends the write operation, dispose the bufferData NativeArray
        buffer.UnlockBufferAfterWrite<GPUInstancingStruct>(structCount);

        // do your stuff
        RenderMeshesUsingBuffer();
    }
}

If you don’t expect the chunk data to change every frame, you can avoid updating the buffer by setting a “bufferIsDirty” flag from your “OnChunkAdded()” and “OnChunkRemoved()” methods, and only updating the buffer if that flag is set.
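A minimal sketch of that dirty-flag variant, assuming UpdateBuffer() contains the lock-and-copy code from the Update() method above:

private bool bufferIsDirty = true;

public void OnChunkAdded(Chunk chunk) => bufferIsDirty = true;
public void OnChunkRemoved(Chunk chunk) => bufferIsDirty = true;

private void Update()
{
    // only re-upload when chunk data actually changed since the last frame
    if (bufferIsDirty)
    {
        UpdateBuffer();
        bufferIsDirty = false;
    }
    RenderMeshesUsingBuffer();
}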


Well, we don’t know how much the number of structs inside a single chunk changes, or how much it varies from chunk to chunk. Those are important metrics to take into account when you want to use some kind of “smart” system. Of course you should never destroy or recreate your buffers unless one is too small (as Gotmachine already mentioned). Copying large amounts of data isn’t that bad after all, though I wouldn’t update that buffer too often. When you have a really large number of chunks, it usually makes sense to only purge the buffer every now and then, for example once every second, or when the number of chunks to remove exceeds a certain threshold. You can always add new chunks at the end. You may want to use some additional lists to track certain states.
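A rough sketch of that append-now, purge-later idea; pendingRemovals, removalThreshold, CopyChunkIntoBuffer and Compact are made-up names for illustration:

// adds are cheap: new chunks are always appended at the current end
public void OnChunkAdded(Chunk chunk)
{
    chunk.index = structCount;
    chunk.count = chunk.structs.Length;
    CopyChunkIntoBuffer(chunk, structCount); // write only this chunk's range
    structCount += chunk.count;
}

// removals are deferred: dead ranges accumulate until a threshold is hit
public void OnChunkRemoved(Chunk chunk)
{
    pendingRemovals.Add(chunk);
    if (pendingRemovals.Count > removalThreshold)
        Compact(); // refill the buffer from the surviving chunks
}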

If the number of objects in each chunk stays constant and those numbers don’t differ too much, it would also be possible to partition the buffer so each partition can easily be “defragmented” when a chunk needs to be removed. However, when the numbers of objects per chunk are vastly different, this wouldn’t really give any benefit and would just result in much more overhead. In that case it’s much easier to just rebuild / refill the buffer with the active chunks.

Depending on how fast you can move through the world, it might make sense to have two buffers, one long-term and one short-term, a bit like a generational garbage collector. Of course this also requires additional management. The chunks would essentially count the number of frames they have been visible, and at the next long-term buffer update, chunks that have been visible for a long time would be moved there.
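A hypothetical sketch of that generational promotion step; all names here (shortTermChunks, longTermChunks, visibleFrameCount, promotionThreshold, the rebuild methods) are made up:

private void LongTermUpdate() // called rarely, e.g. once per second
{
    // iterate backwards so chunks can be removed while iterating
    for (int i = shortTermChunks.Count; i-- > 0;)
    {
        Chunk chunk = shortTermChunks[i];
        if (chunk.visibleFrameCount > promotionThreshold)
        {
            shortTermChunks.RemoveAt(i);
            longTermChunks.Add(chunk); // large buffer, rarely rewritten
        }
    }
    RebuildLongTermBufferIfChanged();
    RebuildShortTermBuffer(); // small buffer, rebuilt frequently
}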

When the order of the objects in the buffer is irrelevant, it’s also possible to use index lists so each chunk can remember at which indices its objects are located. This would allow O(1) removal, as an object in the middle of the buffer could be removed by overwriting its slot with the last element in the buffer. Of course this “move” would need to be tracked, so there is some additional bookkeeping.
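A sketch of that swap-with-last removal; slotToChunk and ReplaceSlot are hypothetical names, and each chunk is assumed to keep a list of the buffer slots its objects currently occupy:

private void RemoveSlot(int slot)
{
    int last = structCount - 1;
    if (slot != last)
    {
        // overwrite the removed slot with the last element...
        allStructs[slot] = allStructs[last];
        // ...and update the bookkeeping of the chunk that owned it
        Chunk movedOwner = slotToChunk[last];
        movedOwner.ReplaceSlot(last, slot); // track the move
        slotToChunk[slot] = movedOwner;
    }
    structCount--;
}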

In the end everything comes down to the actual numbers (min/max objects per chunk, number of chunks, how often the number of objects inside a chunk will change, etc.)

Some time ago I made a parallel mandelbrot renderer just for fun. Here’s a WebGL build of it. Here I actually have a class instance for each pixel in that texture, 576000 of them. Each object stores two complex numbers, an iteration counter and the index into the linear texture array, so a total of 4 double values and two integers. I use two Lists which are alternated: one is cleared and all objects are moved from one list to the other every iteration. Those “pixels” which have settled are removed from the list, so the active list only contains the remaining pixels that need to be further iterated (the black ones). Even though this is a WebGL build, on a decent PC it actually runs fine. Note that it doesn’t run as fast as possible, as it only does one iteration per (visual) Update(). Since the order of the objects is irrelevant for the functionality, I could probably exchange the objects with structs and simply replace “deleted” slots with the last one, though I doubt it would change much.

In your specific case, you could maybe replace chunks that are temporarily not visible with degenerate matrices (everything just set to 0), so those objects would probably be culled early and certainly would not generate any fragments. When each chunk remembers its position in the buffer, chunks can selectively be “activated” and “deactivated”. This might allow an even larger update interval. Though again, it all boils down to the actual ranges we are talking about.
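A possible sketch of that in-place deactivation, assuming each chunk remembers its index and count range in the buffer; DeactivateChunk and ActivateChunk are made-up names. A NativeArray allocated with Allocator.Temp is zero-initialized by default, and an all-zero matrix collapses the instance so it produces no fragments:

private void DeactivateChunk(Chunk chunk)
{
    // overwrite the chunk's range with degenerate (all-zero) matrices
    var zeros = new NativeArray<GPUInstancingStruct>(chunk.count, Allocator.Temp);
    buffer.SetData(zeros, 0, chunk.index, chunk.count);
    zeros.Dispose();
}

private void ActivateChunk(Chunk chunk)
{
    // write the real matrices back into the same range
    buffer.SetData(chunk.structs.AsArray(), 0, chunk.index, chunk.count);
}

This avoids any resizing or index shuffling entirely: the buffer layout never changes, only the contents of a chunk’s range.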


Thank you so much for the advice Gotmachine and Bunny83! I did not know about the performance benefits of using buffer.LockBufferForWrite and using NativeArrays/NativeLists for copying!

I went with a similar design to yours with a “bufferIsDirty” flag to only update the buffer at most once per frame for now. With a moderate test setup in a build, it’s only taking around 1ms to update the buffers, which I’m happy with for now!

Thanks again!