When we design a map, we want to store map data as an array of ids.
When we render the map, we want to loop over these ids and gather relevant data for rendering.
But when using DrawMeshInstanced, we want to loop over meshes instead, and give them relevant data.
Which means we need to create some intermediate data structure, on each frame.
Most DrawMeshInstanced examples present the best case scenario, where there are few meshes. But in an actual game, what happens is you have a lot of different meshes, so I am wondering:
In actual games, is batching this rendering data and then calling DrawMeshInstanced once per unique mesh actually faster than just looping over ids and calling DrawMesh with the relevant data? (Note that DrawMesh does support instancing.)
In a sense, I am asking: is the overhead of creating a dynamic list of unique meshes and rendering data, each frame, actually worth it?
I understand DrawMesh must also do this internally for instancing to work, so my question comes down to: can we reasonably beat Unity’s DrawMesh at its own game? If so, what data structure would be best: a large array or a List?
I think what we are currently doing is similar to what Unity ECS does: decouple per-instance data (position, rotation, material properties, etc.) from per-mesh data (mesh, material), group the per-instance data in a good memory layout, then loop over it.
But the fact that DrawMeshInstanced / DrawMeshInstancedIndirect want to be called per mesh, not per instance, means we must create intermediate data every frame. Doesn’t that partly nullify our effort to access memory efficiently?
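To make the "intermediate data" concrete, here is a minimal sketch of per-frame grouping by mesh. MeshBuckets and its members are hypothetical names, and T stands in for UnityEngine.Matrix4x4 so the sketch compiles outside Unity. The buffers are allocated once and only the counts are reset each frame, so the per-frame cost is one array write per instance.

```csharp
using System;

// Hypothetical helper: groups per-instance data by mesh id once per frame.
// T stands in for UnityEngine.Matrix4x4 so the sketch compiles without Unity.
public sealed class MeshBuckets<T>
{
    private readonly T[][] _buffers; // one preallocated buffer per mesh id
    private readonly int[] _counts;  // live instance count per mesh id

    public MeshBuckets(int meshCount, int capacityPerMesh)
    {
        _buffers = new T[meshCount][];
        _counts = new int[meshCount];
        for (int i = 0; i < meshCount; i++)
            _buffers[i] = new T[capacityPerMesh];
    }

    // Call at the start of each frame; buffers are kept, so nothing is allocated.
    public void Clear()
    {
        Array.Clear(_counts, 0, _counts.Length);
    }

    // Append one instance to the bucket of its mesh.
    public void Add(int meshId, T instance)
    {
        int n = _counts[meshId];
        if (n == _buffers[meshId].Length)
            Array.Resize(ref _buffers[meshId], Math.Max(4, n * 2)); // rare growth path
        _buffers[meshId][n] = instance;
        _counts[meshId] = n + 1;
    }

    public int Count(int meshId) { return _counts[meshId]; }
    public T[] Buffer(int meshId) { return _buffers[meshId]; }
}
```

In Unity, the render step would then loop over mesh ids and pass Buffer(meshId) and Count(meshId) to the DrawMeshInstanced overload that takes a matrix array and a count, splitting large buckets into several calls where needed.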
Since Unity ECS has demonstrated superior performance: does that mean we can also build such a data structure efficiently ourselves, per frame? (Preferably without using unsafe code, which ECS uses.)
Sorry this post turned out a bit long-winded, as I wanted to put down my thought process. My TL;DR question: if the wisdom from ECS is that we can build a data structure per frame, feed it to DrawMeshInstanced and still get great performance, can we do it ourselves in C# and get a decent performance boost?
Thx, I hope people can share their experience here!
You have all the instance information available as a transform matrix, right? And you are concerned that you’re wasting precious frame time building an array of said matrices to pass to DrawMeshInstanced each frame? Why do you build it each frame to begin with? Why not have an array that persists, and only update its contents when the instance information in question changes?
The way I see it, it makes perfect sense that DrawMeshInstanced is called per mesh because… well… GPU instancing is exactly that. You pass one mesh and then the GPU will draw that mesh as many times as you tell it to. Calling it per instance would mean that Unity has to keep an instance accumulation buffer that expands each time you call it for a specific instance. That doesn’t make much sense, since you know better than Unity how many instances you want to draw.
Exactly, so I end up storing these Meshes as an Array, then maintain an active array of Meshes per frame: this is a bit counter-intuitive as “intuitively” we want to do less work, but less work in this case is actually slower.
(Just to clarify: my previous concern was with building a mesh array per frame, from a pool of available meshes, using some kind of hashes as lookup key; the position data are in an array already, so no extra processing there.)
Doing this is actually the same approach used by ECS: AFAIK they store components together based on type too.
By the way, I am not saying DrawMeshInstanced should be called per instance. I was just trying to figure out the best trade-off, and my conclusion is that optimizing for memory access gives by far the most significant performance gain, much more than using some clever data structure.
Not to be a stickler for terms here, but you only keep one copy of the Mesh; it’s the entities (position, rotation, scale) that you need to store multiples of. Writing this to clarify any misunderstanding.
And you don’t necessarily need to keep an additional array for active entities. Here’s a crazy idea to try: the function DrawMeshInstanced takes both an array of matrices and an instance count. So you can pass an array containing inactive entities in the tail, and as long as you supply the correct instance count it should work. I suspect this is by design, exactly for our convenience. All you need is a little bit of clever array management to make sure all your inactive entities end up in the tail of the array. Whether this is faster than an intermediate array is, however, up to exploration. I believe it can be, but I’m not the most experienced cache hit optimizer out there.
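A rough sketch of that array management, assuming entities are identified by their index in the initial array. ActiveFirstArray and its members are hypothetical names, and T stands in for UnityEngine.Matrix4x4 so this compiles outside Unity. Activating or deactivating an entity is a single swap, which keeps the active entries packed at the front of the persistent array.

```csharp
using System;

// Hypothetical container: active items occupy [0, ActiveCount), inactive ones sit in the tail.
// T stands in for UnityEngine.Matrix4x4 so the sketch compiles without Unity.
public sealed class ActiveFirstArray<T>
{
    public readonly T[] Items;        // persistent array, handed to the draw call as-is
    private readonly int[] _slotOf;   // entity id -> current slot in Items
    private readonly int[] _entityAt; // slot -> entity id
    public int ActiveCount { get; private set; }

    public ActiveFirstArray(T[] initial)
    {
        Items = initial;
        _slotOf = new int[initial.Length];
        _entityAt = new int[initial.Length];
        for (int i = 0; i < initial.Length; i++) { _slotOf[i] = i; _entityAt[i] = i; }
        ActiveCount = initial.Length; // everything starts active
    }

    public bool IsActive(int entity) { return _slotOf[entity] < ActiveCount; }

    // Move the entity into the tail by swapping it with the last active slot.
    public void Deactivate(int entity)
    {
        if (!IsActive(entity)) return;
        Swap(_slotOf[entity], ActiveCount - 1);
        ActiveCount--;
    }

    // Move the entity back by swapping it with the first inactive slot.
    public void Activate(int entity)
    {
        if (IsActive(entity)) return;
        Swap(_slotOf[entity], ActiveCount);
        ActiveCount++;
    }

    // Overwrite an entity's data in place, wherever it currently lives.
    public void Set(int entity, T value) { Items[_slotOf[entity]] = value; }

    private void Swap(int a, int b)
    {
        if (a == b) return;
        T tmp = Items[a]; Items[a] = Items[b]; Items[b] = tmp;
        int ea = _entityAt[a], eb = _entityAt[b];
        _entityAt[a] = eb; _entityAt[b] = ea;
        _slotOf[ea] = b; _slotOf[eb] = a;
    }
}
```

The draw call would then receive Items together with ActiveCount as the instance count. Note that swapping changes the order of instances, which is fine for opaque instanced geometry but may matter if draw order is significant.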
So I’ve got this bloxel game (think Minecraft) with “meshed” blocks for anything that doesn’t fit into the game as a voxel. A test world I used for this has about 120,000 instances of such meshed blocks, at about 30-100 vertices each. It takes about 8-10 ms on the CPU to gather all the instances into arrays; calling Graphics.DrawMeshInstanced() then takes about 2-3 ms of CPU time for the entire thing. The 8-10 ms is spread across background threads while the main thread does other things, so it’s no issue (as long as you’re not playing it on a dual-core laptop). They’re drawn in batches of up to 1000.
I store it like this:
1) main thing → Dictionary<BlockType, BlockTypeInstances> & List
This splits the instances up by mesh type. Separate list for better iteration.
2) BlockTypeInstances → a custom SortedList<Vector3Int, BlockChunk>
This splits the instances into lists based on position. It’s a custom version of SortedList to have better access to the internal structure for some shortcuts (like only searching for the index once in a “TryGetValue → Add” situation).
Culling and determining whether it’s within shadow distance is done at this level.
3) BlockChunk → a List<Tuple<Matrix4x4, List<Vector3Byte>>>
I’ve commonly got only a few rotations+scales of a mesh type active but a lot of instances, so I store the matrices sparsely. Vector3Byte is the offset from the BlockChunk root in 2); I only need whole offsets (smaller ones are baked into the mesh).
So my innermost loop just loops over the List, makes a local copy of the Matrix4x4, adds the position offset to it, then adds it to a Matrix4x4[] buffer. When the buffer is full, it’s sent to the main thread to be passed to Graphics.DrawMeshInstanced() and a new buffer is fetched.
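The inner loop described above might look roughly like this. It is only a sketch under assumptions: System.Numerics.Matrix4x4 stands in for UnityEngine.Matrix4x4 so it runs outside Unity, and Offset3, BatchFiller and the flush callback are hypothetical names, not the poster's actual code. The callback takes the full buffer and returns a fresh one, since a buffer handed to another thread cannot be reused right away.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics; // stand-in for UnityEngine.Matrix4x4 so the sketch runs outside Unity

// Hypothetical stand-in for the Vector3Byte offsets mentioned above.
public struct Offset3
{
    public byte X, Y, Z;
    public Offset3(byte x, byte y, byte z) { X = x; Y = y; Z = z; }
}

public static class BatchFiller
{
    // Expands one shared rotation+scale matrix over many whole-block offsets.
    // 'flush' receives (fullBuffer, count) and returns the next buffer to fill.
    // Returns the number of instances left in 'buffer' that still need flushing.
    public static int Fill(Matrix4x4 shared, Vector3 chunkRoot, List<Offset3> offsets,
                           ref Matrix4x4[] buffer, int count,
                           Func<Matrix4x4[], int, Matrix4x4[]> flush)
    {
        for (int i = 0; i < offsets.Count; i++)
        {
            Offset3 o = offsets[i];
            Matrix4x4 m = shared; // local copy of the shared matrix
            // System.Numerics keeps translation in M41..M43;
            // UnityEngine.Matrix4x4 keeps it in m03, m13, m23 instead.
            m.M41 += chunkRoot.X + o.X;
            m.M42 += chunkRoot.Y + o.Y;
            m.M43 += chunkRoot.Z + o.Z;
            buffer[count++] = m;
            if (count == buffer.Length)
            {
                buffer = flush(buffer, count); // e.g. queue for the main thread, fetch a new buffer
                count = 0;
            }
        }
        return count;
    }
}
```

With a buffer length of 1000 this matches the batch size mentioned earlier. The caller is responsible for flushing the partially filled last buffer using the returned count.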
Regarding unsafe code: The only need I see for unsafe code is:
A) converting struct data from one type to another (without making another copy) - but this is limited as many APIs don’t support pointers, so it’s mostly for your internal use. (Though maybe the NativeArray can be used for this? I’m still on 2017.4, so idk).
B) directly editing stored struct data - I guess it’s limited to your own container situations, and hopefully in future we can use C# ref returns to do it in a managed way.
C) large temporary buffers - to prevent the mono heap size from exploding and lots of gc rounds being triggered
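For point B), a managed version with ref returns could look something like the sketch below, assuming a compiler that supports C# 7.0 or later. StructBuffer and Particle are hypothetical names used only for illustration. The indexer returns a reference into the backing array, so a stored struct can be edited without copying it out and back in.

```csharp
using System;

// Hypothetical container that exposes its elements by reference (C# 7.0 ref returns).
public sealed class StructBuffer<T> where T : struct
{
    private T[] _items;
    public int Count { get; private set; }

    public StructBuffer(int capacity)
    {
        _items = new T[Math.Max(1, capacity)];
    }

    public void Add(T item)
    {
        if (Count == _items.Length) Array.Resize(ref _items, Count * 2);
        _items[Count++] = item;
    }

    // Returns a reference into the backing array, so callers edit the stored struct directly.
    // The reference goes stale if a later Add() reallocates the array.
    public ref T this[int index]
    {
        get
        {
            if ((uint)index >= (uint)Count) throw new ArgumentOutOfRangeException(nameof(index));
            return ref _items[index];
        }
    }
}

// Hypothetical payload type for the usage example.
public struct Particle { public float X, Y; }
```

Usage would be along the lines of buf[0].X += 2f; or ref Particle p = ref buf[0]; p.Y = 5f;. The same member assignment on a List<Particle> element does not compile, because the List<T> indexer returns a copy of the struct.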
Side-rant regarding the issue of copying data all the time - yeah it’s an issue. I have the following data path in my game (using pointers due to point C) from above)
a) a background thread creates mesh data based on the world (about 50-400 MB worth depending on settings). It builds it into a manually allocated permanent buffer.
b) It manually allocates a precisely fitting temp buffer, copying the permanent buffer into it
c) it queues it on the main thread to be applied to a unity mesh
d) the main thread copies the temp buffer into the backing array of the List<> buffers Unity wants, then frees the temp buffer. The backing array has to be obtained with reflection/emit to circumvent List<>.Add() overhead
e) the main thread calls mesh.SetVertices(), which causes unity to make a copy of it
So here the data is copied at least 3 times (+ any times unity does it internally).
If mesh.SetVertices() worked on other threads, none of those 3 copies would exist. This is probably possible in the modern backends (d3d12, vulkan) but would require a unity rewrite. I hope their ECS progress will add more multithread capable api entries.
I’m quite sure mesh.SetVertices internally grabs the List<> backing array, fixes (pins) it for the GC and then does things with the pointer. It would be great if the API just exposed the array and/or pointer methods directly.