Separate Archetypes based on Dynamic Buffer Internal Capacity

TL;DR: Per-archetype DynamicBuffer internal capacity could improve performance in some cases by around 50% and be far more maintainable than the alternatives. Is there any way to do this?

As the title says, I want a DynamicBuffer whose internal capacity differs by archetype. I read a forum topic similar to my question at Setting A Dynamic Buffer’s Capacity, but I’ve since done additional experimenting and number crunching. The general answer in that topic was to either guesstimate a capacity that works for most things, or create several IBufferElementData types to be reinterpreted as a standard type.
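For reference, that multiple-buffer-types workaround looks roughly like this (the type names here are made up for illustration):

```csharp
using Unity.Entities;
using Unity.Mathematics;

// Two element types that differ only in their in-chunk capacity.
[InternalBufferCapacity(4)]
public struct SmallWaypoint : IBufferElementData { public float3 Value; }

[InternalBufferCapacity(16)]
public struct LargeWaypoint : IBufferElementData { public float3 Value; }

// At the point of use, both variants collapse to the same payload type:
// DynamicBuffer<float3> waypoints =
//     EntityManager.GetBuffer<SmallWaypoint>(entity).Reinterpret<float3>();
```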

Based on experimentation for my case, the optimal guesstimate method takes around 50% longer on average to execute than precisely sized buffers. I haven’t tested the multiple-buffer-types method because of how obnoxious it would be to implement, on top of the maintainability issues it would bring: the magic number from my testing for internal vs. external capacity is 25, meaning I would need 25 different buffer types to hit optimal usage, which would have to be assigned in some grand switch statement or with reflection shenanigans. I could cut it down to 3 different types and expect to land within about 30% of the optimal execution time, plus slightly better chunk utilization, but that’s 30% I could otherwise have, and it doesn’t seem like the improved chunk utilization would offset the other factors in my case.

I also tried setting the DynamicBuffer’s Capacity property both above and below the internal capacity, but saw no change in performance or chunk utilization after entity instantiation.
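Roughly what that test looked like (MyElement is a placeholder element type):

```csharp
// Sketch of the test: adjusting Capacity after the entity already exists.
DynamicBuffer<MyElement> buffer = EntityManager.GetBuffer<MyElement>(entity);
buffer.Capacity = 64; // above the internal capacity
buffer.Capacity = 2;  // below it; neither changed performance or chunk utilization for me
```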

So, is there a way to do this that has been added since the topic I linked, one that I’ve just missed? If not, is such a feature planned? A minimum 30% performance improvement along with improved maintainability and convenience would be nothing to sneeze at. Example syntax could be EntityManager.AddBuffer<T>(entity, internalCapacity).
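To be clear, that second overload is purely hypothetical; as far as I can tell nothing like it exists today:

```csharp
// What exists today:
DynamicBuffer<MyElement> buffer = EntityManager.AddBuffer<MyElement>(entity);

// What I'm proposing (hypothetical, not a real API):
// DynamicBuffer<MyElement> buffer =
//     EntityManager.AddBuffer<MyElement>(entity, internalCapacity: 8);
```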

What you are asking for is a change that improves an extremely specific use case but potentially hurts the general use case. So I apologize for being that guy, but are you sure this is the thing you need to optimize?

50% of 2 milliseconds is totally different from 50% of 0.2 milliseconds. How much time do you think you would save in the final game? Are there any other expensive parts of the game that might be worth optimizing instead of requesting a feature?

I think the feature you might actually want is ArchetypeChunkBufferTypeDynamic. Such a type would let you store the ComponentType of a generic specialization of IBufferElementData in a ChunkComponent.

This is something that came up long ago, and Joachim said they would look into it (low priority).

I have since used an InternalBufferCapacity of 0 to put the buffer on the heap immediately, and .AsNativeArray() for linear access.
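i.e. something like this (the element type name is just an example):

```csharp
using Unity.Collections;
using Unity.Entities;

[InternalBufferCapacity(0)] // no in-chunk storage; the buffer lives on the heap from the start
public struct PathNode : IBufferElementData { public int Value; }

// In a system or job:
// DynamicBuffer<PathNode> nodes = GetBuffer<PathNode>(entity);
// NativeArray<PathNode> view = nodes.AsNativeArray(); // linear access over the heap allocation
// for (int i = 0; i < view.Length; i++) { /* ... */ }
```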

I don’t see how adding this as an optional feature would hurt the general use case.
I’m making a versatile character control system, and every operation within it will go through this DynamicBuffer, so by speeding it up I’m effectively improving the final performance for everyone who uses my system by that amount. Unfortunately, that means I don’t have the luxury of just saying “ah, I don’t need to worry about that.” On my rig, this particular optimization is the difference between handling 1000 complex characters at 60 fps and handling 1300, or (as an estimate, with no numbers to back it up) a battery life of 8 hours instead of 6 for an RTS game on a mobile device.
As for your last point, I’m having a little trouble following. Could you describe how that would work in a bit more detail?

The post you linked is talking about chunk size, not DynamicBuffer capacity.
Yeah, for me the magic number at which heap storage becomes better than an appropriately sized internal buffer was around 25 elements, as I said; past that it’s better to use an InternalBufferCapacity of 0.
I would assume linear access happens by default, yeah? At least I haven’t seen a particularly huge performance hit using a DynamicBuffer over a NativeArray, at least not on the tier you’d expect from nonlinear access. However, the hit from going straight to the heap instead of, say, a single-element appropriately sized internal buffer is over 2x: from 10 ms to 22 ms in my testing.

I don’t think it could be optional. The in-chunk size of the dynamic buffers in an archetype directly affects the data layout inside a chunk and how many entities fit within it. It would also break the tag component optimization: a new tag means a new archetype, which would potentially mean a new in-memory buffer size.

ComponentType is Unity’s struct-based alternative to System.Type, used for locating data of a specific type within a chunk. When you use one of the generic methods, Unity actually fetches the ComponentType of the generic type and then uses it to access the data in the form of a void* pointer. That data then gets cast back into its concrete type by the generic method and returned to you. For IComponentData, however, you can bypass the generic cast to the concrete type and cast to whatever equally sized type you want instead. This is done using ArchetypeChunkComponentTypeDynamic. The hybrid renderer uses it for per-instance properties. However, there is no equivalent API for ArchetypeChunkBufferType.
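A rough sketch of what that looks like for IComponentData, using the Entities 0.x names; the exact names and signatures have shifted between versions, so treat this as illustrative, and MyInstanceProperty is made up:

```csharp
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Entities;
using Unity.Mathematics;

public struct MyInstanceProperty : IComponentData { public float4 Value; }

public class DynamicAccessSystem : SystemBase
{
    protected override void OnUpdate()
    {
        // Resolve the type at runtime instead of through a generic method.
        var dynamicType = GetArchetypeChunkComponentTypeDynamic(
            ComponentType.ReadOnly<MyInstanceProperty>());

        // Later, inside an IJobChunk.Execute, the raw chunk data can be
        // viewed as any equally sized type:
        // NativeArray<float4> values = chunk.GetDynamicComponentDataArrayReinterpret<float4>(
        //     dynamicType, UnsafeUtility.SizeOf<float4>());
    }
}
```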

But as I’m writing this, I realize it probably wouldn’t help you much more than just brute-forcing it: an extractor struct, written with T4, that contains all the generic capacity variants of ArchetypeChunkBufferType and, when fed an ArchetypeChunk and an index, has a method to extract and reinterpret the buffer at that index.
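Something along these lines, hand-written here for just two capacity variants and reusing the example element types from the first post (the T4 template would emit the rest; again, Entities 0.x era API, so take the exact calls with a grain of salt):

```csharp
using Unity.Entities;
using Unity.Mathematics;

public struct WaypointExtractor
{
    // Handles obtained via GetArchetypeChunkBufferType<T>() on the system.
    public ArchetypeChunkBufferType<SmallWaypoint> Small; // [InternalBufferCapacity(4)]
    public ArchetypeChunkBufferType<LargeWaypoint> Large; // [InternalBufferCapacity(16)]
    // ... one field per generated capacity variant ...

    // Given a chunk and an entity index, find whichever variant the chunk
    // actually has and reinterpret it to the common payload type.
    public DynamicBuffer<float3> GetBuffer(ArchetypeChunk chunk, int index)
    {
        if (chunk.Has(Small))
            return chunk.GetBufferAccessor(Small)[index].Reinterpret<float3>();
        return chunk.GetBufferAccessor(Large)[index].Reinterpret<float3>();
    }
}
```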

I’m assuming your dynamic buffer size is the bone count of a mesh using a specific Shared Component? In that case, if your goal is to handle this many characters, you should really be using animation textures. Otherwise, if it were me, I would spend my effort trying to get a 5x speedup by optimizing the Burst assembly rather than worrying about this problem. Once the assembly is optimized and array access is truly the bottleneck (I’m really struggling to believe that given the numbers you provided, as O(n) algorithms are rarely the bottleneck without ridiculous amounts of data), then my proposal above may offer you a solution (and I would definitely use caching with ChunkComponents to speed up the archetype parsing).

Thank you for your in-depth reply!

By character control system I mean a movement and physics interaction system, and by complex I mean on the tier of, say, Assassin’s Creed paired with Mario Galaxy: lots of raycasts for spatial awareness and the like. But now that I write that out, I realize how unlikely it is that anyone would need 1300 fully featured anti-gravity assassins running around where 1000 wouldn’t suffice. Not that it wouldn’t be cool.

That’s the whole idea, and I believe it’s what causes the speedup when I use more precisely sized data layouts. The way I see it working is just “in this chunk, dedicate 8 bytes to the relevant DynamicBuffers, and in that chunk, dedicate 16 bytes,” so the one with 8 bytes fits more entities in each chunk and pulls more relevant data with each cache line. Since they’re in different chunks, I don’t see how that would affect optimizations, but then again, I’m still relatively new to data-oriented design.

It just seems like a very obvious missing feature, being able to specify an array size, and if it existed I’d get a 50% speedup for free on top of the already incredible performance. So far, every time I’ve thought something was an obvious feature for improving performance with DOTS, it turned out there was a way to do it, but it seems that’s not the case here.

Ok, I think maybe you are seeing an increase in performance because of the split itself, not the buffer size. Are you testing this logic in isolation? How many chunks did you have before the split?
IJobForEach, IJobChunk, and Entities.ForEach (JobComponentSystem) process chunks per thread. If all your entities are in the same chunk, you will not get any parallelism. Separate buffer types create different archetypes, which result in different chunks. That difference could be the real cause of the improved performance. In a real-world context you will have other jobs running, so this could be less impactful.
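For illustration, a minimal IJobChunk with the Entities 0.x era signature; Execute runs once per chunk, and chunks are the unit handed out to worker threads:

```csharp
using Unity.Burst;
using Unity.Entities;

[BurstCompile]
struct PerChunkJob : IJobChunk
{
    public void Execute(ArchetypeChunk chunk, int chunkIndex, int firstEntityIndex)
    {
        // Work here runs on one worker thread for this whole chunk,
        // so a single giant chunk means a single busy thread.
    }
}
```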

As for how this can affect optimizations: it’s a structural change. It would affect every other system that uses the same archetype. Maybe you can improve this system, but that could have a negative impact on every other system. That’s why you shouldn’t profile and optimize in isolation. Even something that parallelizes well can carry an overhead that impacts the total time when you have a lot of jobs running together.

Cheers!

Unfortunately, I kind of have to, since at the moment it’s a system I’m making to be used by other people rather than a full game in itself. Do you think a 50% difference in speed is too small to be relevant when tested in isolation? I haven’t done much profiling on completed games myself.

Unsure what you mean by “split”, but in my testing the job operates on 500,000 entities, which are divided into anywhere between 1250 and 7575 chunks depending on what I use as the single InternalBufferCapacity, which is exactly what I’m looking to change per archetype to minimize the number of chunks created. I measured speed by taking the median of 10 or so samples of what the profiler reported as the total ms for the job. In the most extreme case, across the 500,000 entities:

- InternalBufferCapacity of 1 (exact fit): 10 ms
- InternalBufferCapacity of 3 (the guesstimated optimum, with 2 wasted slots in this case): 13 ms
- InternalBufferCapacity of 0 (straight onto the heap): 22 ms
- InternalBufferCapacity of 25 (my upper limit for a worthwhile internal capacity): 31 ms

Nothing wrong with optimizing things that let crazier stuff happen on weaker hardware. I’m trying to understand your data and access patterns to get a feel for which algorithms are actually bottlenecking the system. Most likely, changing the access pattern or data layout can bypass the issue you’re seeing.

You missed my point. My point was that the cost of figuring out the chunk layout would go up significantly.

Anyways, if you are willing to share sample code and performance measurements, I am willing to help you optimize in other ways. Preferably pick the most expensive job in your system, as that will be the one worth optimizing the most.