A case for instantiating prefabs by replacing existing entities or entity pooling.

Hello, fellow DOTS users and developers. I’ve been working on a small project for the past few weeks using Unity’s Entities framework, trying to squeeze out every last bit of performance. I was able to bring the simulation time down from 12-15ms to 4-5ms by replacing structural changes with polling and by using batch methods instead of EntityCommandBuffer. But one pattern is common in my project, namely creating and destroying a lot of entities every frame, and it remains one of the bottlenecks. I’d like to propose a feature that would help alleviate it.

Roughly, I propose a method called EntityManager.InstantiateAndReplace that takes a prefab and an array of entities. The method copies the component data of the prefab into all of the specified entities and increments their version. This is equivalent to destroying the old entities and instantiating the new ones, but should be much faster since it doesn’t involve moving the entities in memory, just writing the data to the destination. And it can mostly be done in parallel, except probably for the version incrementing part. The method should also handle LinkedEntityGroup correctly, replacing the associated entities as well.
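
To make the proposal concrete, here is roughly how I imagine the call site. The method is hypothetical, and tree_prefab, entities_pending_destruction and new_positions are placeholder names:

// hypothetical API: copies the prefab's components into each target entity in place,
// bumps the targets' versions and fixes up LinkedEntityGroup, instead of destroy + instantiate
EntityManager.InstantiateAndReplace(tree_prefab, entities_pending_destruction);

// the reused entities can then be initialised just like freshly instantiated ones
for (var i = 0; i < entities_pending_destruction.Length; i++)
    EntityManager.SetComponentData(entities_pending_destruction[i], new Translation { Value = new_positions[i] });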

I’ll describe my specific scenario to provide some context. Basically, I have a bunch of agents gathering wood. Each agent looks for a tree in its vicinity, approaches it and cuts it down. When the tree is cut, it’s destroyed, and a stump and a trunk are created in its place. The agent picks up the trunk and brings it to the nearest storage. Meanwhile, to maintain a constant number of entities in the simulation, new trees are spawned each frame, and the trunks and stumps are destroyed after some period of time. All of this happens at a rapid rate: each frame of the simulation corresponds to several seconds of “real” time, and the goal is to maintain a steady 60 fps at the highest game speed.

In this scenario, with 100,000 trees and 20,000 agents, roughly 1,000 entities are created and destroyed each frame. Right now I just destroy all the entities in a batch and instantiate the new ones. This takes about 1ms of the total 5ms simulation frame time (mostly the destruction part). It also heavily fragments the chunks for some entities, so the real cost is higher. But if I could instead replace the entities pending destruction with the ones being instantiated, the cost should drop to basically zero.

I should mention that I already implemented a pooling system for some entities, specifically the AI tasks. They have a single main component unique to the task type and a set of component types they all share, so I just explicitly fill them upon recreation (with a separate system for each task type, all based on a single generic system). The cost did go down to basically zero for them. But for most other systems I need a more general solution, since the set of components is unknown at compile time and is parametrised by the prefab.
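
For reference, here is a rough sketch of the shape of that pooling, heavily simplified; all the names below are placeholders, not my actual components. The idea is that finished tasks are never destroyed: a polled flag marks them as free, and a new task just overwrites the data of a free entity, so no structural changes are involved.

using System.Collections.Generic;
using Unity.Collections;
using Unity.Entities;

// the set of components shared by all task types
public struct task_common : IComponentData { public Entity agent; public bool is_free; }

// single generic implementation, one concrete system per task type
public abstract class task_pool_system<t_task> : SystemBase where t_task : struct, IComponentData {
    EntityQuery tasks_query;
    readonly List<Entity> free_tasks = new List<Entity>();

    protected override void OnCreate() =>
        tasks_query = GetEntityQuery(ComponentType.ReadOnly<t_task>(), ComponentType.ReadOnly<task_common>());

    protected override void OnUpdate() {
        // rebuild the free list by polling the flag; entities reused this frame drop out next frame
        free_tasks.Clear();
        using var entities = tasks_query.ToEntityArray(Allocator.Temp);
        using var commons = tasks_query.ToComponentDataArray<task_common>(Allocator.Temp);
        for (var i = 0; i < entities.Length; i++)
            if (commons[i].is_free)
                free_tasks.Add(entities[i]);
    }

    // reuse a free entity of this task type by explicitly re-filling its fixed set of components
    // (assumes the pool was created large enough; otherwise fall back to instantiation)
    protected Entity acquire(t_task task_data, Entity agent) {
        var e = free_tasks[free_tasks.Count - 1];
        free_tasks.RemoveAt(free_tasks.Count - 1);
        EntityManager.SetComponentData(e, task_data);
        EntityManager.SetComponentData(e, new task_common { agent = agent, is_free = false });
        return e;
    }
}

public struct chop_tree_task : IComponentData { public Entity tree; }
public class chop_tree_task_pool_system : task_pool_system<chop_tree_task> { }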

In the meantime, to test the cost of random destruction and instantiation in batches, I made a simple project. It’s available here: https://github.com/dmitry-egorov/unity-ecs-batch-speed-test/blob/main/Assets/Scripts/Instantiation.cs. Look at the destroy_and_instantiate system: each frame it destroys a random subset of entities and creates new ones to replace them.

destroy_and_instantiate system

protected override void OnUpdate() {
    // remaining frames of the current iteration
    var remaining_frames = GetSingleton<has_current_iteration>().remaining_frames;
   
    // randomly determine entities to destroy. Gather them in the destroyed_entities list.
    var subjects_count = subjects_query.CalculateEntityCount();
    using var destroyed_entities = new NativeList<Entity>(subjects_count, TempJob);
    var j1 = Entities.WithName("EnqueueDeletes")
    .WithAll<is_a_test_subject>()
    .ForEach((int entityInQueryIndex, Entity e) => {
        if (hash(uint2((uint) entityInQueryIndex, remaining_frames)) < (uint.MaxValue / 4 * 3))
            destroyed_entities.AddNoResize(e);
    })
    .WithStoreEntityQueryInField(ref subjects_query)
    .Schedule(Dependency);

    // gather entities to spawn, generate positions for them
    var count = spawns_query.CalculateEntityCount();
    using var spawn_prefabs = new NativeList<Entity>(count, TempJob);
    using var spawn_counts = new NativeList<int>(count, TempJob);
    using var positions = new NativeList<float2>(count * 100, TempJob);
    var j2 = Entities.WithName("GeneratePositions")
    .ForEach((int entityInQueryIndex, in spawns spawns, in Translation translation) => {
        var current_prefab_i = spawn_prefabs.Length - 1;
        var instance_count = (int)spawns.count_per_fame;
        var prefab = spawns.prefab;
        if (current_prefab_i != -1 && spawn_prefabs[current_prefab_i] == prefab)
            spawn_counts.get_ref(current_prefab_i) += instance_count;
        else {
            spawn_prefabs.Add(prefab);
            spawn_counts.Add(instance_count);
        }

        var spawn_hash = hash(uint2((uint)entityInQueryIndex, remaining_frames));
        var random = Random.CreateFromIndex(spawn_hash);
        for (var i = 0; i < instance_count; i++) {
            var position = translation.Value.xz + random.NextFloat2(new float2(-100, -100), new float2(100, 100));
            positions.Add(position);
        }
    })
    .WithStoreEntityQueryInField(ref spawns_query)
    .Schedule(Dependency);
   
    // wait for the scheduled jobs to complete
    using (Profile("Wait for jobs"))
        JobHandle.CombineDependencies(j1, j2).Complete();
   
    // destroy entities
    using (Profile($"Destroy {destroyed_entities.Length} subject entities"))
        EntityManager.DestroyEntity(destroyed_entities);

    // instantiate entities
    var instances_count = positions.Length;
    using var instances = new NativeArray<Entity>(instances_count, TempJob);
    using (Profile($"Instantiate {instances_count} entities")) {
        var data_index = 0;
        for (var spawn_i = 0; spawn_i < spawn_prefabs.Length; spawn_i++) {
            var spawn_count = spawn_counts[spawn_i];
            var sub_array = instances.GetSubArray(data_index, spawn_count);
            EntityManager.Instantiate(spawn_prefabs[spawn_i], sub_array);
            data_index += spawn_count;
        }
    }
   
    // set positions of the instantiated entities
    using (Profile("Set positions")) {
        Dependency = new set_positions_job {
            instances = instances,
            positions = positions,
            translation_w = GetComponentDataFromEntity<Translation>()
        }.ScheduleParallel(instances_count, 7, Dependency);
        Dependency.Complete();
    }
}

[BurstCompile] struct set_positions_job: IJobFor {
    [ReadOnly] public NativeArray<Entity> instances;
    [ReadOnly] public NativeArray<float2> positions;
    [NativeDisableContainerSafetyRestriction] [WriteOnly] public ComponentDataFromEntity<Translation> translation_w;
    public void Execute(int i) =>
        translation_w[instances[i]] = new Translation { Value = positions[i].x0y() };
}

The creation and destruction of 2,500 entities takes around 1.5ms (mostly the destruction), and everything else is basically negligible.

So, I’d like to get some feedback on the problem and the proposed solution. Maybe someone has already implemented something similar in their project? Or maybe there’s another solution for this? Please let me know.

Edit: fixed some errors in the code and grammar

What makes this tricky is that you need to ensure that the destroyed entities match the archetype of the prefab minus the prefab component. But even if you have that, what makes this brutally difficult is mapping the LinkedEntityGroup correctly. I can’t think of a way to do that without heavy random access, so at that point I think you would be better off destroying and creating new entities.

But destroying entities is a lot slower than necessary because Unity tends to destroy them in tiny unordered batches.

I’m actually running into more expensive sync points from instantiation and destruction. I decided to optimize instantiation from command buffers by creating a custom command buffer that does the batching but also the initialization of data. However, I also observed performance issues when destroying entities, and I’ve explained the issue in more detail here: Latios-Framework/Documentation~/Optimization Adventures/Part 4 - Command Buffers 1.md at v0.3.1 · Dreaming381/Latios-Framework · GitHub

I agree, the general solution would be a bit complicated. There are questions like: what to do if the target entity has extra components? Set them to their default values or leave them as is? What to do if the target lacks components that the prefab has? Add the components to the entity, throw an exception, or just ignore them? And what to do if the target array has duplicate entities?

It’s complicated, but I still think it’s feasible. It requires configuring the behaviour with a parameter, or splitting the API into a few methods with different semantics. At the very least, I’d like to see a general method that copies the data of all components from one entity to another. That would solve half of the problem already. Maybe you know of an implementation of such a method?

As for the random access, I think it’s just the nature of the domain problem in our case. There’s no way to tell which entities will be destroyed beforehand and group them in chunks without making structural changes each frame with even higher cost. If there was a way to do that, we’d just destroy the entities with a query, which is free.

Since this is the case, we need to accept the fact that we will be accessing memory randomly and optimise what we can. And since writing to a random set of entities can be done in parallel, it is at least an order of magnitude faster, compared to shuffling the same random set in memory on a single thread.

That’s an interesting read, thanks! I’ve only skimmed it for now; I’ll read it more thoroughly when I have more time.

One comment I’d like to make is that I believe the sorting is the weakest part of the EntityCommandBuffer design from a performance standpoint. Considering that its general cost is O(n log n) and that it is done on a single thread, any gains you make by scheduling your commands in parallel are immediately overwhelmed by the cost of sorting. That’s why I tend to gather the data for structural changes in order and then use batch operations (you can see that in the code from my previous post). And if there are expensive operations to perform beforehand, they can be done in a separate parallel loop, writing the results into an array indexed by entityInQueryIndex; the array can then be processed and filtered in order (see the sketch below). That tends to be faster than concurrently gathering the data in a queue and then sorting it on a single thread.
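
A minimal sketch of that pattern, assuming a subjects_query field like in the code from my previous post; the flag condition here just stands in for whatever the expensive per-entity decision is:

// 1) do the expensive per-entity work in parallel, writing results by entityInQueryIndex
var count = subjects_query.CalculateEntityCount();
using var candidates = new NativeArray<Entity>(count, Allocator.TempJob);
using var flags = new NativeArray<byte>(count, Allocator.TempJob);

Entities.WithAll<is_a_test_subject>()
.WithNativeDisableParallelForRestriction(candidates)
.WithNativeDisableParallelForRestriction(flags)
.ForEach((int entityInQueryIndex, Entity e, in Translation t) => {
    candidates[entityInQueryIndex] = e;
    flags[entityInQueryIndex] = t.Value.y < 0f ? (byte)1 : (byte)0; // placeholder decision
})
.ScheduleParallel(Dependency).Complete();

// 2) filter sequentially, preserving the query order, so the batch stays deterministic
using var to_destroy = new NativeList<Entity>(count, Allocator.TempJob);
for (var i = 0; i < count; i++)
    if (flags[i] == 1)
        to_destroy.AddNoResize(candidates[i]);

// 3) feed the ordered batch to the batch API
EntityManager.DestroyEntity(to_destroy.AsArray());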

Adding a component to an existing entity can be as expensive as destroying an entity and instantiating a new one. So if the archetype doesn’t match, this is no longer a valid optimization. That’s what makes it tricky.

There have been a few threads about this. The real issue is updating the entity-in-chunk tables to ensure the index in chunk refers to a new entity.

If you profiled this you would know this isn’t true at all. The bottleneck is that even with the batch API, Unity tends to destroy entities entity-by-entity, randomly jumping between chunks to do so, so every component it destroys ends up being a cache miss for every single entity. Sorting is pretty cache-optimized in comparison and can actually lead to performance improvements.


Yep, that’s absolutely correct, adding a component is a bit more expensive than destroying an entity. I was talking about a general API, though, and that is the behaviour most users would expect from a method called InstantiateAndReplace. Hence the need for configuration parameters or a more granular API.

I’ve seen a few posts on the topic, so it seems the demand for it is there. Can you point me to the specific discussion on the entity-in-chunk tables? I’d appreciate it. From my understanding, since the entity stays in the same place in the chunk, the only thing that needs to change is its version number, but I’d be happy to be corrected on that.

I wonder how you profile cache misses in Unity? The last time I tried to profile a Unity app with VTune, around a year ago, I spent a whole day and got no results in the end; it just refused to work and either showed nonsensical data or no data at all…

Anyway, it seems to me you’re gathering the batch in parallel for the batch API. Is that right? If so, that is certainly one of the causes of a lot of unnecessary cache misses. It also makes the program non-deterministic (which might not matter for a specific program, but it’s still a consideration). If I understand correctly (I might not), you’re sorting the batch to mitigate the issue. But that’s more expensive than just gathering the batch sequentially, in order, in the first place. That was my point.

As a side note, if you have to sort, there’s a better sorting strategy than a general radix or heap sort, if you can make certain assumptions. I’m talking about this one: https://www.geeksforgeeks.org/merge-k-sorted-arrays/. If you gather the data into thread-local lists (by using threadIndex), and if you can assume each thread gathers its data in order (in other words, each thread-local list is already sorted), you can merge the lists instead of sorting the whole data set. It should be O(n log k) instead of O(n log n), where k is the number of threads. The min-heap implementation should be cache-friendly as well. I think Unity doesn’t use this strategy because they can’t in general guarantee that the lists are sorted, since that depends on the way the API is used. But in practice you should be able to make this assumption more often than not.
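
To illustrate, here’s a sketch of that merge in plain C# (a job-friendly version would use NativeContainers and a fixed-size heap). Each input list is assumed to be already sorted, for example one list per thread index:

using System.Collections.Generic;

static class sorted_merge {
    // merges k already-sorted lists in O(n log k) using a binary min-heap over the list heads
    public static int[] merge(IReadOnlyList<int[]> lists) {
        var total = 0;
        for (var l = 0; l < lists.Count; l++) total += lists[l].Length;
        var result = new int[total];

        // heap entries: (current value, which list it came from, position within that list)
        var heap = new List<(int value, int list, int index)>();

        void sift_up(int i) {
            while (i > 0) {
                var parent = (i - 1) / 2;
                if (heap[parent].value <= heap[i].value) break;
                (heap[i], heap[parent]) = (heap[parent], heap[i]);
                i = parent;
            }
        }

        void sift_down(int i) {
            while (true) {
                int left = 2 * i + 1, right = 2 * i + 2, smallest = i;
                if (left < heap.Count && heap[left].value < heap[smallest].value) smallest = left;
                if (right < heap.Count && heap[right].value < heap[smallest].value) smallest = right;
                if (smallest == i) break;
                (heap[i], heap[smallest]) = (heap[smallest], heap[i]);
                i = smallest;
            }
        }

        // seed the heap with the head of each non-empty list
        for (var l = 0; l < lists.Count; l++)
            if (lists[l].Length > 0) { heap.Add((lists[l][0], l, 0)); sift_up(heap.Count - 1); }

        // repeatedly pop the global minimum and push that list's next element
        var written = 0;
        while (heap.Count > 0) {
            var (value, list, index) = heap[0];
            result[written++] = value;

            if (index + 1 < lists[list].Length)
                heap[0] = (lists[list][index + 1], list, index + 1);
            else {
                heap[0] = heap[heap.Count - 1];
                heap.RemoveAt(heap.Count - 1);
            }
            sift_down(0);
        }
        return result;
    }
}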

Anyway, I was talking about comparing EntityCommandBuffer.ParallelWriter and the batch API in the first place. I take the batch API as the optimal baseline, given that the batch is gathered in order. If the entities are close to each other in memory, they will be close in the batch, and you don’t take a lot of cache misses. If the entities are far from each other, you have to take a cache miss; there’s no way around it.

When I tested this with my random destruction scenario, the EntityCommandBuffer was 1.5-2x slower. I don’t have the specific numbers around, but that’s the result I remember. Since the only difference between them is the sorting, I assume it takes that much time. Either that, or the sorting doesn’t produce an optimal order (I don’t see a reason for that to be the case, though). This result will of course depend on how sparse your entities are in memory in the first place. If they’re very sparse, the cache misses might overwhelm the cost difference, and that’s probably what happens in your case. But as I said, there’s no way around that in general, only in specific cases where you can move entities closer to each other using some domain knowledge.

There’s also some API for getting created and destroyed entities, which requires book-keeping. And there may be stuff related to subscenes and live link as well. I haven’t investigated this space though, nor am I aware of any threads that go into this.

Yes. My use case is different from yours.

Actually not really. My custom command buffer data structure is extremely cache-efficient.

I sort it to remove the issue altogether.

I never implied that what you are doing wasn’t faster than what I do. Your use case has more metadata about when things can be destroyed, so naturally you can make more aggressive optimizations. My point is that the sorting is relatively cheap compared to the batch destruction.

Again, you are making assumptions I cannot make. And for what it is worth, ECB makes this assumption (it uses a priority queue of ordered chains) and it absolutely chokes when iterating entities backwards, which is common in some list pruning algorithms.

Radix sort is O(k * (n + m)), where k is the number of radix passes required to represent the key and m is the number of values representable by one radix. The sweet spot for radix sort is between 4k and 40k elements, based on my testing. That happens to be in line with when structural changes are expensive enough to need optimization but not expensive enough to be unviable.
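
For illustration, here’s what that looks like for 32-bit keys with an 8-bit radix, i.e. k = 4 passes over n elements with m = 256 buckets each (just a sketch of the textbook LSD algorithm, not the implementation I actually use):

// LSD radix sort on uint keys: k = 4 passes, m = 256 buckets, O(k * (n + m)) total
static void radix_sort(uint[] keys) {
    var n = keys.Length;
    var buffer = new uint[n];
    var counts = new int[256];

    for (var pass = 0; pass < 4; pass++) {
        var shift = pass * 8;
        System.Array.Clear(counts, 0, 256);

        for (var i = 0; i < n; i++)                 // count occurrences of each digit
            counts[(keys[i] >> shift) & 0xFF]++;

        var sum = 0;                                // prefix sum -> start offset of each bucket
        for (var d = 0; d < 256; d++) { var c = counts[d]; counts[d] = sum; sum += c; }

        for (var i = 0; i < n; i++)                 // stable scatter into the other buffer
            buffer[counts[(keys[i] >> shift) & 0xFF]++] = keys[i];

        (keys, buffer) = (buffer, keys);            // swap roles for the next pass
    }
    // after an even number of passes the caller's array holds the fully sorted keys
}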

Not only do the entities have to be close together in memory, they have to be sequential in the chunks and in the input array you provide to EntityManager. Otherwise there is no batching, and entities are destroyed (which is really the act of copying another entity’s components into the destroyed entity’s slot) one by one, which thrashes the cache because the components are not adjacent in memory. For very small entities, this may not be a big deal. But the components added by HR V2 (Hybrid Renderer V2) are enough to generate evictions in L1. In addition, LinkedEntityGroup gets destroyed recursively by the batches of the roots, meaning if you have lots of batches of 1 (which is common), then you have lots of small batches of entities destroyed via LinkedEntityGroup rather than one large batch.


The reason for that is that swap-back-on-destroy generates entropy over time and degrades performance. I don’t know what your criteria are for destroying an entity, but most use cases get burned by this.