DynamicBuffer<T> performance doesn't scale - What to do?

Hello all!

I’m having trouble with DynamicBuffer: performance doesn’t scale well, and is actually pretty horrible.
Scorr on Discord suggested using [InternalBufferCapacity(0)] to move the buffers out of the chunks, which helped, but not enough.
In my stress test I have 250k entities with a few DynamicBuffers. Some are smaller, some bigger. One in particular holds 1 int and 2 uints. Not that much, I’d say.

Just running a ForEach in a SystemBase with such a buffer in the query puts a lot of strain on the main thread. ScheduleParallel brings the problem down a little, but the cost is still way too high.

Now the absurd part: all these buffers are empty! Nothing is done in these systems; just the query and the mere existence of the job is enough to put a heavy load on the main thread, and I don’t get why at all.

Any tips? Have you found out the same? How are you handling DynamicBuffers?
I’m on the verge of using system-internal NativeHashMaps and associating them with a NativeArray.

1 Like

The main thread? Are you using Run()? That’s still 250k entities it has to iterate through, even if all it is doing is checking that the length of each buffer is zero.

Can you share a profiler capture or at the very least a profiler timeline view with both the main thread and job threads? I’ve been doing a lot of work with dynamic buffers lately and haven’t seen any scaling issues.

2 Likes

I’ve done a test just now (using tags) which failed hard; I’ll post a capture or screenshot later. Sadly I don’t think it gives many clues: there’s no Dependency in the flow that could explain a stall, and when it runs in parallel it’s nicely split up over 6 worker jobs.

So yeah, it runs on ScheduleParallel. I don’t think 250k is that much.
It doesn’t matter what I do in the job: a zero-length check, more code, or no code at all. Just scheduling/running the job in any mode is enough.

How many entities can you go up to? Done any stress tests? I mean, obviously it’s not a problem with 10k or something small.

How many entities? That really depends on the algorithms and game logic being performed on them. I know for LSSS, I spawn 10k ships with 12 entities each and the simulation peaks with about 40k bullets which are single entities. So that’s 160k entities at a solid 60 FPS on my 5820k back when I was running dual channel RAM (I switched to quad channel and now it runs about 50% faster). I have benchmark levels that push far higher entity counts where my system starts to choke. Also right now I have 10k skeletons with 22 bones each and two extra entities for a total of 240k entities animating at 60 FPS. But if I switch to using dynamic buffers for the bone data, I get 25k skeletons. Same amount of data, same operations, fewer entities. But the reason that is faster is that I avoid random accesses when computing the hierarchy transforms.

Anyways, I need real data from you before I can provide any meaningful suggestions.

1 Like

Ok, cool. Thanks for that. How many entities can you fit into one chunk? I can get 60 into one chunk where the buffers are. Maybe one of my problems is that the archetype is too big?

The example with the ships and 12 entities you’ve given is interesting as I’m also in the process of splitting up a very big algorithm. Are those 12 entities working on their own self-contained chunks or does an entity need access to the other ones as well?
That’s maybe my biggest problem right now: the algorithm itself can be split into phases, but overall it needs lots of data. I’ve split it into 3 entities each, but in later phases I need a lot of access to those 3 entities and their components.
Well, I have to figure out how to do that better, but I think it has not much to do with my DynamicBuffer problem right now, unless the archetype being too big really is the bottleneck.

60 should be fine granularity. I wouldn’t expect more to improve performance by more than 25% when operating on dynamic buffers.

Most of them are renderers so the main cost is the transform hierarchy update. But I use a couple of them to mark bullet spawn points and I keep the sensor data and control data on separate entities.

1 Like

Random tip for DynamicBuffer.

If you’re not resizing it (no Add calls), do an AsNativeArray() before using it. It’s much faster, and the compiler can optimize it better when it doesn’t have to run

public T this[int index]
{
    get
    {
        return UnsafeUtility.ReadArrayElement<T>(BufferHeader.GetElementPointer(m_Buffer), index);
    }
    set
    {
        UnsafeUtility.WriteArrayElement<T>(BufferHeader.GetElementPointer(m_Buffer), index, value);
    }
}

public static byte* GetElementPointer(BufferHeader* header)
{
    if (header->Pointer != null)
        return header->Pointer;

    return (byte*)(header + 1);
}

on every index read/write

Might not seem like much, but benefits can be measured if you have a bunch of data in your buffers.
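As an illustration, a read/write pass over a buffer could cache the array once up front (a sketch only; `MyElement` and its `value` field are made-up names):

```csharp
// Hypothetical buffer element type, for illustration only.
public struct MyElement : IBufferElementData
{
    public int value;
}

// Inside a job or Entities.ForEach body, with no Add/RemoveAt happening:
static void DoubleAll(DynamicBuffer<MyElement> buffer)
{
    // One pointer resolution up front instead of one per element access.
    var array = buffer.AsNativeArray();
    for (int i = 0; i < array.Length; i++)
    {
        var element = array[i];
        element.value *= 2;
        array[i] = element; // writes straight through to the buffer's memory
    }
}
```

The array aliases the buffer’s storage, so it is only valid as long as the buffer isn’t resized.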

7 Likes

I’ve got a capture for you. Maybe it helps and you’ll see something I don’t. It’s a capture from a Unity 2020.3.20f1 development build running 250k entities that are able to cast spells.
I figured I could merge the 3 systems working on the buffers into one: 2 just clear, and 1 checks whether cooldowns are outdated and can be removed.

var clearJob = Entities
    .WithAny<AvatarCombatState>()
    .ForEach((DynamicBuffer<SpellCombatEffect> combatEffects, DynamicBuffer<TriggeredBuffer> triggerBuffer, DynamicBuffer<SpellCooldown> cooldowns) =>
    {
        if (combatEffects.Length > 0)
            combatEffects.Clear();
        if (triggerBuffer.Length > 0)
            triggerBuffer.Clear();

        if (cooldowns.Length > 0)
        {
            var cooldownsArray = cooldowns.AsNativeArray();

            for (int i = cooldownsArray.Length - 1; i >= 0; i--)
            {
                //Debug.Log("Actually doing something");
                var tmp = cooldownsArray[i];

                if (tick > tmp.endTick + 1000)
                {
                    // TODO
                    // prediction removes the cooldown before it's really finished
                    // due to rollback
                    cooldowns.RemoveAt(i);
                }
            }

            cooldownsArray.Dispose();
        }
    }).ScheduleParallel(Dependency);

Not one of these buffers has any elements!
So the merging helped with the overhead. The 3 systems added up to 1ms before, and now it’s 0.68ms. Still a lot for no data. :smile:
Take a look at the flat portion of the capture; at the beginning there’s a combatEffectBuilder LambdaJob0, which is the merged system now.
Also feel free to look at the spike where spells are cast. This is another part I’ve not figured out: the EntityCommandBuffer goes crazy, as I need to instantiate the spells, and 250k is a lot. Really only Instantiate is used, as I’ve created the archetypes in advance. Still, trash performance sadly :frowning:

The SpellCastSystem jobs run on chunks that fit 86 entities. Before the optimisation and splitting up of the data, only 11 fit. It didn’t change a lot in terms of milliseconds, but it’s a lot better than before.

7586422–940525–EditorProfiling2.zip (6.1 MB)

Alright. It is a little difficult to get a sense of scale since your system is different from mine. But I’m going to assume your 250k entities all have transforms and are all moving every frame. If some are missing transforms or aren’t moving, then change filters may be taking effect. But under the assumption that this data isn’t biased, yeah your dynamic buffer systems are a little slower than I would expect (and would absolutely kill my machine :p).

First, clear your buffers regardless of whether or not they are empty. This simply sets an int to 0 so there’s no need to pay the cost of the if statement.

Second, don’t call Dispose when using AsNativeArray(). While the deallocation is a no-op, Burst may still be checking the allocator type.
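Applied to the posted job, the first two suggestions would look roughly like this (a sketch over the same buffer names):

```csharp
// Clear() just sets the length int to 0, so clearing unconditionally
// is cheaper than branching on Length first.
combatEffects.Clear();
triggerBuffer.Clear();

// AsNativeArray() with no Dispose afterwards: the buffer owns the memory,
// so nothing leaks and Burst never has to inspect the allocator type.
var cooldownsArray = cooldowns.AsNativeArray();
for (int i = cooldownsArray.Length - 1; i >= 0; i--)
{
    if (tick > cooldownsArray[i].endTick + 1000)
        cooldowns.RemoveAt(i);
}
```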

Third, there’s a much faster algorithm for your buffer filtering. Here’s a naive implementation:

int w = 0;
for (int i = 0; i < cooldownsArray.Length; i++)
{
    var tmp = cooldownsArray[i];
    cooldownsArray[w] = tmp;
    w += math.select(1, 0, tick > tmp.endTick + 1000); // Is it possible to subtract 1000 from tick outside this ForEach()?
}
cooldowns.Length = w;

And fourth, if all your ECB does is spawn prefabs and initialize some components, this is a problem I have definitely explored optimizing and reached some pretty satisfying results: Latios-Framework/Documentation~/Optimization Adventures/Part 4 - Command Buffers 1.md at v0.4.2 · Dreaming381/Latios-Framework · GitHub

Try those out and post any speedups you observe!

1 Like

Also, the cooldowns array does not need a Dispose. The Dispose will be ignored automatically in this case, since the buffer owns the data, not the array, but there’s no need to run that code…

5 Likes

One additional note… DynamicBuffer can be quite heavy on runtime safety-check usage, since the checks run on every data access. (Entities.ForEach, on the other hand, is able to do just one check for the whole component array in a batch.)

So if you are profiling in the editor, you might want to turn off the Jobs Debugger and turn off Safety Checks in the Burst menu, to get a better approximation of the perf you get in a player build (where those are always stripped out).

8 Likes

That’s all pretty tight! :smile: Very interesting optimisation; I never would have thought of just writing the valid cooldowns back and setting the length. I would have assumed the write operation costs more, but then again, the buffer has to reorder and swap back on RemoveAt anyway.

Thanks also for the posted article. Some very interesting stuff I can implement. It should be a solution for another problem I’m having. Basically I have 3 systems in SpellCasting, and the first is the broad-phase check. The problem is that depending on the outcome of the first phase, the other 2 shouldn’t run on those entities. I tried the naive approach with an ECB that tags the entities for the next phase, which ended in failure: the ECB was taking way too long. But I can batch it up, which should help a lot compared to writing single AddComponent commands to the ECB.
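For what it’s worth, a sketch of the batched tagging (the tag and query names are made up; note that a batched structural change on a query still implies a sync point):

```csharp
// Hypothetical tag used to exclude entities from the later phases.
public struct SkipNarrowPhase : IComponentData { }

// One structural change for the whole query, instead of one recorded
// AddComponent command per entity in an EntityCommandBuffer.
EntityManager.AddComponent<SkipNarrowPhase>(failedBroadPhaseQuery);
```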

Thanks Joachim! As I was always forgetting some of those settings I test now exclusively with develop builds and connect the profiler.

The question that is still open, and whose why and how I’d like to explore:
Running an Entities.ForEach over some DynamicBuffers takes a lot of CPU time inside the ECS framework; the code itself doesn’t matter much. It’s the cost of data acquisition that I’m interested in bringing down.

I’m not sure if that’s just something we have to live with, or something we can optimise further by changing our archetypes and data layout.

As mentioned in the initial post, setting InternalBufferCapacity to 0 already improves performance. I don’t really see the point of moving every buffer out of its chunk, but if it helps, I’ll take it. :sunglasses: There are some mechanics in ECS I’m just not familiar with yet, so any information is appreciated.

1 Like

Feel free to give it a shot, but I would personally avoid structural changes for a per-frame evaluation.

The code actually matters a lot, because the best way to reduce that time is to not acquire buffers you don’t need. The only way to do that is to look at the complete problem.

I’m going to make some assumptions about your problem which may or may not be true. If they are true, then great! If not, then this will serve as an example of why the complete picture matters and what kinds of performance optimizations are possible.

My assumptions are:

  1. Cooldown buffers need to be present at all times, because using ECB to add them when necessary isn’t practical due to the addition of sync points or the high volume of entities total.
  2. Changes to the cooldown buffer are infrequent (every second or so rather than every frame).
  3. Structural changes to the entities with SpellCooldown are also infrequent.
  4. There is no other system that uses ref DynamicBuffer in an Entities.ForEach that would break this optimization.
  5. A new component type can trivially be added to all entities with SpellCooldown, likely at authoring time.
  6. We can explore the world of IJobEntityBatch instead of Entities.ForEach.

It is with these assumptions that we can arrive at a solution that allows us to skip chunks without losing single-frame responsiveness. The solution is chunk components.

Chunk components are a super-weapon I have only recently learned to wield effectively. The best way to think of them is a per-chunk cache where you can store some metadata. For this case, I would have a chunk component like this:

struct ChunkSpellCooldown : IComponentData
{
    public int minTick;
    public bool allEmpty;
}

With this, you can determine whether or not you need to iterate through all the entities in a chunk using this formula:
needsIteration = ((!allEmpty && tick > minTick + 1000) || chunk.DidChange || chunk.DidOrderChange)

For chunks where you do iterate through all the entities, you just need to update the chunk component values when you are done.
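A sketch of how that might look in an IJobEntityBatch (SpellCooldown is assumed to be the buffer element from earlier in the thread; the handle names, tick type, and 1000-tick grace window are assumptions mirroring the posted code):

```csharp
struct PruneCooldownsJob : IJobEntityBatch
{
    public ComponentTypeHandle<ChunkSpellCooldown> chunkCooldownHandle;
    public BufferTypeHandle<SpellCooldown>         cooldownHandle;
    public uint                                    lastSystemVersion;
    public int                                     tick;

    public void Execute(ArchetypeChunk batchInChunk, int batchIndex)
    {
        var meta = batchInChunk.GetChunkComponentData(chunkCooldownHandle);
        bool needsIteration = (!meta.allEmpty && tick > meta.minTick + 1000)
            || batchInChunk.DidChange(cooldownHandle, lastSystemVersion)
            || batchInChunk.DidOrderChange(lastSystemVersion);
        if (!needsIteration)
            return; // whole chunk skipped without touching a single buffer

        var buffers  = batchInChunk.GetBufferAccessor(cooldownHandle);
        int  minTick  = int.MaxValue;
        bool allEmpty = true;
        for (int i = 0; i < buffers.Length; i++)
        {
            var cooldowns = buffers[i];
            for (int j = cooldowns.Length - 1; j >= 0; j--)
            {
                if (tick > cooldowns[j].endTick + 1000)
                    cooldowns.RemoveAt(j);
                else
                    minTick = math.min(minTick, cooldowns[j].endTick);
            }
            allEmpty &= cooldowns.Length == 0;
        }
        // Refresh the per-chunk cache so future frames can skip again.
        batchInChunk.SetChunkComponentData(chunkCooldownHandle,
            new ChunkSpellCooldown { minTick = minTick, allEmpty = allEmpty });
    }
}
```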

Anyways, definitely try implementing some of our suggestions and share your results. And maybe that will help me come up with more ideas for you. :smile:

1 Like

Awesome post! Thanks for putting so much brain and writing power into it. I was already on the verge of rewriting some systems to IJobEntityBatch, and this compels me to explore it further. The earlier-mentioned problem with the 3 phases, where 2 can be skipped, is a perfect use case.

I was too unspecific; of course the code matters. It just doesn’t matter in my particular test case, because all the buffers are empty and stay empty. No triggers, cooldowns, etc. are used in the stress test.

Hm, ok. How would you design this then? A set of entities either wants to cast a spell and is able to, or is unable because of a global cooldown. When possible, phase 2 starts with more complicated checks, and once those pass, phase 3 can do the rest of the work up to the point of spell instantiation.
Given the overhead of just starting these phases, I’m inclined to merge them back into 1 big job. Phase 2 often earlies out in the first line but still takes a lot of CPU time to execute. In the beginning it was one big job, but I wanted to explore this potential path of optimisation.

You probably know this: if I use a ref in an Entities.ForEach, it’ll always be written back even if there was no actual change, right? So the more refs I use in the query, the more CPU time it takes before and after the job executes. I hope this assumption is true; it would make sense. It was the reason I started splitting this up, because it was one big mess of mostly unused write refs.

I would start by asking a lot of questions. What does “wanting to cast a spell” entail? Is it a boolean on an ICD? Is it a list of entities? Is it some formula we have to calculate? Does it come from another entity in the hierarchy? What makes the global cooldown “global”? Is it a singleton? Is it a system variable? What initiates it? What ends it? What are “Phase 2” and “Phase 3”?

No. It is not true. The ref is the address of the actual component in chunk memory. Not touching the ref means not touching that memory (although the pointer still needs to get computed for it and the change version for the chunk still gets bumped).
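A sketch to illustrate the distinction (Translation is the Unity.Transforms component; the Velocity component is made up):

```csharp
Entities.ForEach((ref Translation translation, in Velocity velocity) =>
{
    // `translation` aliases the component in chunk memory; nothing is
    // copied in or written back around the job. But declaring the
    // parameter as `ref` at all bumps the chunk's change version for
    // Translation, even if this body never assigns through it.
    if (velocity.value.y != 0f)
        translation.Value += velocity.value;
}).ScheduleParallel();
```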

1 Like

Thanks for clearing up my wrong assumption. I can safely go back to one big job then, without messing around with flags and whatnot. I think that also answers the first question of how to design this, because I’ll try IJobEntityBatch with change filters. After all, the more chunks I can safely ignore, the more CPU time is freed from unnecessary checks that early-out of the job anyway, which was the main reason the system wasn’t scaling well.

@DreamingImLatios
I’m doing tests with DidChange, and I’ve reached the conclusion that the answer you gave me about ref in a system is either wrong or half true. I can’t prove that values are written back in every case, but what is still happening is that the chunk version is incremented, so the chunk is at least touched when using ref, even when nothing is written back. This destroys any purposeful logic one could apply with DidChange when using Entities.ForEach. Using CDFE works, though!

Let me explain with this code:

using System.Collections.Generic;
using Unity.Collections;
using Unity.Entities;
using UnityEngine;
using UnityEngine.InputSystem;

public class DidChangeTestSystem : SystemBase
{
    List<Entity> entityList = new List<Entity>();

    List<Entity> changeList = new List<Entity>();

    private int currentvalue = 1;

    EntityQuery query;

    protected override void OnCreate()
    {
        for (int i = 0; i < 500; i++)
        {
            var entity = EntityManager.CreateEntity();

            EntityManager.AddComponentData(entity, new ChangeTestComp1() { value = currentvalue });
            EntityManager.AddComponentData(entity, new ChangeTestComp2() { value = currentvalue });
            EntityManager.AddComponentData(entity, new ChangeTestComp3() { value = currentvalue });

            entityList.Add(entity);

            if (i < 100)
                changeList.Add(entity);
        }

        query = GetEntityQuery(typeof(ChangeTestComp1));
    }

    protected override void OnUpdate()
    {
        if (Keyboard.current[Key.Q].wasPressedThisFrame)
        {
            Debug.Log("Incrementing a value via EntityManager");

            currentvalue++;                        

            foreach (var entity in changeList)
            {
                EntityManager.SetComponentData(entity, new ChangeTestComp1() { value = currentvalue });
            }            
        }

        if (Keyboard.current[Key.E].wasPressedThisFrame)
        {
            Debug.Log("Incrementing a value via system");

            int changedCount = 0;
            currentvalue++;            

            Entities            
            .WithoutBurst()
            .ForEach((Entity entity, ref ChangeTestComp1 comp1) =>
            {                
                if (changedCount >= 100)
                    return;

                comp1.value = currentvalue;
                //Debug.Log("Incrementing on " + entity + " new value: " + comp1.value);
                changedCount++;
            }).Run();
        }

        if (Keyboard.current[Key.R].wasPressedThisFrame)
        {
            Debug.Log("Incrementing a value via CDFE");
            currentvalue++;    

            var comp1Lookup = GetComponentDataFromEntity<ChangeTestComp1>();        

            foreach (var entity in changeList)
            {
                var tmp = comp1Lookup[entity];
                tmp.value = currentvalue;
                comp1Lookup[entity] = tmp;
            }
        }

        // Entities
        // .WithChangeFilter<ChangeTestComp1>()
        // .WithoutBurst()
        // .ForEach((Entity entity, in ChangeTestComp1 comp1) =>
        // {
        //     Debug.Log("Change on " + entity + " new value: " + comp1.value);
        // }).Run();


        var changeJob = new DidChangeTestJob()
        {
            comp1Handle = GetComponentTypeHandle<ChangeTestComp1>(true),
            entitiesHandle = GetEntityTypeHandle(),
            LastSystemVersion = LastSystemVersion
        };

        Dependency = changeJob.ScheduleParallel(query, 1, Dependency);
    }

    public struct DidChangeTestJob : IJobEntityBatch
    {
        [ReadOnly]
        public EntityTypeHandle entitiesHandle;
        [ReadOnly]
        public ComponentTypeHandle<ChangeTestComp1> comp1Handle;

        public uint LastSystemVersion;

        public void Execute(ArchetypeChunk batchInChunk, int batchIndex)
        {
            if (batchInChunk.DidChange<ChangeTestComp1>(comp1Handle, LastSystemVersion))
            {
                var comp1s = batchInChunk.GetNativeArray(comp1Handle);
                var entities = batchInChunk.GetNativeArray(entitiesHandle);

                Debug.Log(string.Format("JOB - Change occurred in chunk {0}", batchIndex));

                for (var i = 0; i < comp1s.Length; i++)
                {
                    var comp1 = comp1s[i];
                    var entity = entities[i];

                    //Debug.Log(string.Format("JOB - Change on {0} new value: {1}", entity, comp1.value));
                }
            }
        }
    }

    public struct ChangeTestComp1 : IComponentData
    {
        public int value;
    }

    public struct ChangeTestComp2 : IComponentData
    {
        public int value;
        public int value2;
        public int value3;
        public int value4;
        public int value5;
        public int value6;
        public int value7;
    }

    public struct ChangeTestComp3 : IComponentData
    {
        public int value;
        public int value2;
        public int value3;
        public int value4;
        public int value5;
        public int value6;
        public int value7;
    }
}

240 entities fit into one chunk of this archetype. 500 are created, which results in 3 chunks.
Pressing the Q key, the EntityManager changes the value on the first 100 entities that were created.
Pressing the E key, a system iterates over all ChangeTestComp1 and the first 100 change their value.
Pressing the R key, the same 100 entities are changed, only this time with CDFE.

EntityManager and CDFE work as expected and only 1 chunk is changed.
Entities.ForEach does not: every chunk version is incremented, resulting in iteration over all 3 chunks.

1 Like

That’s what I meant by this:

And why I mentioned this:

So no values are ever “written back”, because when you write to the ref, you are writing directly to the component in memory. Unity has no way of knowing whether you actually did anything with the ref, so when you ask for the ref, it always bumps the change version, which causes future DidChange queries to return true.

1 Like

Ah, alright. After testing, it’s a lot clearer to me what you meant. I didn’t understand the implication of the chunk version, how it works with DidChange or WithChangeFilter, and that they are really the same thing. I had hoped the version integer of an entity would be used. The way it works seems very crude and, not to offend anyone, somewhat useless in practice.

Another selective write test, with an IJobEntityBatch, results in the same thing. When you provide a ComponentTypeHandle and use batchInChunk.GetNativeArray, the chunk version gets incremented regardless of whether you write anything or not. A double-buffer setup with a read-only handle and a read/write handle doesn’t work. Is there a workaround? I think in future versions using the same component as both read-only and read/write should work; that way chunk versions would only get bumped on real writes. I’m not sure why it doesn’t. It seems like a safety measure that’s getting in the way.

So the conclusion is, DidChange and WithChangeFilter work for the following methods of writing data:

  • EntityManager.SetComponent
  • SystemBase.SetComponent, same as CDFE
  • EntityCommandBuffer

Everything else doesn’t play nice with it, which makes it really hard to use correctly. Performance is of course best with ref: iterating with ref is faster than optimizing with CDFE and a change filter. This write test was done on 500k entities; a ref set is blazing fast even on the main thread.

This works for me: I use [NativeDisableContainerSafetyRestriction] on the write handle and [ReadOnly] on the read handle (plus setting the correct flags when requesting them in OnUpdate).
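The described workaround might look roughly like this (a sketch only; the job and component names are illustrative, and the handles would be requested in OnUpdate with GetComponentTypeHandle&lt;ChangeTestComp1&gt;(true) for the read handle and (false) for the write handle):

```csharp
struct SelectiveWriteJob : IJobEntityBatch
{
    [ReadOnly]
    public ComponentTypeHandle<ChangeTestComp1> readHandle;

    // Bypasses the aliasing restriction so the same component type can be
    // requested both read-only and read/write in one job.
    [NativeDisableContainerSafetyRestriction]
    public ComponentTypeHandle<ChangeTestComp1> writeHandle;

    public void Execute(ArchetypeChunk batchInChunk, int batchIndex)
    {
        // Reading through the read-only handle does not bump the chunk's
        // change version...
        var reads = batchInChunk.GetNativeArray(readHandle);
        for (int i = 0; i < reads.Length; i++)
        {
            if (reads[i].value < 0)
            {
                // ...and the write handle is only touched (bumping the
                // version) when there is actually something to write.
                var writes = batchInChunk.GetNativeArray(writeHandle);
                for (; i < writes.Length; i++)
                {
                    var c = writes[i];
                    if (c.value < 0)
                        c.value = 0;
                    writes[i] = c;
                }
                return;
            }
        }
    }
}
```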

1 Like