Best way to cache and re-access entity queries in the same frame?

I have a system using SystemBase from Entities 0.11

I manually call Update on this system multiple times in a frame (it's part of a simulation, and a simulation may require multiple updates per frame).

The Entities.ForEach is too costly, especially when the system may only have 1 entity to work on per update. Even with an early return, when no work is being done, calling the ForEach adds up.

I'm looking for recommendations on how to cache the ForEach query once per frame and reuse it.
It would work like this:

  • PreSimulation -> system calls and caches an array of EntityQuery.CreateArchetypeChunkArray
  • SimulationTick -> system Update is triggered, it may manually iterate the array and access components using GetArchetypeChunkComponentType or maybe even create a (non-parallel) job
  • Repeat SimulationTick x times.
  • PostSimulation -> system disposes of the NativeArrays and anything else.

Note:

  • I do not expect new entities to be introduced during this cycle.
  • I do not expect components to be removed or added during this cycle.
  • I do expect to modify component data during this cycle.
  • I want to keep the option to Burst where possible, but I'd probably leave it to the user.

Does this sound reasonable? Any suggestions for alternatives besides CreateArchetypeChunkArray?
And while I'm here: how do you modify component data while using chunks? Can you not get a reference to that component somehow, to avoid calling SetComponent?


The NativeArray returned by ArchetypeChunk.GetNativeArray() is the direct memory of the components in the chunk, not a copy. That means if you write to the NativeArray, you are writing to the entities' data.
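For illustration, a minimal main-thread sketch of that read-modify-write pattern (MyData, _query, and the 64-entity workload are made up; on older Entities versions the handle call is GetArchetypeChunkComponentType rather than GetComponentTypeHandle):

```csharp
// Inside a SystemBase; _query is assumed to match the hypothetical MyData component.
var handle = GetComponentTypeHandle<MyData>();
var chunks = _query.CreateArchetypeChunkArray(Allocator.TempJob);
for (int c = 0; c < chunks.Length; c++)
{
    // GetNativeArray aliases the chunk memory: no copy is made.
    NativeArray<MyData> data = chunks[c].GetNativeArray(handle);
    for (int i = 0; i < data.Length; i++)
    {
        var d = data[i];   // the indexer returns a struct copy,
        d.Value += 1f;     // so modify the copy...
        data[i] = d;       // ...and assign back, which writes into the chunk.
    }
}
chunks.Dispose();
```

The assign-back step is needed because the NativeArray indexer returns the struct by value, but the assignment itself lands directly in chunk memory.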

I'm surprised the Entities.ForEach is so costly for you. I would love to see a profiler timeline snapshot. How many times are you updating the system per frame? But if this is truly the bottleneck, you found the right API. You can pass that array and the type handles into an IJobFor with [BurstCompile] if you need parallel Bursted jobs.

Oh cool, so if I access the index of the chunk with the accessor, it's a reference? That's awesome.

I can update it anywhere from 1 to 32 times a frame (worst case). At 5 updates on a Pixel 2 (Android), it can take 0.8ms with no processing at all. Multiply that by 25+ systems and my game is struggling to stay smooth, unfortunately (hitting around 30ms just for the simulation systems). Most of these systems just check a component and early out. It's rough!

FYI, you can use Entities.WithStoreEntityQueryInField(ref EntityQuery).ForEach().
That way you can use the query in OnUpdate() or OnCreate(), even before this Entities.ForEach() runs,
because generation of the query is codegened into a function that runs before OnCreate().
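A hedged sketch of what that might look like (MyData is a made-up component, and the early-out check is my own addition):

```csharp
public class StoredQuerySystem : SystemBase
{
    // Populated by codegen before OnCreate runs, so it is usable early.
    private EntityQuery _query;

    protected override void OnUpdate()
    {
        // Cheap early-out using the stored query, before the ForEach body.
        if (_query.CalculateEntityCount() == 0)
            return;

        Entities
            .WithStoreEntityQueryInField(ref _query)
            .ForEach((ref MyData d) => { d.Value += 1f; })
            .Run();
    }
}
```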

1 Like


Ah. Mobile. That explains a bit.
Other things to watch out for: there's some fancy checking Unity does before running a system. You might be able to reduce that cost using [AlwaysUpdateSystem], which early-outs that process. Also, job scheduling can be expensive, so using Run() will bypass that overhead while still using Burst. Lastly, if you can compute whether a system needs to run before it runs and don't rely on OnStartRunning or OnStopRunning, you can manually update systems and just not call Update when you don't need them to run.
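A hypothetical sketch combining those two suggestions (MyData is a made-up component):

```csharp
[AlwaysUpdateSystem]            // skip the should-this-system-run checks
public class SimTickSystem : SystemBase
{
    protected override void OnUpdate()
    {
        // Run() executes immediately on the main thread: no job
        // scheduling cost, and the lambda can still be Burst-compiled.
        Entities.ForEach((ref MyData d) =>
        {
            if (d.Value == 0f)
                return;         // cheap per-entity early-out
            d.Value += 1f;
        }).Run();
    }
}
```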

1 Like

Just to follow up: I got a working version using the chunks, but boy is the code long and annoying. Rewriting 45+ systems like this, plus future systems, was too much to do. I plan to consider a "controller" or something that will manually update only the systems it thinks need to run instead.

[quote]
The Entities.ForEach is too costly, especially when the system may only have 1 entity to work on per update. Even with an early return, when no work is being done, calling the ForEach adds up.
[/quote]
Entities.ForEach().Run() using SystemBase is very fast / very low overhead, in particular for very small entity counts.
It is much faster than asking a query to allocate an archetype chunk array and processing it manually.

So please continue to write simple code, in this case it is also the fastest.

2 Likes

In my use case it's faster to cache the chunk array for multiple updates rather than using .ForEach.Run multiple times. I've also had Unity devs straight-out say that ForEach has overhead issues.

That is outdated information. SystemBase.Entities.ForEach.Run() is definitely faster than allocating queries.

If you feel that this doesn't match what you are seeing, please write a simple loop in your game, measure it and show the comparison results here.

It is not faster when you compare a single query against multiple queries in a single frame. If you call ForEach repeatedly vs caching the chunks and repeatedly doing the work, the latter is faster.

Create 20-50 systems in a world, each with different queries in a ForEach.

Try to call update on each system 5-10 times in a frame.

Then convert one or more systems away from ForEach to using chunks. So no more ForEach.

You will see the non-ForEach systems speed up by 10% or more, assuming that any Bursted ForEach workloads are converted to Burst jobs too, of course.

Please post the sample code you used for both versions and the performance numbers you measured based on it.

Well, since @OmegaNemesis28 doesn't want to provide measurements, I did it for him, to prove that @Joachim_Ante_1's words are correct.

Two systems for the tests. The first uses a Burst-compiled IJob to iterate cached chunks in OnUpdate and increment one component value. We cache the chunks only once per measurement iteration (let's treat this as a frame), then update the system 100 times. The second system just uses Entities.ForEach, without any explicit optimisations, on every OnUpdate call and also increments one component value; we likewise update it 100 times per measurement iteration.

using Unity.Burst;
using Unity.Collections;
using Unity.Entities;
using Unity.Jobs;

namespace Tests
{
    public struct ComponentForCache : IComponentData
    {
        public float Value;
    }

    public struct ComponentForForEach : IComponentData
    {
        public float Value;
    }

    [DisableAutoCreation]
    public class CachedChunksSystem : SystemBase
    {
        private NativeArray<ArchetypeChunk> _cachedChunks;
        private EntityQuery                 _queryToCache;

        public void CacheChunks()
        {
            _queryToCache = GetEntityQuery(typeof(ComponentForCache));
            _cachedChunks = _queryToCache.CreateArchetypeChunkArray(Allocator.TempJob);
        }

        public void ClearCache()
        {
            if (_cachedChunks.IsCreated)
                _cachedChunks.Dispose();
        }

        [BurstCompile]
        private struct IterateCachedChunksJob : IJob
        {
            public NativeArray<ArchetypeChunk>            CachedChunks;
            public ComponentTypeHandle<ComponentForCache> ComponentForCacheType;

            public void Execute()
            {
                for (int i = 0; i < CachedChunks.Length; i++)
                {
                    var componentArray = CachedChunks[i].GetNativeArray(ComponentForCacheType);

                    for (int j = 0; j < componentArray.Length; j++)
                    {
                        var updatedValue = componentArray[j];
                        updatedValue.Value += 1.5f;
                        componentArray[j]  =  updatedValue;
                    }
                }
            }
        }

        protected override void OnUpdate()
        {
            new IterateCachedChunksJob()
            {
                CachedChunks          = _cachedChunks,
                ComponentForCacheType = GetComponentTypeHandle<ComponentForCache>()
            }.Run();
        }
    }

    public class ForEachSystem : SystemBase
    {
        protected override void OnUpdate()
        {
            Entities.ForEach((ref ComponentForForEach componentData) =>
            {
                componentData.Value += 1.5f;
            }).Run();
        }
    }
}

Performance test with warmups for clean numbers. Synchronous compilation for Burst enabled; safety checks, leak detection, and the jobs debugger all disabled. 1000 measurements, 100 iterations per measurement, each iteration calls system Update 100 times, 10000 entities.

using NUnit.Framework;
using Unity.Entities;
using Unity.PerformanceTesting;

namespace Tests
{
    public class PerformanceTestGathering
    {
        [Test, Performance]
        public void CachedChunksPerformance()
        {
            InitializeTestWorld<CachedChunksSystem, ComponentForCache>(10000);

            var systemWarmup = _testWorld.GetExistingSystem<CachedChunksSystem>();
            systemWarmup.CacheChunks();
            systemWarmup.Update();
            systemWarmup.ClearCache();

            Measure.Method(() =>
            {
                var system = _testWorld.GetExistingSystem<CachedChunksSystem>();
                system.CacheChunks();
                for (int i = 0; i < 100; i++)
                {
                    system.Update();
                }
                system.ClearCache();
            })
            .MeasurementCount(1000)
            .IterationsPerMeasurement(100)
            .SampleGroup("CachedChunksPerformance")
            .Run();

            DisposeTestWorld();
        }

        [Test, Performance]
        public void ForEachPerformance()
        {
            InitializeTestWorld<ForEachSystem, ComponentForForEach>(10000);

            var systemWarmup = _testWorld.GetExistingSystem<ForEachSystem>();
            systemWarmup.Update();

            Measure.Method(() =>
            {
                var system = _testWorld.GetExistingSystem<ForEachSystem>();
                for (int i = 0; i < 100; i++)
                {
                    system.Update();
                }
            })
            .MeasurementCount(1000)
            .IterationsPerMeasurement(100)
            .SampleGroup("ForEachPerformance")
            .Run();
            DisposeTestWorld();
        }

        private World _testWorld;

        private void InitializeTestWorld<TSystem, TComponent>(int entitiesCount)
            where TSystem : SystemBase, new() where TComponent : IComponentData
        {
            _testWorld = new World("Performance Test World");
            var simulationGroup = _testWorld.GetOrCreateSystem<SimulationSystemGroup>();
            var system          = _testWorld.GetOrCreateSystem<TSystem>();
            simulationGroup.AddSystemToUpdateList(system);
            simulationGroup.SortSystems();

            var entityArchetype = _testWorld.EntityManager.CreateArchetype(typeof(TComponent));
            for (var i = 0; i < entitiesCount; i++)
            {
                _testWorld.EntityManager.CreateEntity(entityArchetype);
            }
        }

        private void DisposeTestWorld()
        {
            _testWorld.Dispose();
        }
    }
}

And the results, where you can see that ForEach is faster than manually caching and iterating chunks (0.17ms median against 0.26ms median).
ForEach:
[screenshot: ForEach performance test results]

Caching and iterating chunks:
[screenshot: cached-chunks performance test results]

Not to mention that the cached version requires much more code.

7 Likes

Thanks Eizenhorn.

On top of this, caching the ArchetypeChunk array like this is not safe, and you have to have code that invalidates the cache when structural changes occur. You can do this using EntityManager.Version. In practice, however, caching an archetype chunk array over multiple frames is quite an unrealistic expectation. Most games instantiate / destroy at least a couple of entities every frame, meaning that such caching is essentially completely pointless. And that is when performance goes from being equal in the case of Entities.ForEach to being significantly better when using Entities.ForEach.
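A hypothetical sketch of that invalidation idea, using EntityManager.Version as suggested (the field and method names are made up, and this assumes a SystemBase where EntityManager is available as a property):

```csharp
private NativeArray<ArchetypeChunk> _cachedChunks;
private int _cachedVersion = -1;

// Re-create the cache whenever a structural change has bumped the version;
// otherwise hand back the existing array.
private NativeArray<ArchetypeChunk> GetChunks(EntityQuery query)
{
    if (!_cachedChunks.IsCreated || _cachedVersion != EntityManager.Version)
    {
        if (_cachedChunks.IsCreated)
            _cachedChunks.Dispose();
        _cachedChunks = query.CreateArchetypeChunkArray(Allocator.Persistent);
        _cachedVersion = EntityManager.Version;
    }
    return _cachedChunks;
}
```

The cached array would still need to be disposed when the system shuts down.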

2 Likes

Watch your words. I never said I don't want to provide measurements. You can't just ask for a repro and expect a sample project overnight; I have better things to do. You swoop into the thread less than a few hours after the last post and act like I'm refusing to share information or something? That's just rude.

These numbers do not match mine.
I haven't drilled into your code, but

  • just looking at it briefly shows you only have 1 system and a SUPER simple use case with 1 component, no filters or anything complex.
  • I also only see 1 job, where every system would be creating different jobs with various workloads too. Some may be burst, some may not be.
  • You're updating the system in loops sequentially. When in reality, they'd be updated along all the other systems in a master loop. This can easily invalidate your numbers.
  • You have 10000 entities, I think? What do you get with 1 entity (or up to 10), which is more in line with my use case?

This is not a proper comparison; it's very barebones and very naive.

[quote]
Don't even mention that cache version require much more code.
[/quote]

Yes, I stated this before, which is why I do not like it. But necessary measures may mean I have to do this now, since my performance can't suffer this much.

[quote]
In practice however, caching an archetype chunk array over multiple frames is a quite an unrealistic expectation.
[/quote]

No one is caching archetype chunk arrays over multiple frames. Unless his code is and I haven't read it thoroughly enough, which means it's even more of an invalid comparison. I thought I stated this before. Same with the "Meaning that such caching is essentially completely pointless" statement. No one is maintaining a cache across frames with new entities. I specified my use case earlier.

According to @Micz84's test, caching read-only native container data in a temp container (even stackalloc) only makes it slower, as Burst is doing an excellent job in that case.
Caching ReadWrite/WriteOnly data in a stackalloc memory block and writing the data back in batch with MemCpy will be faster than setting data one by one directly to container memory.
https://discussions.unity.com/t/813348
And a Chunk is also a NativeContainer.
Burst can somehow keep data in the cache as much as possible.
That one extra MemCpy to cache data will only make it slower.
But if cached data is used sparsely across several systems that access large chunks of memory over different locations, Burst probably would not be able to help.
In that case, caching chunk data is unsafe, as the data in the chunk could have been updated.

Because of all the impatience, I did something super quick, and it still doesn't cover my use case at the worst possible scenario. This is a super simplified workload and I still see that ForEach is slow. Like I said, caching the chunks ends up being just as fast or faster on my devices. The only exception is if I flag [AlwaysUpdateSystem] on the system; this speeds it up considerably, but I need to investigate whether I can use that in my actual code.

Info/Conditions:

  • com.unity.entities@0.11.1-preview.4

  • 10 entities (all match the use case, this is too optimal/naive for real performance numbers, it would actually be worse with more entities in a real use scenario)

  • 10 frames of updates

  • 64 ticks of Update per frame

  • I only have 6 ECS systems, all doing largely the same thing and poking at the same memory/entities. (Also too optimal/naive. It would be nice if I could create a bunch of systems of the same type doing dummy work, but annoyingly ECS worlds are type-keyed, I think, which means it's 1:1; I would have to create dummy classes.)

  • Burst compiling = on
    Job safety checks = off

  • Notice the code of the "jobs" all has early-out conditions, and out of the box none of the entities actually end up doing anything (it sees value == 0.0f and returns/continues in the loop)

Method:
Attach profiler, record, press button to begin test, wait for cube to disappear, stop profiling. Open profile analyzer, pull data, highlight the test (it will be one big block of frame time, something like 10-20 frames to highlight), use name filter "test." with the period at the end.

Unity 2019.4.7f1 In-Editor Windows

  • UpdatePretendEntities (no ECS, just MonoBeh) = 0.02ms (this is the ideal performance)
  • ForEachSystem.OnUpdate = 0.23ms
  • AlwaysUpdateForEachSystem.OnUpdate = 0.17ms
  • ForEachSystemNoBurst.OnUpdate = 0.30ms
  • ChunksSystem.OnUpdate = 0.07ms + 0.11ms (for PreLoop to cache) = 0.18ms

  • ChunksJobSystem.OnUpdate = 0.13ms + 0.11ms (for PreLoop to cache) = 0.24ms

  • ChunkJobSystemNoBurst.OnUpdate = 0.30ms + 0.11ms (for PreLoop to cache) = 0.41ms

Android Pixel 2

  • UpdatePretendEntities (no ECS, just MonoBeh) = 0.16ms (this is the ideal performance)
  • ForEachSystem.OnUpdate = 0.88ms
  • AlwaysUpdateForEachSystem.OnUpdate = 0.74ms
  • ForEachSystemNoBurst.OnUpdate = 1.03ms
  • ChunksSystem.OnUpdate = 0.37ms + 0.47ms (for PreLoop to cache) + 0.01 (for PostLoop to dispose) = 0.85ms

  • ChunksJobSystem.OnUpdate = 0.70ms + 0.47ms (for PreLoop to cache) + 0.01 (for PostLoop to dispose) = 1.02ms

  • ChunkJobSystemNoBurst.OnUpdate = 1.03ms + 0.47ms (for PreLoop to cache) + 0.01 (for PostLoop to dispose) = 1.51ms

The quickest takeaway is that ECS here kills performance whether you're using ForEach or chunks; MonoBehaviours win. Of course this is just with 10 entities rather than a million, but like I said, that's how my game operates right now. There's usually only 1 entity these systems look at. My game does not and will not have many entities; it's not a battle royale or anything.

For 1 system to take 0.88ms is crazy to me. Yes, it's unusual to call Update on the system 64 times. But even at 1/4 of that, it shouldn't be breaching 0.20ms, especially when the systems are not actually doing work (just a conditional check). I have 70+ systems now that have to do this every frame because it's a simulation; that's 61.6ms at minimum (assuming 64Hz) :(
Non-ECS would be 11.2ms for comparison.

Disclaimer: besides the test favoring ForEach for several noted reasons, it is worth mentioning I could have made mistakes here. I rushed this since I didn't appreciate the rudeness I perceived. There are lots of ways to make the test closer to my use case, such as adding lots more systems, mixing and matching Burst use, adding more entity archetype variation for the chunks, not using the entities sequentially, and introducing mixed branching of logic.

Profile files:
https://www.dropbox.com/s/qz6d8piw512d2fq/profiles.zip?dl=0

Project/Code here:
https://www.dropbox.com/s/3ii8llny18pcxwr/test.zip?dl=0

How outdated, by the way? Less than a month? In the Unity Slack channel for DOTS it was recent, beginning of October I think. @Joachim_Ante_1

[quote=“Lieene-Guo”, post:15, topic: 811496]
According to @Micz84's test, caching read-only native container data in a temp container (even stackalloc) only makes it slower, as Burst is doing an excellent job in that case.
Caching ReadWrite/WriteOnly data in a stackalloc memory block and writing the data back in batch with MemCpy will be faster than setting data one by one directly to container memory.
https://discussions.unity.com/t/813348
And a Chunk is also a NativeContainer.
Burst can somehow keep data in the cache as much as possible.
That one extra MemCpy to cache data will only make it slower.
But if cached data is used sparsely across several systems that access large chunks of memory over different locations, Burst probably would not be able to help.
In that case, caching chunk data is unsafe, as the data in the chunk could have been updated.
[/quote]

The perf of that makes sense to me, kind of. Do note that not everything in this can be Bursted, though. It's up to the user, but a lot of these systems can't be Bursted due to poor design decisions outside of my control (long story). But I've observed this even with Burst. The Burst jobs could literally be looking at 1 entity, checking 1 float, and returning, and I've seen them end up taking 0.10ms, which is killer.


Looks like there are two major reasons:
1. SystemBase pre-update checks (query count check, required singleton check, blah blah...)
2. Job schedule overhead.

For reason 1, I'm waiting for the unmanaged system. A Bursted unmanaged system will be much faster.
For reason 2, a manual entity count/chunk count check and a Run/Schedule/ScheduleParallel switch could make it better.
Generally, as the game designer, you should be aware of what type of entity is rare and can be updated with Run, and what should be batched with ScheduleParallel.
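A sketch of that count-based dispatch (the threshold, _query, and MyData are all made up for illustration):

```csharp
protected override void OnUpdate()
{
    int count = _query.CalculateEntityCount();
    if (count == 0)
        return;                 // nothing to do: skip entirely

    if (count < 64)
    {
        // rare entities: main-thread Run avoids scheduling overhead
        Entities.ForEach((ref MyData d) => { d.Value += 1f; }).Run();
    }
    else
    {
        // big batches: worth paying the scheduling cost to go wide
        Entities.ForEach((ref MyData d) => { d.Value += 1f; }).ScheduleParallel();
    }
}
```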

By adding the [AlwaysUpdateSystem] attribute, it is just up to you to decide if the job should Run/Schedule/ScheduleParallel or be skipped totally.

I am not sure if ECS is a good approach if you have only a handful of entities to deal with. ECS shines with a high volume of data. There is some small overhead in running systems, but for a large count of entities it is negligible. Also, you may not need to run every system in every frame.

But maybe instead stick with jobs and Burst?

1 Like

As Lieene said, the solution here is:

unmanaged ISystemBase, which allows fully Bursted update calls. We have done a lot of refactoring in Entities to allow for this in the last couple of releases (EntityManager / SystemState / EntityQuery etc. are all structs and Burstable now...).
What is missing is code-gen for Entities.ForEach, but we are almost there with that too.

This will significantly reduce the overhead of System.OnUpdate including the cost of invoking Entities.ForEach.Run

Our intention here is very much that a single entity + single system OnUpdate + ForEach.Run should be the same as or better than the cost of MonoBehaviour.Update. Obviously where the benefits of DOTS kick in is in having more than one of a thing, but we fully realise that there are plenty of cases in games where there is just one of a thing, and the minimum bar for that is that it is no worse than MonoBehaviour.Update. (Let's note, however, that your MonoBehaviour example has a direct function call to a Test method in an inner loop. That is definitely not how MonoBehaviour.Update works and is not a fair comparison.)

It's possible that if you cache the ArchetypeChunk array and then reuse it 64 times, you can get some speedups.
I wouldn't recommend refactoring a bunch of code to such a pattern unless you are shipping very soon and you absolutely need exactly those speedups right now.

In this case, you probably just want to trust me that this particular codepath will become much better optimised in the next coming months.

9 Likes