Roadblocks to writing performant pure ECS, open to suggestions

Hey everyone!

I’ve been silently following these forums collecting all the information I could about Unity’s ECS implementation. But every time I try to sit down and build something in pure ECS I run into roadblocks and have had to resort to hybrid ECS despite not really needing any of Unity’s default components.

I ordered these roadblocks from pettiest to most crippling, so skip to the end if you don’t have much time but want to be helpful! I would really appreciate tips to get around these issues and patch my misunderstandings!

Roadblock 1: bool and enum
I’m really only putting these here for completeness’s sake. I know how to work around them, but I have friends who would try to use these, get compile errors, follow a hybrid ECS example, see that their code works now, and then later wonder why their performance is not great.

Roadblock 2: False Sharing
If I were to only write safe C# and use Unity’s ECS APIs, would I ever still have to worry about this? If not, where is this prevented? In Burst? In IJobParallelFor? Something else? In C++ I usually have to do a bunch of compile-time magic to avoid this, so it’s a question I would rather know the answer to than wait to bite me later in development.

Roadblock 3: ChangedFilter
Most of the time when I want to use ChangedFilter, it is because either the operation should only take place if the data actually changed, or because the calculation is really heavy. But because ChangedFilter only applies to chunks, I have to keep a copy of the data from the previous frame to compare against. That by itself isn’t an issue, because I can keep an ISystemStateComponentData to track that, and I usually care about what the previous frame’s value was anyways. So for ints and floats, this is optimal. But when I need to react to a float4x4 or a struct with several float3 fields, things start to become sub-optimal.

Possible Solution?
It would be nice to have a component type that would carry an internal flag as to whether or not it was changed. It would be similar to how Hi-Z + Early-Z work in graphics land.

Roadblock 4: Sparsing
This is an issue that really only pops up due to me trying to work around other issues. But long story short, between my magic system and rendering, all my entities are scattered over many sparse chunks and now my collision detection and movement systems are running a fair bit slower.

Possible Solution?
I can think of a solution where I have entities that just have components containing sub-entities for each of the different engine systems (magic system, collision system, movement system, rendering system, etc.). That way, each engine system has its own set of archetypes it can optimize for without compromising the performance of other engine systems. However, now I have a bunch of different entity IDs that are all actually the same entity. Is this the best solution?

Roadblock 5: Memcpy everywhere
So the traditional way to change the state and behavior of an entity in an ECS is to add or remove components. But every time you do this, the entity has to be copied out of the chunk into a new one. For small, lightweight games that most people have tested so far, this is plenty fast and easy. But for larger games where we could expect a couple hundred components per entity with an average size of float3 per component, now every time a new component gets added or removed, that’s over 1 kB that needs to be copied. Multiply that per system per modified entity per frame, and suddenly there’s a performance concern. I’m not sure if DynamicBuffer makes this worse.

Possible Solution?
Splitting entities the way I described earlier is one way I could solve this. But that requires good game architectural planning. Something simpler would be a hot/cold split with something like an IDynamicComponentData. One way to do this would be to create a separate chunk for these dynamic components. Since most of these components are tags anyways, copying them around would be fairly cheap. It would break vectorization as for each dynamic entity you’d have to check if the cold data entity is the same, but most likely the number of clock cycles wasted on that iteration would be less than the memcpy. I could also imagine restricting the dynamic components to just empty tags and storing them as 1 bit booleans per entity in the chunk. The masking checks could be really fast!
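To make the last idea concrete, here is a minimal sketch in plain C#, assuming each entity slot in a chunk carries one 64-bit mask with one bit per tag type. The names Tag, TagChunk, Add, Remove, and Query are hypothetical and not part of Unity's ECS. It only illustrates that toggling a tag becomes a bit operation in place instead of a move to another chunk.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tag set: one bit per tag type.
[Flags]
public enum Tag : ulong
{
    None    = 0,
    Burning = 1UL << 0,
    Frozen  = 1UL << 1,
    Stunned = 1UL << 2
}

// Hypothetical chunk-side storage: one mask per entity slot.
public sealed class TagChunk
{
    private readonly ulong[] _bits;

    public TagChunk(int capacity) { _bits = new ulong[capacity]; }

    // Adding or removing a tag flips a bit; the entity stays in its slot.
    public void Add(int slot, Tag tag)    { _bits[slot] |= (ulong)tag; }
    public void Remove(int slot, Tag tag) { _bits[slot] &= ~(ulong)tag; }

    // Returns the slots whose mask contains every bit in 'all'.
    public List<int> Query(Tag all)
    {
        var result = new List<int>();
        ulong mask = (ulong)all;
        for (int i = 0; i < _bits.Length; i++)
        {
            if ((_bits[i] & mask) == mask) result.Add(i);
        }
        return result;
    }
}
```

The trade-off is the one described above: a system iterating the chunk has to test the mask per entity, so the loop is no longer a straight run over every slot.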

The killer
Roadblock 6: Dependency confusion with jobs
This one is an absolute show-stopper for me! Please help!

The problem I have is that JobComponentSystem executes in two different points in time. It executes for the setup and scheduling of the jobs, and it executes the job itself. Now each of these execution points could have dependencies on Component Systems, the setup portions of JobComponentSystems, and the actual jobs themselves. I want to specify these dependencies properly with the minimal amount of known information and code. Here’s an example case:

I need to write a ScatterMove JobComponentSystem. In the job setup on the main thread, the ScatterMove system checks a PanicTimer component on an entity that was written by a ComponentSystem. If the timer is less than 20 seconds, I schedule the ChaosScatter Job. Otherwise I schedule the SmoothScatter Job. Both of these jobs take Position components, Velocity components, and a separate NativeArray which is a lookup table used to create the scattering patterns. The Position and Velocity components were written to by other jobs earlier in the frame. The NativeArray is picked from a List of ScriptableObjects (you could also think of this as a NativeList of entities with SCDs, works the same way) based on an index calculated by the UpdateMood JobComponentSystem which uses a bunch of game statistics to eventually write a singular index on a singular component.

So the ScatterMove scheduling on the main thread requires the PanicTimer value and the Mood index, but the jobs processing the position and velocity components don’t have to be finished.

The job only needs to run after the Position and Velocity jobs are finished, but doesn’t need to lock the PanicTimer or the Mood components while it is processing.

How would I schedule this? And would I have to know exactly which systems were the last to touch Mood, PanicTimer, Position, and Velocity?

Possible Solution?
If I had to build my own ECS, I probably would make it a rule that all automatic dependency tracking only worked on components. So any data shared between systems would have to be stored in components. I would need more Native types than just DynamicBuffer to be storable in components and accessible in jobs. Then I would have attributes to specify a group the ScatterMove system would run in, some ordering attributes for within the group, and then a list of component group dependencies for the scheduling as well as another list of component group dependencies for the actual job. All I would have to do is look at these dependencies at startup and build two dependency graphs. One to construct the player loop and the other to manage JobHandles.

Do any of my questions make sense? I can sketch up images and post code samples if that helps, but I didn’t want to make this initial post any longer than it already is.

Roadblock 6: Dependency confusion with jobs
If you wanted to put together a simple example project that demonstrates your example case exactly, I could take a look and give more precise feedback.

First of all, that’s some great feedback. Thank you.

Agreed. We will get that fixed soon. I agree it’s quite annoying and also pointless. We went with the default definition of “blittable”, but that definition in C# is wrong for our purposes.

  • ArchetypeChunk APIs have zero false sharing.
  • IJobProcessComponentData in the current release splits work based on iteration index, meaning that the same chunk might get modified by two different jobs. This has already been fixed on master, and in the next release there is zero false sharing in IJobProcessComponentData.
  • ComponentDataArray has false sharing and should generally be avoided if you want the best performance.
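The difference between splitting by iteration index and splitting by chunk can be pictured with a toy model. This is plain C# with hypothetical names (WorkSplitSketch and its methods), not Unity's scheduler. It counts how many chunks end up being touched by more than one batch, which is where writes from two workers can land on the same cache line.

```csharp
using System;
using System.Collections.Generic;

// Toy model only: hypothetical names, not Unity's job scheduler.
public static class WorkSplitSketch
{
    // Split [0, total) into fixed-size batches by iteration index.
    public static List<(int Start, int End)> SplitByIndex(int total, int batchSize)
    {
        var ranges = new List<(int Start, int End)>();
        for (int start = 0; start < total; start += batchSize)
        {
            ranges.Add((start, Math.Min(start + batchSize, total)));
        }
        return ranges;
    }

    // Split [0, total) so that every batch covers exactly one chunk.
    public static List<(int Start, int End)> SplitByChunk(int total, int chunkCapacity)
    {
        return SplitByIndex(total, chunkCapacity);
    }

    // Count the chunks that are touched by more than one batch.
    public static int CountSharedChunks(List<(int Start, int End)> ranges, int chunkCapacity)
    {
        var touches = new Dictionary<int, int>();
        foreach (var (start, end) in ranges)
        {
            if (end <= start) continue;
            int firstChunk = start / chunkCapacity;
            int lastChunk = (end - 1) / chunkCapacity;
            for (int c = firstChunk; c <= lastChunk; c++)
            {
                touches[c] = touches.TryGetValue(c, out int n) ? n + 1 : 1;
            }
        }

        int shared = 0;
        foreach (int count in touches.Values)
        {
            if (count > 1) shared++;
        }
        return shared;
    }
}
```

With 10 entities, a chunk capacity of 4, and a batch size of 3, every one of the three chunks is touched by two batches. Splitting on chunk boundaries brings that count to zero.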

We looked at doing change tracking in different ways.

Our conclusion was that for the majority of cases, simply doing an early out and then efficiently reprocessing the full chunk is fastest when you take into account all the overhead of tracking exact changes.

So for the type of code we are writing, we have not found any need for precise change tracking yet. I understand that sometimes you need to do large, heavy work on change; in that case it’s useful to have.

Our implementation for precise tracking would probably do a comparison against a copy stored in an ISystemStateComponentData component, as you propose. So at least in terms of performance, that is possible to do right now. It’s just not very convenient. Perhaps we should provide a special convenience layer for precise reactive system code that does the comparison for you. But overall I don’t think this is a blocker. If you use Unity.Mathematics to do SIMD & branchless comparisons, and combine it with early outs per chunk, I am pretty sure that gives you quite scalable performance.
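The combination of a per-chunk early out and a per-entity comparison against a stored copy can be sketched in plain C# as follows. Everything here is an assumption for illustration: Pose, PoseChunk, and CollectChanged are hypothetical names, the Previous array stands in for the copy kept in a system state component, and the version check is simplified (it ignores counter wrap-around). It uses scalar floats rather than Unity.Mathematics types so that it stays self-contained.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical component with several float fields.
public struct Pose { public float X, Y, Z; }

// Hypothetical chunk: current values, last-seen copies, and a change version.
public sealed class PoseChunk
{
    public Pose[] Current;
    public Pose[] Previous;     // stand-in for the system state copy
    public uint ChangeVersion;  // bumped whenever something writes to the chunk

    public PoseChunk(int count)
    {
        Current = new Pose[count];
        Previous = new Pose[count];
    }
}

public static class PreciseChangeSketch
{
    // Returns the indices whose value differs from the stored copy,
    // and refreshes the copy for those indices.
    public static List<int> CollectChanged(PoseChunk chunk, uint lastSystemVersion)
    {
        var changed = new List<int>();

        // Early out: nothing has written to this chunk since the last run.
        if (chunk.ChangeVersion <= lastSystemVersion) return changed;

        for (int i = 0; i < chunk.Current.Length; i++)
        {
            Pose c = chunk.Current[i];
            Pose p = chunk.Previous[i];

            // Non-short-circuit '&' evaluates all three comparisons without branching between them.
            bool same = (c.X == p.X) & (c.Y == p.Y) & (c.Z == p.Z);
            if (!same)
            {
                changed.Add(i);
                chunk.Previous[i] = c;
            }
        }
        return changed;
    }
}
```

A chunk whose version was bumped but whose values did not actually change still costs one pass of comparisons, but it reports no changed entities, so the heavy work is skipped.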

One question here. While this is absolutely accurate and great feedback, I am curious why those issues make you switch to hybrid. It seems to me that items 2 - 6 especially will make performance even worse in practice if you use hybrid mode instead. Am I missing something here?

I think a practical way to relieve pressure from copying stuff is to allow us to sync off the main thread. It would add a bit of complexity and most people probably wouldn’t use it, but it seems like the better bang for the buck once you start hitting diminishing returns on copying strategies. At the end of the day most games won’t be fully utilizing their cores, but we do almost always fight for time on the main thread.

We have actually never seen the memcpy become an issue. We previously had some other issues related to shared components searching for the right chunk, but those have been fixed.

Is this just an assumption or real-world information?

Of course, if you have 100k entities that you add / remove components from every frame, that’s an issue. But what’s the realistic number of add component / remove component calls you would like to be able to do every frame?

I actually have no idea if it’s memcpy. Actually I would think not. But I do know sync points on things like entity creation are very expensive. It’s pretty much the sole reason I don’t use ECS for some things. Working around it means not using idiomatic ECS, and at that point, well might as well just stick with jobs and my own patterns which don’t sync anything on the main thread.

I’d love to go pure in my ECS experiments, but as far as I know some things like audio, Cinemachine, animation, and physics still only work in the MonoBehaviour/GameObject world. And the integration with the classic part, which real-world projects need, obviously costs performance.

Currently, I’m thinking of mitigating the problem by doing as much as possible in a pure simulation layer and syncing the results to a presentation layer, which for now would contain the integration with classic. At the same time, I would try to limit the number of entities with classic components by creating and destroying the classic parts when a certain distance from the player is reached. Entities that are too far away obviously don’t need audio.

Regarding memcpy, I thought its use in ECS is so fast that it doesn’t matter that it is happening frequently.

That’s indeed the expected and valid reason to use hybrid.

This is what I have been doing, and I have to say that it works perfectly well and is much more flexible for me than the officially suggested hybrid approach. The only performance issue is of course the sync point between the pure ECS code and the GameObjects.

So for an FPS I’m planning, I will have up to 24 player entities each having roughly 800 bytes of data. This data includes input action buffers, stats, ability parameters, asset table indices, and a bunch of float3 and quaternions for tracking and IK purposes. I’m expecting roughly 10-20 components to be added and removed per frame per character, many of them being tags for proximity-driven IK states or for the AI systems.

I also expect to see about 100k entities where 10 different systems will add or remove a component to about 2% of the entities every frame. These entities will be anywhere between 20 to 100 bytes.

I haven’t actually tested this in Unity, but memcpy has bitten me before in some C++ projects.

2 through 5 are more just scalability concerns than blockers. For me, it is easier to write overly simplistic code that works and then switch to the most optimal solution later than to write optimized code now that has a scalability pitfall and then try to refactor to another optimized solution.

6: Dependency management is the real blocker for me. As much as I want performance, I also want to be able to predict what will happen with my code and know that things will execute in the right order rather than go into play mode and hope it works. With hybrid ECS, I can do that. But with jobs right now I can’t. There’s just too much magic with system execution order attributes and the JobHandle passed into JobComponentSystems. And doesn’t some of this stuff break if you don’t use the default world initialization? Would I be better off disabling default world initialization and building my own dependency graph systems?

I’ll try to put together a case study later today on this, but no promises. It will be hard for me to come up with something that doesn’t just coincidentally work because the project is small. But really I think I just need a more in depth explanation of what dependencies Unity handles in which use cases and what dependencies I should handle using which APIs.

Most of the time I have found that I don’t actually need to modify references at runtime, but instead need to swap which asset an entity is using for a particular frame. So for that I build asset tables with reference counting and then store the indices inside of components. It’s worked well for me in C++ and I see no reason why it wouldn’t work in Unity ECS. I believe I could also use it with pooled GameObjects too.
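A minimal sketch of such a table in plain C#, assuming assets are reference types and that components would store only the returned int index. AssetTable, Acquire, Release, and Get are hypothetical names chosen for illustration. The sketch is not thread-safe and does no bounds checking beyond what List<T> already does.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reference-counted asset table; components would hold the int index.
public sealed class AssetTable<T> where T : class
{
    private readonly List<T> _assets = new List<T>();
    private readonly List<int> _refCounts = new List<int>();
    private readonly Dictionary<T, int> _indexOf = new Dictionary<T, int>();
    private readonly Stack<int> _free = new Stack<int>();

    // Returns the index for 'asset', registering it on first use.
    public int Acquire(T asset)
    {
        if (_indexOf.TryGetValue(asset, out int index))
        {
            _refCounts[index]++;
            return index;
        }

        if (_free.Count > 0)
        {
            // Reuse a slot that was released earlier.
            index = _free.Pop();
            _assets[index] = asset;
            _refCounts[index] = 1;
        }
        else
        {
            index = _assets.Count;
            _assets.Add(asset);
            _refCounts.Add(1);
        }

        _indexOf[asset] = index;
        return index;
    }

    // Drops one reference; the slot is recycled when the count reaches zero.
    public void Release(int index)
    {
        if (_refCounts[index] <= 0)
            throw new InvalidOperationException("Slot is not in use.");

        if (--_refCounts[index] == 0)
        {
            _indexOf.Remove(_assets[index]);
            _assets[index] = null;
            _free.Push(index);
        }
    }

    public T Get(int index) { return _assets[index]; }
}
```

Because slots are recycled, an index is only meaningful while its holder still owns a reference, so swapping an entity's asset for a frame would mean acquiring the new index before releasing the old one.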

That would jobify pretty well too if I knew how to get a ComponentSystem to run after all jobs and other ComponentSystems touch a particular IComponentData type. I wouldn’t mind open sourcing some of that code if I can get this JobComponentSystem dependency management stuff straight in my head.

If you want to use hybrid then I suggest that you keep the default world initialization and just call:

World.DisposeAllWorlds();

This way you still have the automatic registration of the hybrid ECS injection hooks that are used e.g. by the GameObjectEntity component. After that you can configure the player loop to your heart’s content.

I’m still learning the job system myself but I think that you can call Complete() before proceeding with another system.

// Schedule the job
JobHandle handle = jobData.Schedule();

// Wait for the job to complete
handle.Complete();

Dependency handling can get difficult to reason about. The rules are consistent; it’s just that a lot of the complexity necessarily plays out at runtime. For example, you can have multiple different dependency graphs for the same set of systems if those systems don’t all fire consistently, resulting in sporadic errors.

My advice is stick to granular systems as a way to reduce this complexity.

This is equivalent to running a multi-threaded ECS without a job system. It works decently when most of your systems can run wide foreach algorithms and you don’t want to deal with the complexity of a full job system. It’s easy for beginners too, as there is no such thing as dependencies other than system order. But it falls apart when you have algorithms that don’t parallelize well but could be run in parallel to each other. Plus you pay the cost of thread synchronization at the end of every system.

But Unity has a proper and awesome job system and I want to use it.

I haven’t tested it myself (because I haven’t had the need so far) but with the dependencies properly set up it should be enough to call Complete for the last job in the specific job chain which you know touches that particular ComponentData.
After that point, you know they are all completed. I don’t know if you can pass a JobHandle from one system to another. Perhaps with a system injection in another system.
https://github.com/Unity-Technologies/EntityComponentSystemSamples/blob/master/Documentation/content/ecs_in_detail.md#injecting-other-systems

I’ll repeat the request from Mike again.

Just a quick note. The way job dependency management works is based on very simple principles. Our basic rule is that jobified code should not affect behaviour. (Determinism by default principle)

The order in which systems run is what determines the behaviour. The jobs are simply scheduled and guaranteed to have the right dependencies to ensure that behaviour is 100% exactly the same as if you ran the code on the main thread.

So if you want to take full control over the execution order, maybe the best approach is to just call Update(); on each system manually. For our own FPS sample game that is being developed, we are using that approach, mostly because server / client have different systems that need to run and the guys working on it prefer an approach where the system order is explicitly typed out. Once you do that, system update order is fully in your control. And whether or not you execute jobs should not affect any behaviour at all.
https://github.com/Unity-Technologies/EntityComponentSystemSamples/blob/master/Documentation/content/ecs_in_detail.md#automatic-job-dependency-management-jobcomponentsystem

(Longer term we want to avoid that but the code that orders systems and injects them into the playerloop needs to be rewritten to support use cases like networked games properly)

I was actually just looking at the documentation you linked to when you posted. Here’s where I am confused:

// Any previously scheduled jobs reading/writing from Rotation or writing to RotationSpeed
// will automatically be included in the inputDeps dependency.

How?

Thus if a system writes to component A, and another system later on reads from component A, then the JobComponentSystem looks through the list of types it is reading from and thus passes you a dependency against the job from the first system.

But you passed in the inputDeps before I even returned back the JobHandle. So how does the JobComponentSystem even know what to look for?

Unless it is doing some pruning of inputDeps to only use the relevant dependencies when you call Schedule()? And in that case inputDeps really just contained all the previously scheduled jobs via ECS? I’d be both really surprised and really impressed if that’s how it works.

But then…

So JobComponentSystem simply chains jobs as dependencies where needed and thus causes no stalls on the main thread. But what happens if a non-job ComponentSystem accesses the same data? Because all access is declared, the ComponentSystem automatically Completes all jobs running against ComponentTypes that the system uses before invoking OnUpdate.

Where is access declared?

This sounds more magic than MonoBehaviour’s magic methods. And does this also work for main thread access in JobComponentSystem for dynamically scheduling jobs? Does this work with the new ChunkIteration API? Does this still work when manually calling Update() on systems?

I’ll try to upload an example project next weekend if it doesn’t click by then. Thank you all for the help so far! I feel I’m really close!

A ComponentSystem persistently remembers its read / write dependencies. Normally they are all declared during system creation. That is why GetComponentGroup / GetComponentFromEntity is defined on the system.

Internally the EntityManager maintains the dependencies for each reader / writer type, and then returns a dependency with the right set of dependencies. This means that different systems accessing different, separate data types can run their jobs in parallel to each other, all without the author of these systems having to write any manual code.

One corner case comes to mind: what if a new reader / writer is introduced to the system during OnUpdate? In that case we introduce a JobHandle.Complete(); against that type on the main thread on first addition of that reader / writer,
e.g. if you called GetComponentGroup(typeof(Position)) during OnUpdate. This happens only on the first run of that system.

In practice this happens only on the first run of OnUpdate, and preferably everyone would cache their component groups in OnCreateManager.

Not perfectly happy with that corner case but it was the only way to make the passing of dependencies automatic.
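The per-type bookkeeping described above can be pictured with a toy model in plain C#. This is not Unity's implementation: DependencyTableSketch and Schedule are hypothetical names, jobs are plain integer ids instead of JobHandles, and component types are strings. It only shows why declaring read and write types up front is enough to hand a system its input dependencies before that system has scheduled anything itself.

```csharp
using System;
using System.Collections.Generic;

// Toy model only: hypothetical names, integer job ids instead of JobHandles.
public sealed class DependencyTableSketch
{
    private sealed class Entry
    {
        public int Writer = -1;                        // last job that wrote this type
        public List<int> Readers = new List<int>();    // jobs reading since that write
    }

    private readonly Dictionary<string, Entry> _byType = new Dictionary<string, Entry>();
    private int _nextJobId;

    private Entry EntryFor(string type)
    {
        if (!_byType.TryGetValue(type, out Entry entry))
        {
            entry = new Entry();
            _byType[type] = entry;
        }
        return entry;
    }

    // Computes the dependencies for a system from its declared types,
    // then records the new job as reader/writer of those types.
    public int Schedule(string[] reads, string[] writes, out SortedSet<int> inputDeps)
    {
        inputDeps = new SortedSet<int>();

        // A reader must wait for the last writer of each type it reads.
        foreach (string type in reads)
        {
            Entry entry = EntryFor(type);
            if (entry.Writer >= 0) inputDeps.Add(entry.Writer);
        }

        // A writer must wait for the last writer and all readers since then.
        foreach (string type in writes)
        {
            Entry entry = EntryFor(type);
            if (entry.Writer >= 0) inputDeps.Add(entry.Writer);
            foreach (int reader in entry.Readers) inputDeps.Add(reader);
        }

        int job = _nextJobId++;
        foreach (string type in reads) EntryFor(type).Readers.Add(job);
        foreach (string type in writes)
        {
            Entry entry = EntryFor(type);
            entry.Writer = job;
            entry.Readers.Clear();
        }
        return job;
    }
}
```

In this model a job that writes Velocity gets no dependency on earlier Position jobs, so the two can run side by side, while a second Position writer has to wait for both the first writer and any readers in between.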

So I spent a few hours building a test project to see if the dependency management system actually worked in a predictable manner. This is a simpler use case than what my actual use case will be, but it still broke dependency management.

So I have 4 systems which update as follows:

I create a million entities containing TestData1 and TestData2 components initialized to 0. I also create a result entity with the DifferenceSum component also initialized to 0.

JobSystem1A writes a 1 to all TestData1 components.
JobSystem1B writes a 2 to all TestData2 components.
JobSystem2 attempts to read the first TestData1 component in the first chunk and logs to the console whether or not the value was changed from 0 to 1. Then it schedules a job that subtracts TestData1 from TestData2 for each entity and adds all the differences into the DifferenceSum of the result entity. This should be equivalent to the number of existing test data entities.
System3 attempts to read the DifferenceSum result and logs the result to the console.

I left a comment in JobSystem2 marking a block of code where the main thread dependency can be commented out. But even when I do this I still get dependency errors.

On a side note, IJobProcessComponentData behaves kind of funky in terms of total execution time and number of threads utilized depending on whether I specify a batch size or not.

But right now, I’m more concerned about the errors and incorrect output.

To run, import all the scripts into a project set up for Unity ECS and attach the EcsTestRunner.cs script to some active GameObject. I also pressed pause before entering playmode so that I could inspect the order of debug messages in the first frame.

3628378–295531–ECS Test.zip (5.53 KB)