Burst / DOTS Systems Performance Behaviour / Documentation

I’m currently learning DOTS and doing some experiments to get a feeling for the performance behavior with different numbers of entities and different layouts for components. I’m using Unity 2019.4.5, Entities 0.11.1.

In my current experiment, I create 1,000,000 entities and have a system that checks whether each entity (or the one component it has) is currently active by checking Time.ElapsedTime against a time interval stored in the component.

In the Entity Debugger’s Systems pane, with 1 million entities, that system takes about 0.02ms. So, to put some “load” on the system, I wrapped the actual code in a loop. Running it 100 times changed almost nothing (still about 0.02ms). At 1000 iterations, it suddenly went up to 20ms; 500 gave 10ms. So far, so good, even if the jump from 100 to 500 (0.02ms to 10ms) seems odd, but probably something “breaks” between 100 and 500, and from 500 to 1000 it scales linearly. My guess was that loop vectorization is possible at 100 but no longer at 500 (that’s apparently not it, though).

Then, I realized I had a bug and replaced an if-statement with a bool assignment. Time went back down to 0.02ms (still with the 500 loop, and even when I increased that loop to 1000 iterations).

So I thought “hm, okay, branching in burst-compiled code is very bad for performance”. Except when I added the if-statement back in, time stayed at 0.02ms. Removed the bool assignment (but kept the if-statement), and BOOM, back to 20ms.

That’s a factor of 1000 worse performance, from removing a simple assignment.

Turns out when I include gem.SpawnOrDestroyThisFrame = gem.IsActive != nowActive; I get only one “loop not vectorized” warning in the Burst Inspector, on the line of the for-statement. When I remove it, I get two of these messages, both on the line of the if-statement. Changing the number of iterations (100, 500, 1000) doesn’t make a difference in that regard.

Now, I could probably spend the next two weeks trying to figure out all the possibilities but … is there some documentation that explains what kinds of statements / coding constructs have which performance impact? I did find the Burst User Guide, but from reading that, I don’t understand why adding this line improves performance / changes vectorization the way it does. I even tried assigning gem.IsActive != nowActive to a temporary bool that I then use in the if-statement, but that doesn’t seem to change anything.

Here’s the code of the system:

protected override void OnUpdate() {
    double time = Time.ElapsedTime;
    Entities
        .ForEach((ref GameplayEventMovement gem) => {
            for (int i = 0; i < 1000; i++) {
                gem.CurrentTime = time;
                bool nowActive = gem.SpawnTime < time  && time < gem.DestroyTime;
               
                // *removing* this => performance 1000 times *worse*
                gem.SpawnOrDestroyThisFrame = gem.IsActive != nowActive;
               
                // keeping or removing this makes no difference, if above statement is present
                if (gem.IsActive != nowActive) {
                    gem.SpawnOrDestroyThisFrame = true;
                }

                gem.IsActive = nowActive;
            }
        })
        .WithName("CheckGameplayEventActivityJob")
        .ScheduleParallel();
}

It would be great if we could see the struct declaration of GameplayEventMovement.

But in this case, I’m guessing it may be a bug in the vectorizer when it tries to vectorize a loop whose loop variable i isn’t used anywhere? Since every iteration operates on the same gem, the correct result should be that the 1000-iteration loop cannot be vectorized, because the next iteration may read data from the previous one (gem.IsActive; though since time is frozen, the result must be the same every iteration).

Or maybe the vectorizer ran into problems analyzing the inner loop and decided that even the outer loop (the one you can’t see in the source code, which iterates over each gem; those iterations should be independent of each other) can’t be vectorized?

ps. I think the best way is to keep both versions (the one with the if that is faster, and the one without it that is slower), write 2 perf tests, and then just send the project to Unity, so you don’t have to debug into what is potentially a bug yourself.


That would be:

public struct GameplayEventMovement : IComponentData {
    public double SpawnTime;
    public double ImpactTime;
    public double DestroyTime;

    public double CurrentTime;
  
    public bool IsActive;
    public bool SpawnOrDestroyThisFrame;
}

CurrentTime is only for testing / debugging purposes (makes things a little more convenient for me in the Entity Debugger).

Maybe adding that loop for testing purposes wasn’t the greatest idea because it introduced “funny stuff”. At least I did learn a little bit about loop vectorization in Burst but I do wonder what other oddities I’ll run into.

I guess the best approach for learning DOTS is taking really just one single step at a time, and then profiling on all target devices (this will end up on Windows, Quest, PS4 and probably Linux), then take the next one, single, small step?

Hey jashan, long time…

One recommendation I have for getting started with DOTS is to not over optimise.

With DOTS you get very good performance if you just follow some basic rules:

  • All game code is bursted. (No managed classes in any of the game code, wherever possible)
  • Write parallel code where you can
    Using Entities.ForEach().ScheduleParallel() for most of your code is a really good starting point

This gets you really solid performance. It seems like you are able to handle a million entities. As a starting point, maybe that’s a place to take a step back and just say that’s totally fine, and you don’t need to go much further as a first step? It’s probably a ~100-200x speedup compared to what you are used to getting out of the box.

It looks like the code you posted here is doing that already…

Now if you want to go further with optimization, that’s cool; with DOTS you can go all the way to the limit of the hardware. I think after following just the basic rules, the next step is to really understand how the hardware actually works and write your code accordingly. To really get a good handle on it, it helps to understand how SIMD actually works.

This talk is a good intro on how to use the lowest-level APIs, talking to the hardware directly:

Writing with intrinsics isn’t necessary to get great performance, but understanding what the hardware does helps a lot.
The most important thing in relation to your code above is to avoid branches at all costs. If you have branches, the compiler can’t unroll / vectorize your code.

In your code one way of doing this is like this:

gem.CurrentTime = time;
bool nowActive = (gem.SpawnTime < time) & (time < gem.DestroyTime);

// *removing* this => performance 1000 times *worse*
gem.SpawnOrDestroyThisFrame = gem.IsActive != nowActive;

// keeping or removing this makes no difference, if above statement is present
gem.SpawnOrDestroyThisFrame = math.select(gem.SpawnOrDestroyThisFrame, true, gem.IsActive != nowActive);

gem.IsActive = nowActive;

I would expect that the reason for the 1000x speedup is because burst is extremely good at detecting code that has no side effects and completely eliminates it. I am going to bet that the reason for taking out the SpawnOrDestroyThisFrame line is so much faster, is because the resulting behaviour is that burst can prove that running the loop 1000 times is unnecessary & it can just calculate the value once and skip the loop…
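To illustrate that idea (this is just my reading of the dead-code-elimination argument, not verified against the actual Burst output): once SpawnOrDestroyThisFrame is written unconditionally, every iteration of the loop computes exactly the same values, so the compiler is free to collapse the 1000 iterations into the equivalent of a single pass:

```csharp
// Hypothetical post-optimization shape of the loop body: nothing
// depends on i, and time is constant, so all 1000 iterations are
// identical and can legally be reduced to one.
gem.CurrentTime = time;
bool nowActive = gem.SpawnTime < time && time < gem.DestroyTime;
gem.SpawnOrDestroyThisFrame = gem.IsActive != nowActive;
gem.IsActive = nowActive;
// (the for-loop over i is eliminated entirely)
```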


A good way of finding out is to open the Burst Inspector and look at the actual generated disassembly. It can help you understand what the compiler actually did with your loop.


Lastly… using doubles is something you should avoid wherever you can, for performance reasons. That’s good baseline advice when writing simple game code.


Hey Joachim, thanks for chiming in! I remember the conversation we’ve had about this a looong time ago, and it’s really cool to finally get to play with it!

So my initial thought that branching can be a problem in DOTS was right … and I must not ever add fake-loops for trying to slow things down a little bit because it makes very bad things happen :wink:

Is my assumption correct that when I stay within ECS/DOTS, instantiating and destroying entities more or less randomly is fine (unlike in the GameObject world where you really need to pool things or GC will eventually bite you)? In my use case, I’m talking about maybe hundreds of objects at any given time, usually with a life-time of a few seconds.

Right now the answer is: it depends…

If you instantiate lots of the same prefab, say 50000 times, that’s very fast. You can instantiate around 50k instances in 2ms.
var array = EntityManager.Instantiate(prefab, 50000);

If you instantiate 1 at a time, that can still add up. It’s still orders of magnitude faster than game objects, but it’s worth noting that right now the batched path is much better. There is still plenty we can and will do to make the single-instantiate path faster. It’s definitely a core goal to make worrying about instantiate cost a non-thing.

It’s also important to note that instantiating objects doesn’t allocate GC memory, and doesn’t allocate unsafe memory either. It all goes through pooled memory internally.


@jashan About your struct: it may or may not be related, but your struct is 8*4 + 2 (bool = 1 byte) = 34 bytes, which is an unfortunate size, since the common cache line size is 64 bytes; if you could shave off just 2 bytes, you would get 2 gems in one read. Supposing Burst optimizes nothing and the logic runs exactly as written, the first data read is at gem.CurrentTime. If ECS aligns the data, the fetch goes back to the beginning of the gem (SpawnTime) and pulls in 64 bytes, getting this entire gem plus almost all of the next one, but missing the next gem’s final 2 booleans. So on the next outer iteration, you are missing the last few bools of that gem, because the previous fetch couldn’t reach them. Separating the last 2 bools into a new component might help if you are keeping the double data type: then you get 2 gems in one read, and many pairs of booleans in another read.
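A sketch of the split suggested above (the component names GameplayEventTiming and GameplayEventState are made up for illustration): the four doubles become a 32-byte component, exactly half a cache line, and the two flags move into their own tightly packed component stream.

```csharp
using Unity.Entities;

// 4 * 8 = 32 bytes: two of these now fit exactly in one
// 64-byte cache line, with no flags straddling the boundary.
public struct GameplayEventTiming : IComponentData {
    public double SpawnTime;
    public double ImpactTime;
    public double DestroyTime;
    public double CurrentTime;
}

// The flags live in a separate, densely packed component array,
// so one cache line fetch brings in 32 entities' worth of them.
public struct GameplayEventState : IComponentData {
    public bool IsActive;
    public bool SpawnOrDestroyThisFrame;
}
```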

About the slowdown, the best approach might be to just stare at the assembly… so I tried pasting your job into the Burst Inspector and viewing it. Looking at the code with the 1000-iteration loop, both versions contain a routine that assigns 999 and then decrements it one by one, with no logic in between at all, until it can jump out. So we can conclude that in both versions the work has been hoisted out of the loop; we can remove the loop and try again.


(disassembly screenshot; notice that the +32 and +33 offsets here are likely the accesses to both of your booleans)

After removing the loop, we can compare 3 cases :

  • The one with both the compare-assign and the if (faster)
  • The one with only the compare-assign (you didn’t mention it, but I wanted to try)
  • The one with only the if (with the compare-assign commented out; this one is now 1000x slower)

In all 3 cases it still reports that the loop (now the outer one, since we removed the 1000x inner one) could not be understood by the vectorizer, as stated in the inspector. The reason is likely that the contiguous data is too big (4x double + 2x bool), but you assign to only part of it (the last double and the booleans), and it has a hard time using vectorized mov instructions to work on only those bits… It would likely be better to separate the final double CurrentTime into its own component, and the 2 bools together into another. Maybe it could vectorize then.

Ignoring the vectorization problem (note that the addressing is for some reason inverted: the pictures above all have rax + 33, while now rax - 9 lands on your 4th double, rax - 1 on your bool, and so on):

Cases 1 and 2 produce the same code, but the yellow text is different. I think the yellow text just picks out some example source lines for you. Plain rax is the last bool, IsActive, where a setne is emitted; rax - 1 is SpawnOrDestroyThisFrame, which is given its value with a mov. There is no yellow text mentioning SpawnOrDestroyThisFrame in the first case, but the mov is definitely assigning to that variable. So we can conclude that in case 1, the if with the true-assignment inside has been dropped in favor of the compare-assign. Case 2 only has the compare-assign, so it produces code equal to case 1. The add 40 proves that this faster version is still not vectorized, since it steps to the next element one at a time. And I think it’s 40 instead of 34 because 34 is an awkward size, so the data after your 2 bools is padded to a multiple of 8… anyway, you should do something to get it to 32 bytes or below.

In the last case, I think you effectively forced it to compile the if, because you removed the better compare-assign version. The code is longer, and is likely the source of the slowdown you see. You see tons of j* instructions here doing crazy stuff… there is even a duplicated nowActive clause in yellow, because it seems to try to be smart by computing it once early, and again inside a loop that advances by rax + 40 one element at a time. (Note that the slowdown need not be related to the vectorization problem; it may simply be the longer assembly, because the inspector says it could not vectorize the outer loop either way.)

The first and second cases also have 3 j* instructions, which is still suspicious: since there is no if once it decides to use the compare-assign, there should only need to be 1 je for the +40 loop so it can stop. Why are there a ja and a jmp as well? I found that replacing && with & in the nowActive calculation, as Joachim did, eliminates it down to 1 jump, as expected (!). Along with rearranging your struct to be lean enough to get vectorized, I think it will then be fast.
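For reference, the && vs & difference on the nowActive line (using the field names from the code above):

```csharp
// && short-circuits: time < gem.DestroyTime is only evaluated when
// the left side is true, which forces a conditional jump.
bool withBranch = gem.SpawnTime < time && time < gem.DestroyTime;

// & always evaluates both sides: both comparisons can be computed
// unconditionally and combined with a single AND, no jump needed.
bool branchless = (gem.SpawnTime < time) & (time < gem.DestroyTime);
```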


@jashan I remember you were one of the first people who made an actual working game with Unity, which Unity showcased a bit as well. It was a Qix-like game, IIRC :slight_smile: Anyway, for what it’s worth, we are making a mostly crafting-based big-world game with DOTS, and we have a show system where users create firework-like shows with lots, and I mean lots, of colorful balls. For now they don’t have lights, but as soon as URP adds its deferred renderer, they’ll have lights too.

Atm we create around 50k entities per second, and destroy about 50k as well, using single Instantiate calls in command buffers inside Entities.ForEach().ScheduleParallel(), and our frame rate is 30 FPS on a relatively old Core i5 processor in the editor with safety checks enabled. And we have not heavily optimized anything yet. All balls have triggers, and there is a job which processes all of the triggers, and so on.

I’ll write a showcase post soon, and hopefully some generic tutorial-type posts, on our dev blog and on Gamasutra, but what I wanted to say is that it is already really fast without trying too many optimizations, as Joachim said.

The beauty is that once you get used to using native containers to bake the output of one system as the input of another, and use linear queries (i.e. ForEach) where possible instead of calling ComponentDataFromEntity too much, it is fast as hell!


Ok, so then it’s probably best to instantiate them all when a session begins and just hide them until needed, and then destroy them when the session is over. This is super-helpful to know. The nice thing is that in my use-case, a lot of what is going on is quite deterministic. I also have fairly specific needs when it comes to physics, so I’m also looking at Unity Physics for that.

This is so cool. I still have some pretty crazy, ugly stuff going on with pooling in my old approach (I’ve had quite a few sporadic bugs that occurred due to state not being properly reset) … now I just have to be careful to not be too cautious when rebuilding all this stuff on the new tech stack :wink:

In fact, I can go much lower than that: This component is really just about figuring out if at the current time, a given Entity should be visible. So SpawnTime, DestroyTime and IsActive are actually enough. And IsActive will also go to another component (see below why I believe that’s how I should do it).

In my old approach, I had a sorted list and just checked if the next item was ready to be spawned, then incremented the index until all items that needed to be spawned were spawned; the items were responsible for their own destruction when their time had come. Good old OOP, I will miss you… It was elegant, especially in OO thinking, until I realized I need to move time back and forth (looping, rewinding, jumping to a random point in time).

What I’m working on now is a system where that kind of “random time access” works. There’s a bit more complexity there because in the case of looping, I’ll need to keep track of two times - but that’s another story :wink:

So, even ImpactTime is probably irrelevant for that component. Let alone the positions. And the idea behind SpawnOrDestroyThisFrame also probably is history already: The rationale behind that was to set it while iterating over the items, and then either instantiate or destroy items as needed in any given moment. After what Joachim wrote, I’ll probably just need “IsActive” (and that will then also determine for which of these items I need to calculate positions, and stuff like effects).

Oh, and also, I changed double to float. I do like double for the precision (and TimeData.ElapsedTime comes in double, and so do my audio times, which must be double for precise looping) - but for the rendering, float should be fine and simplifies a lot of things, plus is much more compact in memory.

Hehe, I like that. That’s also me when working with shaders. I stare at them, and sometimes, they give in. But more often, I do :wink:

I might actually get away completely without any “ifs” there. In the code I posted above, it was actually “buggy thinking” but I will then need to filter for IsActive in a later step. But for that purpose, it seems that Shared Component data is what I need. I need to check IsActive for every Entity each frame, even though it will only change twice in each session (become active, then become inactive) - unless there’s jumping back and forth in time.

But then, there’s a lot of stuff to be done for the (comparatively few) entities that have IsActive set to true.

I have also read the rest of your posting and appreciate you laying it out like that a lot. I’ll admit that it would take me much longer to understand everything you wrote fully - but I picked up a lot of useful information, so that’s super-cool. I’m very happy! Thank you!


Haha, yeah - I remember you, too! It was originally called “JC’s Unity Multiplayer TRaceON” and then just “Traces of Illumination”. There were much better (and more complete) games done in Unity even at that time, though. I still did a mobile version with some 3rd party UI system, IIRC … but never completed all twelve levels (what Valve has with 3, I have with 12, it seems :wink: ). I believe that game got stuck at Unity 2.5 … or was it 3.5 … probably 3.5? Then I wrote a book based on the same game (but only one level, HA!) and realized everything I did very wrong with my original approach … that book was finished with Unity 5 (it was one of the two books on Unity in German at that time).

Very cool!

Yeah, not over-optimizing is definitely very sound advice! But I feel I really need to have a basic understanding of what I’m doing in the DOTS-approach. I very much loved the abstract world of OO and I guess one could say my mind pretty much works in OO … so at the moment, getting into the data oriented mindset is a lot of effort.

But I had quite a few epiphanies even today, so it’s fun :slight_smile:

Sounds cool :slight_smile:


Hm … maybe not? I just found https://gametorrahod.com/everything-about-isharedcomponentdata/ and believe that may have the answer … but haven’t quite digested all of it, yet :wink:

The way it started to work for me was:
Don’t think of the problem as “what objects in the real world, with what properties and behaviors, should I use to model the problem?” Instead, think of it as “what is the data that I need (i.e. a set of entities and their instantiation and destruction times), and what do I need to do to it to get the result I want? How does it interact with other data transformations?”

Watching a few Mike Acton data oriented talks helps a lot too. Imagine he is screaming those in your face :slight_smile: Then don’t resist and dive in and reflect on it as you move forward.

Regarding Burst: take a look at Intel’s x86 manual or some assembly tutorials to learn (or re-learn) how processors actually work, if needed. I really did read a good portion of Intel’s manual.

The OOP vs DOD approach is a bit like pure rational philosophy vs experimental science: in philosophy you try to create a mental model of the world and, if you are sane, check how it works in the real world and how much the real world lends itself to it; experimental science checks what actually happens in the world and is only concerned with what causes what, and how to improve it.

By looking at the real things you’ll have these advantages

1- there is no mental model to learn other than the actual thing happening
2- it is easier to modify since it doesn’t rely on any additional models unrelated to the actual problem
3- it has higher perf because it is aligned with the real world/hardware/running environment
4- It is easier to maintain, because of the above: when you explore something for the first time (or after a month), you need to understand how it can go wrong and how it can be changed for the better, and not having to decode an additional model on top makes it less scary and more controllable

I wrote this in case you needed more encouragement :slight_smile:


If I have understood you correctly here, I think that ISharedComponentData is not the right tool. As I understand it, you are talking about storing a single bool’s worth of information - an isActive flag for each component. I know of two main approaches to this.

The first is a tag component, which is just an empty struct. struct IsActive : IComponentData {}. You add this component when you want the value to be true, and remove it when you want the value to be false. This gives you zero cost querying. Your systems can do Entities.WithAll<IsActive>() and Entities.WithNone<IsActive>() to filter. The act of adding and removing the tag does have some cost, though. These “structural changes” require a sync point.

The other is a simple struct IsActiveValue : IComponentData { public bool Value; }. This doesn’t allow efficient querying, so every system has to process every entity and check this bool. That’s a lot of extra processing, but it allows you to change the value without a sync point.

The first way is usually better because it allows systems to accurately query for only the data they need to touch, and this is a common pattern. It’s the semantically beautiful approach. The second way may be better under unusual circumstances where you’re changing the flag very rapidly, where the flag is almost always true so the querying efficiency doesn’t filter many entities out, or where you’re consuming the flag from so few systems that the structural change overhead is not justified. Personally I would always default to the first option, and consider the second option as an optimisation under special circumstances, when needed.
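A side-by-side sketch of the two patterns (system and component names are illustrative; this assumes SystemBase and Entities.ForEach from recent Entities versions):

```csharp
using Unity.Entities;
using Unity.Transforms;

public struct IsActive : IComponentData {}                          // tag
public struct IsActiveValue : IComponentData { public bool Value; } // bool

public class ActiveProcessingSystem : SystemBase {
    protected override void OnUpdate() {
        // Tag pattern: filtering happens in the query itself, so
        // inactive entities never reach the job at all.
        Entities
            .WithAll<IsActive>()
            .ForEach((ref Translation t) => {
                // ... work on active entities only ...
            })
            .ScheduleParallel();

        // Bool pattern: every entity is visited and checked, but the
        // flag can be flipped without a structural change.
        Entities
            .ForEach((ref Translation t, in IsActiveValue active) => {
                if (!active.Value) return;
                // ... work ...
            })
            .ScheduleParallel();
    }
}
```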

We’re all still working out the best practices, though.


Joachim Ante said in another thread that it’s important to avoid branches whenever possible.

So this bool has to be checked in a system and thus leads to slower code by default. Maybe it is better to avoid this “pattern” altogether and find a better solution for the problem?

That’s the “problem” here. As early adopters, this “burden” is placed on our backs. But it’s also interesting and exciting. It just requires a lot of time and effort to study.


Right, this is part of why this bool pattern is generally slower than the tag pattern. Empirically there are some situations where the bool comes out ahead, but generally it’s best to use tags. If you can merge the structural change into an existing ECB sync point then there’s very little cost to that. I believe that the ECS has some specific optimisation to try to make tags fast. They’re the standard approach, and are used throughout the engine.

It’s worth noting that it’s sometimes possible to do the bool pattern without branches. For example, an IncrementIfEnabled job body could be written as data.Value = select(data.Value, data.Value + 1, data.IsEnabled).
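Written out, the IncrementIfEnabled example might look like this (Counter is a made-up component; math.select(a, b, c) returns b when c is true, a otherwise):

```csharp
using Unity.Entities;
using Unity.Mathematics;

public struct Counter : IComponentData {
    public int Value;
    public bool IsEnabled;
}

public class IncrementIfEnabledSystem : SystemBase {
    protected override void OnUpdate() {
        Entities
            .ForEach((ref Counter data) => {
                // Both operands are always computed; select just picks
                // one of them, so no conditional jump is emitted.
                data.Value = math.select(data.Value, data.Value + 1, data.IsEnabled);
            })
            .ScheduleParallel();
    }
}
```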

That’s the burden we all volunteer for when choosing to play with preview features. I do agree that better centralised documentation would be good for everybody. A lot of what I know I only learned from stumbling across an obscure forum thread. Unity would probably get better feedback if people were more able to use these features without such a burden.


That sounds good. At the moment, at least on my PC, it looks like I might actually be best off, performance-wise, keeping it simple and just processing everything all the time. The reason is that the number of entities seems to hardly make any difference (I can do 200 or 1,000,000, and the difference is between 0.03ms all the time and 0.03ms most of the time, sometimes going up to 0.04ms, and very rarely to 0.05ms).

If, on the other hand, I split the processing into two jobs (first determines which items are currently active, second does the actual processing), even with only 200 entities, I usually get 0.05ms, sometimes 0.04ms, sometimes 0.06ms. And when I ramp it up to 1000000 entities, it looks like I’m getting the exact same performance (and I’m not even filtering the items in the second step - just noping out if it’s not active by if (!item.IsActive) { return; }).

But this is on a very powerful desktop. So I’m afraid I really need to move quickly to testing this also on mobile and consoles. Also, at the moment, the processing is still quite simple, and it will eventually become a little more complex.

So one question about the tag component: Is there an easy way to add/remove the component “naively”? In other words, can I just add the component and if it’s already there, it will be fine? Or remove the component, and if it’s not there, all will be fine? Or do I have to

a) find out is it active or not
b) find out if it currently has the component attached or not
c) proceed accordingly

Since c) will happen in a command buffer, I guess that already answers the question because the fewer commands that command buffer has, the better, right?

But in that case, I wonder if adding an IsActive property back for quicker tests (at the cost of larger chunks) might be reasonable. Ok, forget it - just thinking out loud, it’s actually much simpler than that:

a) for all inactive items (WithNone): are there any that need to become active?
b) for all active items: if no longer active, drop out and remove active in command buffer, otherwise, do processing

So in that case, as long as the overhead of having two jobs outweighs the cost of the actual processing, I’ll just run over all items, and when the actual processing becomes too costly, I fall back to the a/b approach where the second step only runs over the items that actually need processing.
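The a/b split above could be sketched roughly like this (a hypothetical system, assuming the tag-component version of IsActive and the GameplayEventMovement struct from earlier in the thread; the concurrent ECB API is ToConcurrent() in Entities 0.11, renamed AsParallelWriter() in later versions):

```csharp
using Unity.Entities;

public struct IsActive : IComponentData {}

public class ActivationSystem : SystemBase {
    private EndSimulationEntityCommandBufferSystem _ecbSystem;

    protected override void OnCreate() {
        _ecbSystem = World.GetOrCreateSystem<EndSimulationEntityCommandBufferSystem>();
    }

    protected override void OnUpdate() {
        double time = Time.ElapsedTime;
        var ecb = _ecbSystem.CreateCommandBuffer().ToConcurrent();

        // a) inactive items: activate any whose time window has arrived
        Entities
            .WithNone<IsActive>()
            .ForEach((Entity e, int entityInQueryIndex, in GameplayEventMovement gem) => {
                if ((gem.SpawnTime < time) & (time < gem.DestroyTime))
                    ecb.AddComponent<IsActive>(entityInQueryIndex, e);
            })
            .ScheduleParallel();

        // b) active items: deactivate expired ones, otherwise do the work
        Entities
            .WithAll<IsActive>()
            .ForEach((Entity e, int entityInQueryIndex, ref GameplayEventMovement gem) => {
                if ((time < gem.SpawnTime) | (gem.DestroyTime < time)) {
                    ecb.RemoveComponent<IsActive>(entityInQueryIndex, e);
                    return;
                }
                // ... actual processing ...
            })
            .ScheduleParallel();

        // Make sure the ECB system waits for these jobs before playback.
        _ecbSystem.AddJobHandleForProducer(Dependency);
    }
}
```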