Share your multi-threading tips and tricks.

My hope for this thread would be to share knowledge and help others.

I’ll start by sharing at least my part of the experience with DOTS and the C# job system. I also wanted to contribute something to this awesome community, and to give others the chance to help newcomers create better-performing and more energy-efficient games.

Test scene and setup
I have a performance testing environment made with Unity's Performance Testing API. It loads scenes with different amounts of spawners and targeting systems. Please note that the targeting system is a managed component. If this were a pure ECS-based system, the results would be quite different, but that would defeat the idea of the plugin.

The scene looks like this when it’s running (256 systems and spawners).

The initial results for the benchmarks before optimization.

What is being measured is the targeting system runner’s update loop, which runs all the systems.

            Measure.Method(() => { runner.Update(); }).WarmupCount(1).MeasurementCount(15).Run();
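For context, a measurement like that sits inside an ordinary Unity Performance Testing test. Here is a minimal sketch of such a test; TargetingSystemsRunner and the test name are placeholders, not the real plugin types:

    using NUnit.Framework;
    using Unity.PerformanceTesting;

    public class TargetingSystemPerformanceTests
    {
        // Hypothetical runner that updates all targeting systems in the loaded scene.
        TargetingSystemsRunner runner;

        [Test, Performance]
        public void Runner_Update_Performance()
        {
            // One warmup pass, then 15 measured samples of the runner's update loop.
            Measure.Method(() => { runner.Update(); })
                .WarmupCount(1)
                .MeasurementCount(15)
                .Run();
        }
    }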

After the thread scheduling optimizations the results are better.


A big part of the performance improvement comes from better thread scheduling. First, I’ll show profiler images that visualize the difference.

Before (old image):


After (256 systems and 2 entity queries):

As you can see, the green lines/threads are denser in the lower image, which means the cores spend more time working and less time idling. The empty space is main-thread work that is mostly there because the editor version is not as optimized as a compiled build. As far as I know, much of the work visible in that empty space is stripped out of the final build, which is around 3-5 times faster than the editor version.

The algorithm is made so that the more queries you add, the faster it gets, since each query gets its own thread. The algorithm itself is my own version of the boids simulation algorithm found in the Unity examples.
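To illustrate the chunking idea only (this is not the plugin’s actual code), here is a rough sketch of how target positions can be hashed into cells of a NativeParallelMultiHashMap (NativeMultiHashMap in older Collections versions), so a system only needs to look at the cells that fall inside its view. The cell size and type names are made up:

    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;

    [BurstCompile]
    struct HashTargetsToCellsJob : IJobParallelFor
    {
        public float cellSize;                                  // size of one chunk/area
        [ReadOnly] public NativeArray<float3> targetPositions;  // positions of all targets
        public NativeParallelMultiHashMap<int, int>.ParallelWriter cellToTargetIndex;

        public void Execute(int index)
        {
            // Quantize the position into a cell coordinate and hash it into a single key.
            int3 cell = (int3)math.floor(targetPositions[index] / cellSize);
            cellToTargetIndex.Add((int)math.hash(cell), index);
        }
    }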

6 Likes

Here is an image of the performance tests with 8 queries. There are more systems in these images because I have been developing the performance environment at the same time.

In the image you can see that with 8 queries, the milliseconds double when you double the system and spawner count. Without the boids-like optimization, the milliseconds would grow exponentially instead of doubling.

So how did I manage to gain almost 2x performance through better thread scheduling?

First of all, by organizing the jobs and changing the thread scheduling pattern.

So here is what I did.

Before (simplified):

  1. Init variables (main-thread job).
  2. Clear targeting arrays (multi-threaded jobs).
  3. Create targets for just-spawned objects (multi-threaded jobs).
  4. Calculate frustum planes from targeting cameras (main-thread job).
  5. Calculate visibility of targets (multi-threaded jobs).
  6. Calculate targeting data (multi-threaded jobs).

After:

  1. Clear targeting arrays (multi-threaded jobs). Schedule batched jobs (runs the jobs in the background).
  2. Init variables (main-thread job).
  3. Clear targeting arrays .Complete() → finishes the clear-targeting-arrays jobs.
  4. Create targets for just-spawned objects (multi-threaded jobs).
  5. Calculate visibility of targets (multi-threaded jobs).
  6. Calculate targeting data (multi-threaded jobs).

Late update:

  7. Calculate frustum planes from the local-to-world matrix (multi-threaded jobs).

So what happens is that when the update starts, I immediately start clearing the targeting arrays in the background. When the main-thread work of initializing variables ends, I ask the C# job system to wait for the targeting-array jobs to complete, returning control of the arrays to the next jobs.

Just by doing this simple trick you can gain a good number of milliseconds and utilize the cores more efficiently.
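In code, the reordered update looks roughly like this. This is a simplified sketch, not the actual plugin source; ScheduleClearTargetingArrayJobs, InitVariables and the other method names are placeholders for the steps listed above:

    void Update()
    {
        // 1. Kick the clear-targeting-array jobs off first so the workers start immediately.
        var clearHandles = ScheduleClearTargetingArrayJobs();
        JobHandle.ScheduleBatchedJobs();   // flush the scheduled jobs to the worker threads

        // 2. Do the main-thread-only initialization while the clears run in the background.
        InitVariables();

        // 3. Only now wait for the clears, right before their arrays are needed again.
        TargetingSystemUtilities.FinishJobsFromHandles(clearHandles);

        // 4. Continue with the rest of the pipeline.
        ScheduleCreateTargetsForSpawnedObjects();
        ScheduleVisibilityJobs();
        ScheduleTargetingDataJobs();
    }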

You might wonder what happened to the camera frustum plane processing.
I cache the planes’ local-space normals and distances every time the user changes the camera settings, and then in a job I recalculate the world-space planes with the help of the local-to-world matrix.
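A minimal sketch of that job, assuming a rigid camera transform (rotation + translation, no scale); the field names here are mine, only the cached local-space plane data and the local-to-world matrix matter:

    using Unity.Burst;
    using Unity.Collections;
    using Unity.Jobs;
    using Unity.Mathematics;

    [BurstCompile]
    struct RecalculateFrustumPlanesJob : IJob
    {
        public float4x4 localToWorld;                        // camera's local-to-world matrix
        [ReadOnly] public NativeArray<float3> localNormals;  // cached plane normals in camera space
        [ReadOnly] public NativeArray<float> localDistances; // cached plane distances in camera space
        public NativeArray<float4> worldPlanes;              // output: xyz = normal, w = distance

        public void Execute()
        {
            float3 translation = localToWorld.c3.xyz;
            for (int i = 0; i < localNormals.Length; i++)
            {
                // Rotate the cached normal into world space (valid for rotation + translation only).
                float3 n = math.rotate(localToWorld, localNormals[i]);
                // Plane equation: dot(n, x) + d = 0, so shift the distance by the translation.
                float d = localDistances[i] - math.dot(n, translation);
                worldPlanes[i] = new float4(n, d);
            }
        }
    }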

The lesson I learned here is that organizing how the work happens has quite a big effect on how the work is spread, but this was not enough. The jobs were scheduled, yet there was still a lot of empty space, and the jobs were scattered oddly across the threads.

I don’t have an image of it, but it looked like the following image.

4 Likes

[Profiler timeline screenshot: jobs scattered across the worker threads]
As you can see, the jobs are spread out and the work is not evenly divided across the threads.

Here is a pattern that I used to dodge this issue.

            foreach (var targetingSystem in targetingSystemsIn)
            {
                if (targetingSystem.EntityBasedProcessing == false)
                {
                    continue;
                }

                // One clear job per targeting system/query, so each query gets its own thread.
                EntityTargetingJobs.ClearTargetingDataAndCountKeysJob job = new EntityTargetingJobs.ClearTargetingDataAndCountKeysJob()
                {
                    targetChunkKeyCount = targetingSystem.TargetingSystemEntityMemory.ChunkKeysForTargetingCount,
                    chunkKeysForTargeting = targetingSystem.TargetingSystemEntityMemory.ChunkKeysForTargeting,
                    amountOfCurrentlySeenTargetsPerKey = targetingSystem.TargetingSystemEntityMemory.AmountOfCurrentlySeenTargetsPerKey,
                    zero = int4.zero,
                };

                // Schedule and remember the handle; completion happens later.
                JobHandle newJobHandle = job.Schedule();
                this.ClearTargetingArraysHandles.Add(newJobHandle);
            }

Then I finish the jobs at a convenient time with:

TargetingSystemUtilities.FinishJobsFromHandles(this.ClearTargetingArraysHandles);

Which is essentially:

        internal static void FinishJobsFromHandles(List<JobHandle> targetingArraysHandles)
        {
            foreach (var handle in targetingArraysHandles)
            {
                handle.Complete();
            }
            targetingArraysHandles.Clear();
        }

What happens here is that I collect all the scheduled job handles into a list and kick off the scheduled jobs with Unity’s:

JobHandle.ScheduleBatchedJobs();

When the next jobs need the arrays, I call the finish-jobs method, which forces the jobs to complete.

Final words on the optimization.
One big mistake I made was trying to optimize things that are already optimized in the final build. For example, the profiler might show that the getter of some property takes a lot of main-thread time, but after some googling I learned that the final build actually optimizes a simple property like that away, so it is basically just a field access (super fast).

So the profiler shows a lot of these kinds of performance bottlenecks that should not be hand-optimized, since they are already optimized away in the final build by the compiler.

The other mistake I made, or at least something I’m not sure about, is the final cost in the build version of:

JobHandle.CombineDependencies(jobHandleA, jobHandleB);

I originally combined the job handles like that and then called Complete() on the merged job handle. In most cases, filling a list with job handles and then just iterating that list and calling Complete() on each handle resulted in much faster code in a managed environment like mine, at least based on the performance scenarios I ran.
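To make the comparison concrete, these are the two patterns I mean (sketch only; scheduledHandles is a placeholder for the list of handles collected during scheduling):

    // Option A: merge the handles and call Complete() once on the combined handle.
    JobHandle combined = JobHandle.CombineDependencies(jobHandleA, jobHandleB);
    combined.Complete();

    // Option B: keep the handles in a list and complete them one by one.
    // In my performance scenarios this was the faster option.
    foreach (JobHandle handle in scheduledHandles)
    {
        handle.Complete();
    }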

It would be nice to hear your thoughts on this and how you would handle cases like this.

3 Likes

I can’t do more than confirm your observation: the order of scheduling, but also where within a frame you schedule (or complete/check for completion), can have a significant impact on overall performance, sometimes with surprising results. For example, I had jobs scheduled in Update and checked for completion in LateUpdate. Then I observed how much I could scale it up before the jobs started to complete not within a single frame but in the next.

Then I thought: what if I start the jobs early using the PlayerLoopSystem, hooking a callback into the early update system (just after time was updated)? To my surprise (and perhaps I did something wrong but couldn’t see where or why), this started spilling into the next frame at a lower scale than scheduling in Update did. It didn’t scale as well.

But then it’s easy to over-optimize, and suddenly you get either compile errors due to dependencies, or worse, the behaviour changes because you have dependencies you thought you didn’t have, and due to NativeDisableThisOrThat attributes it proves difficult to debug.

2 Likes

In any case: it’s well worth checking the profiler timeline for thread congestion or idling. More often than not you can get a boost for free just by optimizing the batch loop count (though the results most likely vary on a per-system basis, so much so that it may be faster on your machine but slower on others with a different CPU core count or architecture, etc.).

2 Likes

I’m planning to run these tests on my old laptop next, as soon as I have time. I had a batch loop count of 1 on my jobs due to the nature of the algorithm. One of the reasons is that not all the jobs could be run as IJobParallelFor. For example, the algorithm divides the targets into chunks and the chunks are processed on separate threads. This way the system scales best with more queries. This was one of the requirements, because I wouldn’t be able to use the plugin for my own purposes if I were limited to a few queries. So I made it scale best with the entity query count.

I think I’ll try running more tests with the batch loop count in play mode and see how they differ. But for me the batch loop count of 1 was the best option.

1 Like

One other thing I learned is that you can wait on a job in a coroutine; that way you can have longer-running jobs. Or just skip frames and call Complete() on the job handle after a few frames, with the good old modulo operator.

if (frameCount % 4 == 0)
{
    jobHandle.Complete();
    frameCount = 0;
}
frameCount++;

or something like that
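For the coroutine variant, a minimal sketch could look like this, polling the handle each frame instead of blocking (the class and field names are just placeholders):

    using System.Collections;
    using Unity.Jobs;
    using UnityEngine;

    public class LongRunningJobExample : MonoBehaviour
    {
        JobHandle jobHandle;

        IEnumerator WaitForJob()
        {
            // Let the job run across frames; only sync once it reports completion.
            while (!jobHandle.IsCompleted)
            {
                yield return null;
            }
            // Complete() is still required to sync and release the safety handles.
            jobHandle.Complete();
        }
    }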

1 Like

OK, one last tip and trick from me; I have not implemented this one yet.
Let’s take a look at this performance benchmark.


The first row shows 1 system with 1 entity query component/filter.
The second set shows what happens when the 1 system only sees a small part of the entities.

For tightly packed entities the results are terrifying, but when the 1 system sees only some of the entities, the optimizations kick in.

Here is the profiler image of the worst-case scenario.


From this we can see that there are parts of the timeline that just utilize 1 core.

When 1 system only sees a small part of the million targets, the situation is a lot better. A noteworthy thing is that even if your player’s camera sees all the entities, you wouldn’t normally want to configure the system to see all those millions of targets. It’s just not needed to get the closest target based on the view direction.

OK, so how would this be optimized?
The targeting system is made to scale with the number of systems, but in a way that it can potentially handle hundreds of thousands of entities.

Based on the code, the only real bottleneck we have control over is this part.
[Profiler screenshot: visibility processing running on a single core]
This part handles the visibility processing, and since there is only one chunk and query, it happens on 1 core.

Here the only option is to check the number of systems, and if there are only a few, raise the inner batch count of the visibility processing, so that instead of scaling with the number of systems it scales with the number of entities.
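As a sketch, that switch could look something like this; the threshold, helper name and the batches-per-worker heuristic are made up, the point is only that the innerloopBatchCount is chosen based on the system count:

    using Unity.Jobs;
    using Unity.Jobs.LowLevel.Unsafe;
    using Unity.Mathematics;

    static class VisibilityScheduling
    {
        // Hypothetical helper: with only a few systems, spread one system's visibility
        // work across the workers; with many systems, keep the batch count at 1 so
        // each system/query stays on its own thread.
        public static JobHandle ScheduleScaled<T>(T job, int targetCount, int systemCount)
            where T : struct, IJobParallelFor
        {
            int batchCount = 1;
            if (systemCount <= 2)
            {
                // Aim for a handful of batches per worker thread.
                batchCount = math.max(1, targetCount / (JobsUtility.JobWorkerCount * 4));
            }
            return job.Schedule(targetCount, batchCount);
        }
    }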

This, however, is likely not needed in most cases, unless you have a strategy game with millions of enemies and the system is attached to the mouse.

The last single-threaded line in the second image is the target sort. This cannot easily be optimized, since a sort operation cannot normally be divided across multiple cores, unless someone has made a multi-threaded merge sort that actually outperforms the standard Unity sort algorithm running on a separate thread.

That’s it, hopefully someone benefits from my findings.

2 Likes

While I generally agree with the overall message to pay attention to the way things are scheduled, I’m not sure I totally follow the specifics of your optimization, and there are a few things that smell.

First, JobHandle.CompleteAll() is probably what you want instead of a for loop or JobHandle.CombineDependencies().
Second, based on the massive spike for a single entity query, it seems like you have an O(m * n) algorithm. You might want to try to implement something with better algorithmic complexity, something like O((m + n) log(m + n)).
And third, here are my quick tips on the topic:

  1. Don’t optimize scheduling until you exceed your frame budget or are late into development. It is a fickle thing.
  2. Try to schedule heavier parallel jobs first, then schedule smaller jobs after. Small jobs can finish and let the worker threads idle before the main thread has the next jobs ready. If you schedule a heavy job first, the main thread has some time to build up a large job chain and keep the workers busy.
  3. Jobs that don’t depend on ECS data are usually free if scheduled correctly. Don’t just schedule them when convenient, but try to schedule them during sync points.
  4. With the exception of situations like (3), be wary of single-threaded jobs that take more than 0.15 milliseconds. Often dependency constraints make it difficult for the job system to find other jobs to run alongside of them.
3 Likes

Thanks for feedback and tips.

I think this is the right way to go in a lot of cases. I considered it at first, but then I started thinking: what if I, or someone else, have jobs that run over a few frames? Wouldn’t it force all of those jobs to complete as well?

OK, so I made a diagram. This is the best-case scenario for 1 system and how the system is meant to be used (the system has a limited field of view around the crosshair or mouse). At first it looked like O(n log n). Then I added tests for 2 million and 3 million targets and it started to lean towards O(n). This can still be optimized by moving the load of the single-threaded part of the code onto multiple threads; based on the timeline it would give around a 10-20% boost.

[Diagram: measured time vs. target count for 1 system]
Here is the performance test report. The naming is a little messy.
[Performance test report screenshots]

1 Like

So in your experience, is it good to schedule a massive single-threaded clear-array job in LateUpdate and finish it just before it is needed? Assuming nobody calls complete-all on the jobs. In some cases it can give 1 ms more, in some cases barely anything, and in some cases 0.3 ms, but still free milliseconds.

Edit: Nah, the gain is so small now that I do more profiling, and it is definitely not as clean.

1 Like

You are asking questions and sharing profiling stats for code I can’t see. And you use the terms “system” and “query” so loosely that it is difficult for me to follow what your algorithm actually does. I really cannot provide good insight.

Edit: Forgot to mention that JobHandle.CompleteAll() takes a NativeArray of JobHandles, so you choose which collection of handles to complete. It is faster than completing one-by-one because Unity can batch-reason about the handles. And it is faster than CombineDependencies because Unity doesn’t need to create a new handle in the graph, but rather just finish off and remove the existing handles.

4 Likes

This is an awesome tip, thank you!

As for the optimization: it’s a boids-like algorithm (like the Unity fish demo), so it divides targets into chunks/areas and each system checks whether it sees that area. My biggest bottleneck is related to not being able to multi-thread the array sort across many cores.

The other big bottleneck is clearing the native multi hash map. For the array sort I’m researching multi-threading it with a merge sort. These issues become more apparent with millions of entities, so it’s not on the priority list, more like the fun-to-do list.

1 Like


So I converted the lists to native lists, used JobHandle.CompleteAll(list), and got a 2-3 ms boost in the editor, in some cases.
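For reference, the change is roughly this shape (sketch only; the handle list is now a NativeList<JobHandle> instead of a List<JobHandle>):

    using Unity.Collections;
    using Unity.Jobs;

    NativeList<JobHandle> clearTargetingArraysHandles =
        new NativeList<JobHandle>(Allocator.Persistent);

    // Scheduling stays the same: collect each handle as the jobs are scheduled.
    // clearTargetingArraysHandles.Add(job.Schedule());

    // Completing: one batched call instead of completing the handles one by one.
    JobHandle.CompleteAll(clearTargetingArraysHandles.AsArray());
    clearTargetingArraysHandles.Clear();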

By the way, I recently posted a custom yield instruction on this forum that waits on a job handle. I believe I called it WaitForJobCompleted, as in:

yield return new WaitForJobCompleted(handle);
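From memory, it boils down to something like this sketch (probably not the exact original code):

    using Unity.Jobs;
    using UnityEngine;

    public class WaitForJobCompleted : CustomYieldInstruction
    {
        JobHandle handle;

        public WaitForJobCompleted(JobHandle handle) => this.handle = handle;

        public override bool keepWaiting
        {
            get
            {
                // Keep yielding while the job is still running on the workers.
                if (!handle.IsCompleted)
                    return true;

                // Complete() is still needed to sync and release the safety handles.
                handle.Complete();
                return false;
            }
        }
    }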
1 Like

For non-critical background jobs or something similar it’s perfect. I have been thinking about something similar. It’s quite good since you don’t have to block waiting for the job to complete.
The downside shows up with entities that die instantly: for example, the job takes 4 frames to complete, but the entity dies before that. Also, during those 4 frames it’s not a good idea to access the contents of that job.

I haven’t tested it in production, so it’s hard to say if it actually works (why wouldn’t it?).