I am using a 32-core CPU, and when running a simple scene with DOTS Physics and converted rigid bodies, the performance is much lower than the same scene set up using GameObjects.
I asked on Discord and a user suggested turning down the worker threads, because performance can be poor on a CPU with a high core count.
When I turn down the available worker threads (using JobsUtility.JobWorkerCount = workerCount), performance goes up to what I would expect.
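For reference, this is roughly how I'm clamping the worker count at startup. A minimal sketch assuming a MonoBehaviour on a bootstrap object; the WorkerCountLimiter name and the value of 2 are just what I happened to use in my test.

```csharp
using Unity.Jobs.LowLevel.Unsafe;
using UnityEngine;

// Hypothetical bootstrap component used for the test; the name is a placeholder.
public class WorkerCountLimiter : MonoBehaviour
{
    // Number of job worker threads to allow; 2 is simply the value that measured best here.
    [SerializeField] int workerCount = 2;

    void Awake()
    {
        // JobWorkerMaximumCount is the upper limit the job system was created with,
        // so clamp the requested value into the valid range before applying it.
        JobsUtility.JobWorkerCount =
            Mathf.Clamp(workerCount, 1, JobsUtility.JobWorkerMaximumCount);
    }
}
```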
I tested 200 cubes with rigid bodies and found the following results:
- 2 worker threads: 178 fps
- 62 worker threads: 94 fps
I did some further testing and found that increasing the thread count did help significantly when there was a high volume of collisions, but when collisions aren’t happening the fps with 62 threads stays very low relative to 1 or 2 threads.
Looking at the profiler, it seemed the job system was trying to split everything across all cores even when not much was going on, which would explain why performance is so much lower with a very high thread count.
It also seems that the scheduler currently doesn’t account for context-switching costs when running jobs, which creates a large delay when distributing simple tasks across threads.
Is there any plan to address this issue? I feel like the whole concept of DOTS is to prevent inefficiencies, so I was surprised that it is performing tasks that are adding overhead.
Is the user expected to manage the worker count themselves?
It would be nice to be able to cap the maximum number of workers for a specific job, or to have it scale with the number of jobs.
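For context, the closest control I'm aware of today is the batch count argument on parallel jobs, which only affects how finely the work is chopped into batches, not how many workers can pick it up. A minimal sketch; the ScaleJob/BatchCountExample names and the batch size of 256 are just illustrative.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

[BurstCompile]
struct ScaleJob : IJobParallelFor
{
    public NativeArray<float> Values;

    public void Execute(int index) => Values[index] *= 2f;
}

static class BatchCountExample
{
    public static void Run(NativeArray<float> values)
    {
        // A large innerloopBatchCount produces fewer, bigger batches, so a small
        // workload isn't scattered across every worker thread. But there is still
        // no way to say "use at most N workers for this job".
        JobHandle handle = new ScaleJob { Values = values }.Schedule(values.Length, 256);
        handle.Complete();
    }
}
```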
Are you able to share profiler timeline screenshots with the different thread counts? There is an overhead to scheduling jobs across more threads, but not a 5 ms overhead. I suspect another issue may be at play here.
I also changed the thread count in the physics step, but that didn’t seem to have any impact when I was manually setting the count using JobsUtility.JobWorkerCount = workerCount.

@DreamingImLatios Yeah, here is the profiler for 300 cubes, which is similar; it ran at roughly 125 fps.
Thanks, but I asked for the Timeline view, not the Hierarchy view. Specifically I want to see the main thread and the job threads. Also, is this 2019.4, 2020.1, or 2020.2?
Awesome. That’s a lot more helpful. It looks like it is one of two issues, but I’m not sure which is the culprit yet as I would have to see what the scenario with only two worker threads looks like. Either the smaller number of worker threads is increasing your cache hit ratios between jobs or the job scheduling is absolutely choking in StepPhysicsWorld. The latter would be Unity’s problem. The former is a bit more difficult to deal with and has plagued some of my own projects.
There is a common misconception that full core utilization is a goal. I think it just comes from so many years of only having very few cores on most rigs, combined with game engines historically not leveraging concurrency well.
DOTS feature level stuff just uses settings that ensure you can actually leverage cores if you have them. There really isn’t a default setting that won’t end up confusing some group of users. If the defaults were set for what real games would use then people would complain about cores not being utilized (and would have to change the source to fix that). So I think what they have as a default is probably the best choice atm.
The ideal would probably be DOTS feature-level concurrency that is fully tunable, with the per-feature and global worker maximums defaulting to the high end of an average machine.
I don’t think people here are complaining about core utilization as such, but rather about Unity spreading every system’s work across as many worker threads as possible by default. It’s not users who do this. Of course, high core utilization is waste in cases like this, since we don’t get any actual gains from it. With the current state of the stock DOTS packages, this doesn’t give us “performance by default”. It’s not just the Unity Physics package either; the same thing definitely happens with Hybrid Renderer v2 as well.
The way I see it, the issue isn’t splitting the work across workers per se, but how much we split individual systems over them. Right now we have no control over this beyond the following: run on the main thread, schedule to run on a single worker thread, or schedule to run on as many workers as possible. And then there’s only the global cap on the workers, which is automatically set to your CPU’s hardware thread count minus one, and which you can (and currently have to) lower if you care about performance at all.
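To spell those three options out, inside a SystemBase the entire choice comes down to which call you end the lambda with. A minimal sketch against the Entities 0.x API; the SpinSystem name and the rotation work are placeholders.

```csharp
using Unity.Entities;
using Unity.Mathematics;
using Unity.Transforms;

// Placeholder system: the only scheduling knob is which of the three calls below we
// end with; none of them lets us cap how many workers a parallel schedule fans out to.
public class SpinSystem : SystemBase
{
    protected override void OnUpdate()
    {
        float dt = Time.DeltaTime;

        // 1) Run() executes on the main thread.
        // 2) Schedule() runs the whole thing on a single worker thread.
        // 3) ScheduleParallel() splits it across as many workers as the job system decides.
        Entities
            .ForEach((ref Rotation rot) =>
            {
                rot.Value = math.mul(rot.Value, quaternion.RotateY(dt));
            })
            .ScheduleParallel();
    }
}
```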