Why am I not seeing a performance increase with the Job System?

So I have this script that is meant to be attached to 1000+ GameObjects in the scene. The script moves the GameObject it is attached to, up by 1 unit. In the script I have two methods - UpdateCubeWithJob and UpdateCube. UpdateCubeWithJob uses Jobs to move the GameObject up and UpdateCube moves the GameObject up without Jobs. Here’s the code:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.Jobs;
using Unity.Jobs;
using Unity.Collections;

public class CubeScript02 : MonoBehaviour {

    struct MoveJob : IJob
    {
        public NativeArray<Vector3> jPosition;
        public NativeArray<Vector3> jVelocity;
        public float deltaTime;

        public void Execute()
        {
            for(int i = 0; i < jPosition.Length; i++)
            {
                jPosition[i] += jVelocity[i] * deltaTime;
            }
        }
    }

    // Use this for initialization
    void Start () {
      
    }
  
    // Update is called once per frame
    void Update () {
      
        UpdateCubeWithJob();

        // UpdateCube();

    }

    void UpdateCubeWithJob()
    {
        NativeArray<Vector3> _jPosition = new NativeArray<Vector3>(1, Allocator.Persistent);
        NativeArray<Vector3> _jVelocity = new NativeArray<Vector3>(1, Allocator.Persistent);

        for(int i = 0; i < _jPosition.Length; i++)
        {
            _jPosition[i] = transform.localPosition;
            _jVelocity[i] = new Vector3(0f,1f,0f);
        }

        MoveJob moveJob = new MoveJob()
        {
            jPosition = _jPosition, jVelocity = _jVelocity, deltaTime = Time.deltaTime
        };

        JobHandle moveJobHandle = moveJob.Schedule();

        moveJobHandle.Complete();

        for(int i = 0; i < moveJob.jPosition.Length; i++)
        {
            transform.localPosition = moveJob.jPosition[i];
        }

        _jPosition.Dispose();
        _jVelocity.Dispose();
    }

    void UpdateCube()
    {
        transform.localPosition += new Vector3(0,1,0) * Time.deltaTime;
    }
}

When I use UpdateCubeWithJob, I see the following data in the Profiler:


On the other hand, when I use UpdateCube, I see the following data in the Profiler:

The blue bars are all from the script attached to the 1000+ GameObjects.

My question is, why am I seeing better performance without using Jobs? Am I doing something wrong?

Well, you just moved the execution of your calculations onto a different thread instead of the main thread. Meanwhile, the main thread is blocked because it waits for the calculation to finish. There is no parallelism inside your job!

Instead of doing all in just one thread, use IJobParallelFor for parallelism:

struct MoveJob : IJobParallelFor
{
    public NativeArray<Vector3> jPosition;
    public NativeArray<Vector3> jVelocity;
    public float deltaTime;

    public void Execute(int i)
    {
        jPosition[i] += jVelocity[i] * deltaTime;
    }
}

You can then just call

MoveJob moveJob = new MoveJob() {
    jPosition = _jPosition,
    jVelocity = _jVelocity,
    deltaTime = Time.deltaTime
};
JobHandle moveJobHandle = moveJob.Schedule(_jVelocity.Length, 1);
moveJobHandle.Complete();

This job type will automatically split the task across different worker threads. Increase the batch size (the second parameter of moveJob.Schedule) until there is no further performance benefit. Just trial and error.

I’ve tried using IJobParallelFor as well. Here are the results.
This is a script that Instantiates 10000 GameObjects, and then moves each GameObject up by 1 unit. Even here, I have two methods, one that moves the GameObjects up using Jobs (but in this case it’s IJobParallelFor) and another method that moves the GameObjects up without using Jobs. Here’s the code:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.Jobs;
using Unity.Jobs;
using Unity.Collections;

public class SomeScript04 : MonoBehaviour {

    public Vector2 GridSize;
    public List<GameObject> CubeBatchList;
    public GameObject CubePrefab;
    public int InnerLoopBatchCount = 0;

    struct ParallelMoveJob : IJobParallelFor
    {
        public NativeArray<Vector3> jPosition;
        public NativeArray<Vector3> jVelocity;
        public float deltaTime;

        public void Execute(int i)
        {
            jPosition[i] += jVelocity[i] * deltaTime;
        }
    }

    // Use this for initialization
    void Start () {
        CreateCubeBatch();
    }
  
    // Update is called once per frame
    void Update () {
      
        UpdateCubeBatchWithJob();

        // UpdateCubeBatch();

    }

    void UpdateCubeBatchWithJob()
    {
        NativeArray<Vector3> _jPosition = new NativeArray<Vector3>(CubeBatchList.Count, Allocator.Persistent);
        NativeArray<Vector3> _jVelocity = new NativeArray<Vector3>(CubeBatchList.Count, Allocator.Persistent);

        for(int i = 0; i < CubeBatchList.Count; i++)
        {
            _jPosition[i] = CubeBatchList[i].transform.localPosition;
            _jVelocity[i] = new Vector3(0f,1f,0f);
        }

        ParallelMoveJob parallelMoveJob = new ParallelMoveJob()
        {
            jPosition = _jPosition, jVelocity = _jVelocity, deltaTime = Time.deltaTime
        };

        JobHandle parallelMoveJobHandle = parallelMoveJob.Schedule(CubeBatchList.Count, InnerLoopBatchCount);

        parallelMoveJobHandle.Complete();

        for(int i = 0; i < CubeBatchList.Count; i++)
        {
            CubeBatchList[i].transform.localPosition = parallelMoveJob.jPosition[i];
        }

        _jPosition.Dispose();
        _jVelocity.Dispose();

    }

    void UpdateCubeBatch()
    {
        for(int i = 0; i < CubeBatchList.Count; i++)
        {
            CubeBatchList[i].transform.localPosition += new Vector3(0,1,0) * Time.deltaTime;
        }
    }

    void CreateCubeBatch()
    {
        CubeBatchList = new List<GameObject>();
        for(int i = 0; i < GridSize.x; i++)
        {
            for(int j = 0; j < GridSize.y; j++)
            {
                GameObject go = Instantiate(CubePrefab, Vector3.zero, Quaternion.identity);
                go.transform.localPosition = new Vector3(i, 0f, j);
                go.transform.parent = transform;
                CubeBatchList.Add(go);
            }
        }
    }
}

When I use UpdateCubeBatchWithJob, I see the following result in the Profiler:


When I use UpdateCubeBatch, I see the following result in the Profiler:

I’ve tried setting InnerLoopBatchCount to 1, 32, 128, 1024, and 4096.
Even here, I’m not seeing a performance gain. Am I looking at this the wrong way?

Calling Complete() immediately after Schedule() is not the idiomatic way to handle this. It’s probably forcing most, if not all, of the work to run on the main thread.

A more idiomatic way of doing this would be: you have a job that calculates the movement, whatever that is. Then you have a job that moves the transforms. The transform-move job has a dependency on the calculation job.

As for the flow: at the start of Update you complete any pending jobs, and after that is where you schedule new ones. That way you give your jobs a full frame to work before trying to complete them.
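A minimal sketch of that flow, reusing the MoveJob-style struct from earlier in the thread (the field and array names here are illustrative, not from the original code):

```csharp
// Sketch of the "complete last frame's job, then schedule this frame's" pattern.
JobHandle pendingHandle;

void Update()
{
    // Finish whatever was scheduled last frame (a default handle completes immediately).
    pendingHandle.Complete();

    // ... read back results / reuse the NativeArrays here ...

    // Schedule this frame's work; it can run on worker threads
    // while the rest of this frame (rendering, other scripts) proceeds.
    var job = new MoveJob
    {
        jPosition = _jPosition,   // illustrative: persistent NativeArrays kept as fields
        jVelocity = _jVelocity,
        deltaTime = Time.deltaTime
    };
    pendingHandle = job.Schedule(_jPosition.Length, 32);
}
```

The point is that Schedule() and Complete() bracket a full frame of other work, instead of back-to-back calls that leave the main thread idle.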


As others have said, it’s best to Schedule() early in a frame and Complete() later (in LateUpdate(), for example), and/or set things up with dependency handles. With ECS, the Complete() call is handled by the system and all you have to worry about is the scheduling setup via systems (though I see you’re not necessarily using the ECS framework).

On top of that, your job is not marked for Burst compilation. If you have the Burst package you can add [ComputeJobOptimization] as an attribute to your job struct and it will compile your job kernel with Burst optimizations.
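Applied to the MoveJob from earlier in the thread, that would look like this (note that in later Burst package versions this attribute was renamed to [BurstCompile]):

```csharp
// Marking the job struct for Burst compilation.
// [ComputeJobOptimization] was the attribute name in the early Burst previews;
// newer package versions renamed it to [BurstCompile].
[ComputeJobOptimization]
struct MoveJob : IJobParallelFor
{
    public NativeArray<Vector3> jPosition;
    public NativeArray<Vector3> jVelocity;
    public float deltaTime;

    public void Execute(int i)
    {
        jPosition[i] += jVelocity[i] * deltaTime;
    }
}
```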


Based on the very useful info I got from the replies, I made some changes to the code, and I want to share the results for whoever finds them helpful.
Before I post the code, I should mention that either there isn’t adequate info on IJobParallelForTransform, or I didn’t look hard enough. Either way, for reference, I looked through Stella Cannefax’s repo of Job System examples. Specifically this code: job-system-cookbook/Assets/Scripts/AccelerationParallelFor.cs at master · stella3d/job-system-cookbook · GitHub

So I changed the code so that now it’s using IJobParallelForTransform instead of IJobParallelFor. Here’s the code:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.Jobs;
using Unity.Jobs;
using Unity.Collections;
using Unity.Burst;

public class SomeScript04 : MonoBehaviour {

    public Vector2 GridSize;
    public List<GameObject> CubeBatchList;
    public GameObject CubePrefab;

    Transform[] cubeBatchTransforms;
    TransformAccessArray _jTransformAccessArray;
    ParallelTransformMoveJob myParallelTransformMoveJob;
    JobHandle myParallelTransformMoveJobHandle;

    [ComputeJobOptimization]
    struct ParallelTransformMoveJob : IJobParallelForTransform
    {
        public float deltaTime;

        public void Execute (int i, TransformAccess jTransform)
        {
            jTransform.position += new Vector3(0,1,0) * deltaTime;
        }
    }

    void OnEnable()
    {
        CreateCubeBatch();
       
    }

    // Use this for initialization
    void Start () {
       
    }
   
    // Update is called once per frame
    void Update () {
       
        UpdateCubeBatchWithJob();

        // UpdateCubeBatch();

    }

    void UpdateCubeBatchWithJob()
    {
        myParallelTransformMoveJob = new ParallelTransformMoveJob()
        {
            deltaTime = Time.deltaTime
        };
        myParallelTransformMoveJobHandle =
            myParallelTransformMoveJob.Schedule(_jTransformAccessArray);

    }

    void LateUpdate()
    {
        myParallelTransformMoveJobHandle.Complete();
    }

    void UpdateCubeBatch()
    {
        for(int i = 0; i < CubeBatchList.Count; i++)
        {
            CubeBatchList[i].transform.localPosition += new Vector3(0,1,0) * Time.deltaTime;
        }
    }

    void CreateCubeBatch()
    {
        CubeBatchList = new List<GameObject>();
        for(int i = 0; i < GridSize.x; i++)
        {
            for(int j = 0; j < GridSize.y; j++)
            {
                GameObject go = Instantiate(CubePrefab, Vector3.zero, Quaternion.identity);
                go.transform.localPosition = new Vector3(i, 0f, j);
                go.transform.parent = transform;
                CubeBatchList.Add(go);
            }
        }

        cubeBatchTransforms = new Transform[CubeBatchList.Count];
        for(int i = 0; i < cubeBatchTransforms.Length; i++)
        {
            cubeBatchTransforms[i] = CubeBatchList[i].transform;
        }

        _jTransformAccessArray = new TransformAccessArray(cubeBatchTransforms);
    }

    void OnDisable()
    {
        _jTransformAccessArray.Dispose();
    }
}

This time there are 160000 GameObjects being moved up by this one script, and I actually saw a performance gain! So when I use UpdateCubeBatch, I get the following results in the Profiler:


When I use UpdateCubeBatchWithJob without the ComputeJobOptimization attribute, I get the following results in the Profiler:

And when I use UpdateCubeBatchWithJob with the ComputeJobOptimization attribute, I get the following results in the Profiler:

Definitely seeing performance gains! I’m kinda starting to get an idea of how I should go about doing things when using the Job System. And I’m doing research on ECS too so hopefully there should be more performance gains. Thanks for the info everyone!


How much of a performance gain are you seeing? I can’t tell from the screenshots. I’m here, because I’m writing similar code, but not seeing any performance gains.

In the screenshots, it’s the little tooltip that says SomeScript:ParallelTransformMoveJob.

It goes from 68.40ms → 37.7ms → 14.31ms.


Here’s how to analyze the performance gain and read the graph…

Look at the first profiler graph. It says 68.40ms on the main thread. When the blue bar is done, it moves on to the green bar (bounding volume update), and then has to wait even further (the grey bar labelled WaitForJobGroupID).

In the 2nd graph, the blue bar on the main thread is shorter, about half of before at 37.79ms, and another blue bar of almost the same length appears on the worker thread. But that does not mean the main thread is working at the same time as the worker thread: you can see a grey WaitForJobGroupID bar below the blue bar on the main thread. That means the main-thread blue bar is not the real work; it is just waiting for the real work running on the worker thread to complete. So in the 2nd graph we managed to cut the time down by 68.40 - 37.79ms.

The 3rd graph reads the same way, but with even less time: 14.31ms. We saved 68.40 - 14.31ms.

In the 2nd and 3rd graphs the main thread is idle waiting for the worker thread. It would be possible to use that grey bar to do some “free work”, but in the OP’s code there is nothing more in Update or in LateUpdate, where we request the completion (even so, we still benefit from the gain).

If we added something costing less than or equal to 14.31ms after scheduling the job and before the job’s .Complete() in LateUpdate, it would be as if we got that work for free. Not to mention we still have 2 more idle worker threads around that time, though that depends on the machine.

Why are there 2 idle threads when he uses IJobParallelForTransform? One thread is surely “parallel to the main thread”, but the work should fill up the other threads too, unless he has only one transform’s worth of work to do.

I just encountered this problem in my own game. It is because he is manually new-ing the TransformAccessArray. Unlike a normal parallel-for job, where you specify the batch count at schedule time, you must specify the desired job count when creating the TransformAccessArray, or else I believe you get 1. So with 1000 transforms, passing 3 means we want 3 jobs for this TransformAccessArray, and each job gets roughly 333 transforms. If we don’t specify it, I believe we get 1 job with all 1000 transforms. If he did that, I think the time could go even lower than 14.31ms.
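As a sketch against the CreateCubeBatch code above, the change is just the second constructor argument (TransformAccessArray’s desiredJobCount parameter; -1, the default, leaves the splitting up to Unity). Using processorCount here is my own illustrative choice:

```csharp
// Creating the TransformAccessArray with an explicit desired job count,
// so the IJobParallelForTransform scheduled on it can be split across workers.
// cubeBatchTransforms is the Transform[] built in CreateCubeBatch above.
_jTransformAccessArray = new TransformAccessArray(
    cubeBatchTransforms,
    SystemInfo.processorCount); // desiredJobCount; ask for roughly one chunk per core
```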

Off topic, but if you use ECS and get an injected TransformAccessArray that ECS made for you, it will automatically be nicely threaded when you run a job on it. This is what I get from running a TAA job with a TAA I didn’t create myself. In this image it looks like the desired job count is 6.


Thanks for the detailed response. I think I’m going to abandon the Job System, at least for now. My game is a sandbox game, so the user may, for example, want to add 1000 cubes to the scene in random locations and make them rotate. That would seem like a perfect application for IJobParallelForTransform. But I’m simply not seeing any dramatic improvements when I use the Job System vs. just having an Update method on each cube that handles the rotation. 1000+ cubes (or other objects) in the scene at once is an edge case, so if I can’t even see much of an improvement in that scenario, it’s even less useful when the scene only has dozens or a couple hundred moving objects.

Furthermore, it just seems like there are a lot of drawbacks and difficulties when using the Job System or ECS. Object-oriented objects with their own MonoBehaviours are so much more convenient and intuitive. Perhaps it makes more sense for a new game written from scratch. My game is nearly completed, and it’s written entirely using the old-school object-oriented/MonoBehaviour approach.

I guess I’m not sure where something like IJobParallelForTransform would be very useful, other than in demos (with 10,000 moving spheres or whatever) used to demonstrate how the Job System works. Most games don’t have thousands of the same objects doing the same thing.

Perhaps I’ll re-visit this at some point.

You do need actual use cases where you could benefit. But in my experience it’s the accumulation of applying jobs to lots of smaller things and maybe one or two medium things that makes the difference. You won’t find any one huge single thing in most games that if you jobify your problems are solved. If only it was that easy.

I jobify stuff to save half a ms in a flash. It adds up. Saving half a ms a dozen times, that’s 6ms. That’s a significant chunk of the 16ms budget you have to hit 60fps. Figure that at least 5ms or so of that 16 is reserved for stuff out of your control like rendering or third party assets, it makes it even more valuable.


Since you stated your game is almost finished, the hardest work and heavy lifting is already coded. To jobify it, you would probably need significant rework, which supports the point that it is easier with a new game. But it’s not impossible with an existing / almost-finished one. The question is: do you really need it?

Most games don’t. But bear in mind, this is an opportunity not only to jobify or parallelize the game objects themselves, but also a number of systems / subsystems, i.e. AI behaviors, or some specific algorithms.
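As a hedged sketch of what jobifying such a subsystem might look like (this is illustrative, not from any game in this thread: a batched distance query of the kind an AI system might run every frame):

```csharp
// Illustrative only: a batched "distance to target" calculation, the kind of
// small subsystem work (AI range checks, queries) that can move off the main thread.
[ComputeJobOptimization]
struct DistanceJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<Vector3> positions; // agent positions, filled on the main thread
    public Vector3 target;
    public NativeArray<float> distances;              // results, read back after Complete()

    public void Execute(int i)
    {
        distances[i] = Vector3.Distance(positions[i], target);
    }
}
```

Each such job might only save a fraction of a millisecond, but several of them together is where the payoff shows up.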

Could not resist responding to this. I think if you wrote that on forums like Stack Overflow or similar, you would be trashed in the blink of an eye, since you would be accused of premature optimization. :slight_smile:

But I fully agree with you that small things should not be overlooked. And no one really mentions that learning early how to optimize things pays off greatly in the future. Otherwise you end up with products that perform poorly for lack of optimization, and by then it is often too late, or too much hassle, to even work out where to start optimizing.

I assumed it was obvious that the half a ms mattered. The larger point was that it’s the little things that add up. I’d say most 3D games are CPU-bound; it’s the first bottleneck you normally hit. It’s rare that you wouldn’t be optimizing for half a ms in a 3D game of any complexity.

Not sure anyone has mentioned it in this thread (I did not see it in my quick read-through), but to get parallel performance with IJobParallelForTransform the transforms need to have separate parents.

Any transform with the same parent will execute in the same thread, so if all your transforms have the same parent you’re not going to get any threading performance gains.
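In the cube example earlier in the thread, one possible tweak (a sketch against that CreateCubeBatch code, with behaviour otherwise unchanged) would be to skip the parenting step:

```csharp
// Sketch: leave each cube at the scene root instead of parenting it under this
// transform, so IJobParallelForTransform can spread the transforms across workers.
GameObject go = Instantiate(CubePrefab, Vector3.zero, Quaternion.identity);
go.transform.localPosition = new Vector3(i, 0f, j);
// go.transform.parent = transform;  // removed: same-parent transforms all run on one thread
CubeBatchList.Add(go);
```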


In that case, I presume it is better to have no transform parents at all, if feasible.

From the many forum readings I did, I assume many actually do parent their objects, and some even promote it, which is sad.
Sometimes it may be better not to read :wink:

That applies regardless of whether you are using ECS or the Job System manually, as it affects the internal Unity systems too.

Did you watch the recent Unite Berlin 2018? One section in this talk is directly talking about this.

https://www.youtube.com/watch?v=W45-fsnPhJY


So my game used to parent game objects just “for fun/neatness”, where the parents were useless and all had identity transforms. After watching this talk, I have moved them all to the scene root, with no change in game behaviour at all (at least for objects grouped by dynamically-created scene at runtime).


Why are you using a persistent allocator that is disposed at the end?

NativeArray<Vector3> _jPosition = new NativeArray<Vector3>(CubeBatchList.Count, Allocator.Persistent);
NativeArray<Vector3> _jVelocity = new NativeArray<Vector3>(CubeBatchList.Count, Allocator.Persistent);

You should use a temporary allocation (Allocator.TempJob for arrays that only live for the duration of a job). If I remember correctly, temporary allocations are faster.
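A sketch of that change against the UpdateCubeBatchWithJob method above (Allocator.TempJob is intended for allocations disposed within a few frames, which fits these per-frame arrays):

```csharp
// Sketch: per-frame arrays that are disposed the same frame can use TempJob,
// which allocates from a faster pool than Persistent.
NativeArray<Vector3> _jPosition = new NativeArray<Vector3>(CubeBatchList.Count, Allocator.TempJob);
NativeArray<Vector3> _jVelocity = new NativeArray<Vector3>(CubeBatchList.Count, Allocator.TempJob);
// ... fill the arrays, schedule the job, Complete() it, copy results back ...
_jPosition.Dispose();
_jVelocity.Dispose();
```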

5argon
This is really interesting. Especially the part where it was mentioned that ECS was in mind at least since Unity 5.4.
So it makes sense that Unity is structuring transforms for eventual async processing.

The game is a sandbox game that lets users attach scripts to objects (like rotate, move, animate, spawner, etc.), and I think there are likely a few that could benefit from jobs, but I’d need to go through a coding hassle for each of them. As a test, I attempted to jobify the rotation script, but that script is only one of 50 or so, so it’s only a small part of the game and not really worth the effort. It would be different for sure if the core of the game involved something that could take advantage of the Job System. I do use an Asset Store voxel terrain system that I’m sure could benefit a lot from jobs during the creation process, but I don’t know the code well enough to make the change myself. Hoping the dev will get to this! I’ll leave this alone for now; I have enough other work to do!