Job System not as fast as mine, why not?

When I heard about the job system, I started rewriting my code for parallel processing to be ready for it. This was also my first time writing something like this, and I figured the Job System would do it much better. Yet my code takes 60 ms to execute and the Job System takes 120 ms. I benchmarked with Development Build unchecked and outside the Unity editor. Both versions use two arrays: one is an int storage array for results (there's no telling how many results will come back, but in this case around 120k), and the other is a byte array of pixel data for a 4000x1500 image, so at 4 bytes per pixel that's a 24-million-element array.

In my code I use a LinkedList for the results; in the Job System version I just sized the NativeArray a little bigger than the 120k to test this performance. For the other array I use a plain byte array, since that's what the pixel data already is, and in the Job System version another NativeArray, this one read-only. I used a batch size of 64 for the Job System's Schedule call, which seemed to give about the best performance. The Execute code of the Job System took about 50 ms; the other 70 ms is spent on the two arrays, pretty much all of it on the pixel-data array.

In my code I can’t really benchmark without the array creation. I tried to follow what you guys said the Job System would do, so the two are somewhat similar. I split the work across however many processors the system has (in my case 6). Each processor gets a LinkedList to return as a result and a chunk of the 4000 pixel columns: 2 processors get 666 columns and the others get 667. Then I use System.Threading.Tasks.Parallel.For with the loop set to 6 iterations, one per processor. Dividing up the chunks keeps my code safe, and giving each processor its own results list avoids collisions. None of this dividing up is static; it’s just as flexible as the Job System and can handle any image size and processor count, so I pay that cost as well.
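The splitting described above (chunk sizes differing by at most one column) can be written compactly. This is a sketch of the arithmetic only; `SplitBounds` is a hypothetical helper name, not something from the project:

```csharp
// Hypothetical helper: divide `width` columns across `workers` threads so
// chunk sizes differ by at most one. bounds[i]..bounds[i+1] is worker i's range.
static int[] SplitBounds(int width, int workers)
{
    var bounds = new int[workers + 1];
    int baseSize = width / workers;
    int remainder = width % workers; // this many workers get one extra column
    for (int i = 0; i < workers; i++)
        bounds[i + 1] = bounds[i] + baseSize + (i < remainder ? 1 : 0);
    return bounds;
}
// SplitBounds(4000, 6) gives four chunks of 667 columns and two of 666,
// matching the 666/667 split described above.
```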

Other than array creation, the code is the same. Even the 50 ms Execute time comes awfully close to mine, and if I could subtract out my array creation I'd probably beat it.

Any thoughts on this?

  • read @dyox posts
  • wait for the official release of the jobs system

If you post the sample code in question, we will likely be able to give you better answers.

A couple general thoughts on what to expect:

  1. In the editor, NativeArray has debugging overhead compared to builtin arrays:
  • We detect race conditions
  • In IJobParallelFor we detect writes to out-of-range indices
  2. Mono's JIT has dedicated instructions for array access, so in Mono we can't quite match the speed of builtin array lookups. With IL2CPP, however, NativeArray is on par with or better than builtin arrays. We expect that these days most of our users ship the final game with IL2CPP for the best performance, so please measure with IL2CPP in a standalone player. (Also see the note below for the latest build with some optimizations that will make it into 2018.1.)

  3. The job scheduler in Unity has significantly less overhead. The best way to measure this is to schedule a bunch of empty jobs. Again, the editor has quite a bit of overhead due to race-condition detection, so it's important to measure in a standalone player. There are a few important things to measure:

  • GC allocations caused by scheduling a job. Our view is that keeping this at zero is critical to avoid GC collections later on. We do that; ParallelTasks very much does not
  • Cost to actually schedule + execute
  • Cost of actually running in harmony with other engine threads (reducing context-switch cost). The Unity job system uses the same job system as engine code, allowing for greater integration and no context-switch cost
  4. Ultimately, neither Mono nor IL2CPP performance really matters. The compiler we expect all users to use for C# jobs is Burst. It will NOT be available in 2018.1, but likely in 2018.2. Burst itself does not even know what a builtin array is. Essentially, Burst is a compiler dedicated to the problem of compiling C# jobs, and a specific subset of C#, for the absolute best performance you could hope for. For this reason we assume there are exactly no GC types in the code Burst executes; hence everything is native containers + structs. This is part of what enables the 5x-10x speedups we usually see with Burst vs Mono/IL2CPP. We also already generally beat C++ performance by a good margin with Burst.
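The "schedule a bunch of empty jobs" measurement mentioned above could look roughly like this. A minimal sketch of my own harness, not official code; the job count and logging are my own choices:

```csharp
// Sketch: measure pure scheduling overhead. The jobs do nothing, so the
// elapsed time is schedule + execute cost. Run in a standalone player,
// since editor safety checks add overhead.
using System.Diagnostics;
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

public class EmptyJobBenchmark : MonoBehaviour
{
    struct EmptyJob : IJob
    {
        public void Execute() { } // intentionally empty
    }

    void Start()
    {
        const int jobCount = 10000; // arbitrary; enough to amortize timer noise
        var handles = new NativeArray<JobHandle>(jobCount, Allocator.TempJob);

        var timer = Stopwatch.StartNew();
        for (int i = 0; i < jobCount; i++)
            handles[i] = new EmptyJob().Schedule();
        JobHandle.CompleteAll(handles);
        timer.Stop();

        UnityEngine.Debug.Log("Overhead for " + jobCount + " empty jobs: " + timer.Elapsed);
        handles.Dispose();
    }
}
```

The same loop over `System.Threading.Tasks.Task.Run` with empty delegates makes the GC-allocation difference visible in the profiler.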

You probably want to watch this for a more complete overview of what we are aiming at:

It would be great if you can share the specific benchmark you made so we can take a look.

Note on 2): these IL2CPP optimizations are not yet in the current beta. Here is a build from a branch that will soon make it into the official beta builds, so you can run the benchmark tests today:
https://beta.unity3d.com/download/966b48dc5f14/public_download.html
(This build has not gone through QA, so I don't recommend using it beyond benchmarking.)

15 Likes

So I extracted the code into an example project. Your job system performs pretty well: 27-28 ms every time. Mine varies oddly between 30, 60, and 90 ms. In the editor I get a fairly steady 45 ms with mine. Now for the weirder part: in my game, in a non-development standalone build, yours takes 120 ms and mine does something crazy like 400 ms, but in the editor mine gives me the 60 ms time. I know an example would help, but I'm just not seeing the issue reproduced in the example, except for the part where mine runs slower outside the editor.

Seems I need the Windows SDK to build with IL2CPP; I will try that.

Will all of my code benefit from Burst as well, or just what's utilizing the job system? I thought Burst would optimize my math, so I assumed all of my code would benefit.

How are we supposed to use the performance monitor if the job system is going to have huge safety-checking overhead in the editor? Is there a way to skip the safety overhead while using the editor?

Burst is a compiler specifically made for the C# Job System. It is built specifically to take advantage of all the restrictions we place on C# jobs anyway, to get incredible speedups. It cannot be used to run generic main-thread C# code or other code that is scheduled via .NET Tasks etc.

4 Likes

Okay, I'm not sure why, but I'm getting wildly varying results, and I've also somehow managed to make the standalone Job System time worse. I have created an example project. Here are my testing results.

In Editor:
Job System: ~400 ms
Mine: ~50 ms

Non Development Mono Build:
Job System: 120 ms
Mine: ~400 ms

Non Development IL2CPP 2018.1.0b9:
Job System: 250 ms
Mine: 40 ms

I have uploaded an entire project containing just the code in question.

Here are also just the code and the test image I used. If you set up your own project, you'll need to set the scripting runtime to .NET 4.x and restart Unity, make the image uncompressed at 4K with Read/Write enabled, and create a folder called Resources and put the image in it.
Code

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using Unity.Collections;
using Unity.Jobs;

public class Test : MonoBehaviour {

    System.Threading.Tasks.ParallelOptions options = new System.Threading.Tasks.ParallelOptions();
 
    void Start () {
        Time.timeScale = 0;

        options.MaxDegreeOfParallelism = System.Environment.ProcessorCount;

        Texture2D textureImage = GameObject.Instantiate(Resources.Load<Texture2D>("testimage"));

        byte[] spriteData = textureImage.GetRawTextureData();
        int width = textureImage.width;
        int height = textureImage.height;

        UnityEngine.UI.Text JobBenchText = GameObject.Find("JobBenchText").GetComponent<UnityEngine.UI.Text>();
        UnityEngine.UI.Text MyBenchText = GameObject.Find("MyBenchText").GetComponent<UnityEngine.UI.Text>();

        var Timer = new System.Diagnostics.Stopwatch();

        //Benchmark the Job System
        Timer.Start();
        var results = new NativeArray<int>(22000, Allocator.Persistent); //Using this as a phony results list
        var spriteDataNative = new NativeArray<byte>(spriteData, Allocator.TempJob); //TempJob: this array is handed off to a scheduled job

        var job = new DoJobSystemTest()
        {
            spriteData = spriteDataNative,
            results = results,
            width = width,
            height = height
        };

        JobHandle jobHandle = job.Schedule(width, 200);
        jobHandle.Complete();

        results.Dispose();
        spriteDataNative.Dispose();

        Timer.Stop();
        JobBenchText.text = Timer.Elapsed.ToString();

        //Benchmark My Parallel Processing
        Timer.Reset();
        Timer.Start();

        DoMyParallel(width, height, spriteData);

        Timer.Stop();
        MyBenchText.text = Timer.Elapsed.ToString();


        Time.timeScale = 1;
    }

    struct DoJobSystemTest : IJobParallelFor
    {
        [ReadOnly]
        public NativeArray<byte> spriteData;

        //[ReadOnly] is only meaningful on native containers, so the plain value-type fields don't need it
        public int width;
        public int height;

        public NativeArray<int> results;

        public void Execute(int x)
        {

            byte colorA;
            byte colorB;
            int index;

            for (int y = 0; y < height;)
            {
                index = (x + y * width) * 4 + 3;
                colorA = spriteData[index];
                if (colorA != 0)
                {
                    if (y + 1 < height)
                    {
                        colorB = spriteData[index + width * 4];
                        if (colorB == 0)
                        {
                            //No NativeList at this time
                            //results[cpu].AddLast(x);
                            //results[cpu].AddLast(y);
                            y += 2;
                            continue;
                        }
                    }

                    if (y - 1 > 0)
                    {
                        colorB = spriteData[index - width * 4];
                        if (colorB == 0)
                        {
                            //No NativeList at this time
                            //results[cpu].AddLast(x);
                            //results[cpu].AddLast(y);
                            y++;
                            continue;
                        }
                    }

                    if (x + 1 < width)
                    {
                        colorB = spriteData[index + 4];
                        if (colorB == 0)
                        {
                            //No NativeList at this time
                            //results[cpu].AddLast(x);
                            //results[cpu].AddLast(y);
                            y++;
                            continue;
                        }
                    }

                    if (x - 1 > 0)
                    {
                        colorB = spriteData[index - 4];
                        if (colorB == 0)
                        {
                            //No NativeList at this time
                            //results[cpu].AddLast(x);
                            //results[cpu].AddLast(y);
                            y++;
                            continue;
                        }
                    }

                    y++;
                    continue;
                }
                else
                {
                    y++;
                }
            }
        }
    }

    void DoMyParallel(int width, int height, byte[] spriteData)
    {
        LinkedList<int>[] results = new LinkedList<int>[System.Environment.ProcessorCount];

        //Used for splitting up the width of the image between processors
        int[] splitCount = new int[System.Environment.ProcessorCount + 1];

        float count = (float)width / System.Environment.ProcessorCount;

        //for an amount that doesn't divide evenly add the left overs to the other processors batch
        for (int i = 1; i < Mathf.Round((count - (int)count) * System.Environment.ProcessorCount) + 1; i++)
        {
            splitCount[i] = 1;
        }

        //initialize the results linkedlist for each processor and add the batch amount to all processors
        for (int i = 0; i < System.Environment.ProcessorCount; i++)
        {
            results[i] = new LinkedList<int>();
            splitCount[i + 1] += (int)count + splitCount[i];
        }



        System.Threading.Tasks.Parallel.For(0, System.Environment.ProcessorCount, options, cpu =>
        {
            byte colorA;
            byte colorB;
            int index;
            for (int x = splitCount[cpu]; x < splitCount[cpu + 1]; x++)
            {

                for (int y = 0; y < height;)
                {
                    index = (x + y * width) * 4 + 3;
                    colorA = spriteData[index];
                    if (colorA != 0)
                    {
                        if (y + 1 < height)
                        {
                            colorB = spriteData[index + width * 4];
                            if (colorB == 0)
                            {
                                //results[cpu].AddLast(x);
                                //results[cpu].AddLast(y);
                                y += 2;
                                continue;
                            }
                        }

                        if (y - 1 > 0)
                        {
                            colorB = spriteData[index - width * 4];
                            if (colorB == 0)
                            {
                                //results[cpu].AddLast(x);
                                //results[cpu].AddLast(y);
                                y++;
                                continue;
                            }
                        }

                        if (x + 1 < width)
                        {
                            colorB = spriteData[index + 4];
                            if (colorB == 0)
                            {
                                //results[cpu].AddLast(x);
                                //results[cpu].AddLast(y);
                                y++;
                                continue;
                            }
                        }

                        if (x - 1 > 0)
                        {
                            colorB = spriteData[index - 4];
                            if (colorB == 0)
                            {
                                //results[cpu].AddLast(x);
                                //results[cpu].AddLast(y);
                                y++;
                                continue;
                            }
                        }

                        y++;
                        continue;
                    }
                    else
                    {
                        y++;
                    }
                }
            }
        });

        //Process the results

        //LinkedListNode<int> node;
    
        for (int cpu=0; cpu < System.Environment.ProcessorCount; cpu++)
        {
            /*node = results[cpu].First;
            for (int i = 0; i < results[cpu].Count; i+=2)
            {
                DoStuff(node.Value, node.Next.Value);
                node = node.Next.Next;
            }*/
            results[cpu].Clear();
        }
    }

    // Update is called once per frame
    void Update () {
    
    }
}

3402526–267864–ParallelBenchmark.zip (58.4 KB)

1 Like

@TBbadmofo Thanks! We will take a look!

2 Likes

How does IL2CPP play with Burst-compiled code? What does Burst generate: managed or native binaries?

Burst produces machine code for the target hardware. It transforms a subset of .NET bytecode (the subset defined by the C# Job System, plus some more) into machine code.

We will have more information on Burst later on. We are not aiming to ship Burst as part of 2018.1; we are simply talking about it because I believe it's important to understanding the whole concept of C# jobs and all the restrictions we place on them. To a large extent, the restrictiveness of C# jobs comes from them being the same restrictions that allow Burst to produce machine code with such incredible performance gains.

3 Likes

Is it worth doing benchmarks to compare the various parallel/multi-threaded options available to developers and how they compare with various game related tasks/processes?

Then when Burst comes online, it can show off its performance advantage.

I don't want to derail this topic more with Burst talk; what would be the most appropriate forum section for Burst-related discussions? I couldn't find any obvious place to post. https://forum.unity.com/forums/experimental-scripting-previews.107/ seems the most suitable, but there isn't any preview build for Burst yet, so it's a bit out of place there as well.

Curious: is Burst similar in design to LLVM, or is it purely JIT optimizations?

Burst transforms a subset of .NET bytecode → machine code.

From the above, it sounds like it is more of a compiler/assembler type of technology.

Of course, but knowing whether it's designed more like LLVM/.NET Native or is purely a JIT thing gives more insight into the overall direction Unity is taking in this area. There are different approaches to the problem.

OK, good point: is Burst a native compiler or a JIT compiler? It sounds like a native compiler, IMHO.

Burst uses LLVM as part of its compiler stack, with additional optimizations on top of what LLVM provides.

The integration into the editor is done as a JIT on a per-job basis, meaning we never interrupt your workflow waiting for jobs to compile. (As you would expect, compiling a large job and applying all optimization passes can take multiple seconds.) We also cache compiled jobs between script changes if nothing in the change affected the compilation.

6 Likes

@TBbadmofo

I had a look at the sample you provided today. I made all of the buffer creation static, since it is irrelevant to the case you presented and just created noise.

The numbers you provided are correct, and the reason for the performance difference lies in the way you implemented your sample. In this particular case, Parallel.For ends up executing differently from what IJobParallelFor does.

We will provide IJobParallelForBatch soon that handles this specific case. When I measure the Parallel.For and IJobParallelForBatch codepaths against each other, we get roughly equal execution time.
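A hedged sketch of what a batch version of the job might look like. IJobParallelForBatch is not in the current beta, so the names below follow the signature that later shipped in the Unity.Collections package (Execute receives a start index and a count); treat this as an illustration of the idea, not the final API:

```csharp
// Illustration only: one Execute call handles a contiguous run of columns,
// mirroring one Parallel.For chunk from the original sample.
struct DoBatchTest : IJobParallelForBatch
{
    [ReadOnly] public NativeArray<byte> spriteData;
    public int width;
    public int height;

    public void Execute(int startIndex, int count)
    {
        for (int x = startIndex; x < startIndex + count; x++)
        {
            for (int y = 0; y < height; y++)
            {
                int index = (x + y * width) * 4 + 3;
                // ...same neighbour checks as DoJobSystemTest.Execute...
            }
        }
    }
}

// Scheduling (assumed signature): one batch per worker-sized chunk of columns.
// var handle = job.ScheduleBatch(width, width / System.Environment.ProcessorCount);
// handle.Complete();
```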

For clarity, the tests were run without Burst enabled.

I will post the updated code in a bit.

6 Likes

I think even rough benchmarks, or just guidelines, would be valuable for a number of reasons. Right now people mostly use coroutines for everything task/job related.

Once Unity C# jobs and .NET async/await (and therefore Tasks) become widely available in Unity 2018.1, users will need guidance about which tool is appropriate for which situation. (Myself included; e.g., are there scenarios where writing new code with coroutines is still good practice once 2018.1 lands?)

I don't think there ever was any good reason to use coroutines, except for https://docs.unity3d.com/Manual/BestPracticeUnderstandingPerformanceInUnity3.html

But trying to show people why coroutines were bad was a futile gesture on my part. That's probably why so many Unity games run smoothly, then pause to clean up garbage and so on, leading to a bad rep. It's best not to use them unless you're willing to manage the memory behind them (IMHO).

3 Likes