Parallel jobs take the same total time as (or longer than) a single job??

I just noticed that some of the jobs in my game run slower as an IJobParallelFor than as a single-threaded IJob. I first thought my code might be parallelization-unfriendly, or the workload too small for the benefits to kick in. But I made a little test with a giant array and the simplest possible code, where you’d think the benefits would be greatest, and the same thing happens there. I tried turning off leak checks and safety checks; the ratio stays the same.

Here is how my profiler looks (tested in a standalone build with the profiler attached):

As you can see, they both take about the same total time. It doesn’t make any sense.

Here is my test code:

Code

using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

public class JobTest : MonoBehaviour
{
    [BurstCompile, NoAlias]
    public struct ArrayJob : IJob
    {
        public NativeArray<float> floats;

        public void Execute()
        {
            int ln = floats.Length;
            for (int i = 0; i < ln; i++)
            {
                floats[i] += 0.34f;
            }
        }
    }

    [BurstCompile, NoAlias]
    public struct ArrayParallelJob : IJobParallelFor
    {
        public NativeArray<float> floats;

        public void Execute(int i)
        {
            floats[i] += 0.34f;
        }
    }

    [BurstCompile]
    public struct ArraySliceJob : IJob
    {
        public NativeSlice<float> floats;

        public void Execute()
        {
            for (int i = 0; i < floats.Length; i++)
            {
                floats[i] += 0.34f;
            }
        }
    }

    NativeArray<float> floats;

    int LENGTH = 16777216;

    void Start()
    {
        floats = new NativeArray<float>(LENGTH, Allocator.Persistent);
    }

    private void OnDestroy()
    {
        floats.Dispose();
    }

    void Update()
    {
        if (Input.GetKey(KeyCode.Alpha1))
            new ArrayParallelJob() { floats = floats }.Schedule(floats.Length, 8).Complete();

        if (Input.GetKey(KeyCode.Alpha2))
            new ArrayJob() { floats = floats }.Schedule().Complete();

#if SLICE
        JobHandle handle = new JobHandle();
        for (int i = 0; i < 8; i++)
        {
            handle = JobHandle.CombineDependencies(handle,
                new ArraySliceJob() { floats = floats.Slice(LENGTH / 8 * i, LENGTH / 8) }.Schedule());
        }
        handle.Complete();
#endif
    }
}

As you can see in the code sample, I also tried manually parallelizing with slices. A single slice works fine, but it looks like running these jobs in parallel is not possible: even non-overlapping slices are not allowed to run concurrently (I thought that was the point of slices).
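For what it's worth, one possible workaround (a sketch, not something verified against this exact setup) is to opt the slice field out of the container safety checks with `[NativeDisableContainerSafetyRestriction]`. This disables the very protection that flags overlapping writes, so it is only safe when you can guarantee the slices genuinely don't overlap:

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Jobs;

[BurstCompile]
public struct UnsafeSliceJob : IJob
{
    // Opting out of the container safety restriction lets jobs over
    // different slices of the same NativeArray be scheduled concurrently,
    // but the safety system can no longer catch genuine races.
    [NativeDisableContainerSafetyRestriction]
    public NativeSlice<float> floats;

    public void Execute()
    {
        for (int i = 0; i < floats.Length; i++)
            floats[i] += 0.34f;
    }
}
```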

Is there something I’m doing wrong? I really need to process big arrays, so this is crucial for me.

Running in Unity 2019.3.0f6 with Jobs package version 0.2.5.

Your batch size is way too small for such simple work. Try bumping up the parallel job’s batch size to something like 256.
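In code, that suggestion is just the second argument to `Schedule` (using the same job and field names as your test):

```csharp
// Same parallel job as before, but each worker thread now grabs
// 256 elements at a time instead of 8, so far less time is spent
// in the scheduler relative to the (very cheap) per-element work.
new ArrayParallelJob() { floats = floats }
    .Schedule(floats.Length, 256)
    .Complete();
```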

Ah, ok, I overlooked that.

(screenshot of profiler timings, captured 2020-02-29)

The columns are: single-threaded, then parallel with batch sizes 8, 16, 32, 64, 128, 256, 512, 1024.

I tried a few values; it plateaus around a batch size of 32–64. Still, it only gets 2x faster, nowhere near the 8x I’d expect with 8 threads.

I don’t really understand the batch count. The docs are kind of vague about it. Is it just something you have to test to see where it’s most efficient? And will the batch count plateau at the same point on all CPUs?


Most CPUs with 8 threads actually only have 4 cores and handle two threads per core. They do that so that when one thread stalls waiting on memory, the other can keep the core busy. So I would expect a 4x speedup in your case; I’m not sure why you’re only seeing 2x. I’d have to look at each version in the Burst Inspector to see whether the parallel one is struggling to vectorize something.

The batch count specifies how many items a thread processes before it goes looking for more work. Small batch counts let threads more aggressively “steal” work from other threads, but there’s overhead in that looking and stealing. Where the plateau lands can vary a little between CPUs, but generally not by much. You want the highest batch count you can get away with while all the threads still finish the job at nearly the same time.
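One way to find that plateau empirically (a rough sketch using the job struct from the original post; `Stopwatch` timing is approximate, the profiler is more reliable):

```csharp
// Time the same parallel job at several batch sizes and log the results,
// to see where increasing the batch size stops helping on this CPU.
int[] batchSizes = { 8, 32, 64, 256, 1024 };
foreach (int batch in batchSizes)
{
    var sw = System.Diagnostics.Stopwatch.StartNew();
    new ArrayParallelJob() { floats = floats }
        .Schedule(floats.Length, batch)
        .Complete();
    sw.Stop();
    Debug.Log($"batch {batch}: {sw.Elapsed.TotalMilliseconds:F2} ms");
}
```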


Now, this is silly. I turned off Burst safety and leak checks (and also tried a build) and ran the same test as above. The single-threaded job and every batch count now give me equal times. It seems the safety checks in the editor were slowing down the jobs with lower batch counts.

Huh. You might be memory-bound, which would explain why the work-stealing logic (which is ALU-heavy but not memory-heavy) costs little if anything. It would also explain why the parallel jobs aren’t scaling well.
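A quick back-of-envelope check supports this (the bandwidth figure is an assumption, not a measurement): the test array is 16,777,216 floats = 64 MB, and each pass reads and writes every element, so roughly 128 MB of memory traffic per run. At an assumed ~25 GB/s of sustained DDR4 bandwidth that puts a floor of about 5 ms on the pass, no matter how many cores run it:

```csharp
// Hypothetical memory-bandwidth floor for the test above.
long bytesMoved = 16_777_216L * sizeof(float) * 2; // read + write every element
double assumedBandwidthGBs = 25.0;                 // rough DDR4 figure (assumption)
double floorMs = bytesMoved / (assumedBandwidthGBs * 1e9) * 1000.0;
Debug.Log($"memory-bound floor ≈ {floorMs:F1} ms");
```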


Yes, indeed — I forgot that it has 4 cores and 8 threads.

Just want to say that in the meantime I tested splitting the big array into 8 small arrays, and I’m getting identical total job times. With the necessary copies, this ends up slower overall than IJobParallelFor. So I assume it’s just my CPU’s limitation that the parallel version takes roughly the same time as a single job.
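For anyone reading later, the manual-split test described above might look roughly like this (a sketch; `chunks` is an assumed array of eight pre-filled `NativeArray<float>` pieces, reusing `ArrayJob` from the original post):

```csharp
// Schedule one independent ArrayJob per chunk, then join all handles
// so the eight jobs can run concurrently before we wait on them.
var handles = new NativeArray<JobHandle>(8, Allocator.Temp);
for (int i = 0; i < 8; i++)
    handles[i] = new ArrayJob { floats = chunks[i] }.Schedule();
JobHandle.CombineDependencies(handles).Complete();
handles.Dispose();
```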