Best strategy to spread burst job load across multiple workers?

I have a scenario where I am creating a heightfield fairly frequently, which can mean processing anywhere from thousands up to hundreds of thousands of verts depending on the heightfield resolution. Burst does a great job at processing this, but as the number of verts to process increases it's obvious that there are plenty of idle job workers itching to get in on the action. You can see the two screenshots attached, one for a lower resolution (20k-ish verts) and one for a higher resolution (70k-ish verts).

My question is: what is the best way to split up the Burst job across workers? I'd like to even out the load so my 70k job at 12ms becomes more like 2ms split across 6 workers instead of just 1.

My instinct is to split the X verts across Y workers, but it's unclear how I would go about intentionally scheduling across multiple workers. Do I have to manually slice the NativeArray across multiple Job.Schedule() calls and then join everything back up? How do I know how many jobs to spawn; is there some sort of lookup for how many workers are available? It seems like I'd want the scheduler to handle this, especially since, as you can see in the second screenshot, I am spawning some other jobs at the start of my system and I want to avoid hogging resources. I suppose I could segment the processing so those two job types don't run in parallel. There is also the observation that the time the Burst job takes is not linear in the number of vertices, which makes me think that at lower vert counts I should schedule across fewer workers and vice versa for higher vert counts. I've read that batching is more of an art than a science at times; is this one of those scenarios, and even so, what are the particular classes/methods I should be looking into?
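
To make that concrete, something like the rough sketch below is what I imagine by "manually slicing" (NoiseChunkJob, StartIndex, and EndIndex are made up for illustration, I haven't written this), though it feels like the scheduler should be handling it for me:

// Hypothetical manual split: one IJob per worker, each filling its own slice.
// (Sketch, inside the system's update; assumes Unity.Collections, Unity.Jobs, Unity.Mathematics.)
int workerCount = Unity.Jobs.LowLevel.Unsafe.JobsUtility.JobWorkerCount;
int numSamples = localNoiseSettings.GetNumSamples();
int chunkSize = (numSamples + workerCount - 1) / workerCount;

var handles = new NativeArray<JobHandle>(workerCount, Allocator.Temp);
for (int w = 0; w < workerCount; w++)
{
    var chunkJob = new NoiseChunkJob // hypothetical IJob that only fills [StartIndex, EndIndex)
    {
        StartIndex = w * chunkSize,
        EndIndex = math.min((w + 1) * chunkSize, numSamples),
        // ... same inputs/outputs as NoiseJob below ...
    };
    handles[w] = chunkJob.Schedule();
}
var combinedHandle = JobHandle.CombineDependencies(handles);
handles.Dispose();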

Also open to any talks / resources on this particular topic so I can deepen my understanding of these types of optimizations.

Here is the job:

.. inside a system execute ..

var noiseJob = new NoiseJob
{
    PseudoRandom = localPseudoRandom,
    Settings = localNoiseSettings,
    Verticies = new NativeArray<float3>(localNoiseSettings.GetNumSamples(), Allocator.TempJob, NativeArrayOptions.UninitializedMemory),
    Normals = new NativeArray<float3>(localNoiseSettings.GetNumSamples(), Allocator.TempJob, NativeArrayOptions.UninitializedMemory)
};
var noiseJobHandle = noiseJob.Schedule();

..

    [BurstCompile]
    public struct NoiseJob : IJob
    {
        [ReadOnly] public PseudoRandom PseudoRandom;
        [ReadOnly] public NoiseSettings Settings;
        [WriteOnly] public NativeArray<float3> Verticies;
        [WriteOnly] public NativeArray<float3> Normals;

        public void Execute()
        {
            float invResolution = 1f / Settings.Resolution;
            int numSamples = (Settings.Resolution + 1) * (Settings.Resolution + 1);

            for (int x = 0, i = 0; x <= Settings.Resolution; x++)
            {
                float xStep = (x * invResolution) + 0.5f;
                for(int z = 0; z <= Settings.Resolution; z++, i++)
                {
                    float zStep = (z * invResolution) + 0.5f;
                    float3 deriv;
                    float height = NoiseEcsLib.PerlinFractalSumDeriv(PseudoRandom, new float2(xStep + Settings.Offset, zStep + Settings.Offset), out deriv, Settings);
                    Verticies[i] = new float3(xStep - 0.5f, height, zStep - 0.5f);
                    Normals[i] = new float3(-deriv.x, 1, -deriv.y);
                }
            }
        }
    }

I always forget to attach things in emails and posts!


Sorry, I haven't read the full post as it's lengthy. But from what I gather you just process vertices, which are in a NativeArray.
If so, you can use IJobParallelFor for multithreaded array computing.

I have not used this yet, looks like it’s what I’m looking for. According to the docs https://docs.unity3d.com/ScriptReference/Unity.Jobs.IJobParallelFor.html

I need to specify the batch size. Is this dependent on the work being performed? In my case I think I just need to set this to something like 10k. Will test shortly…

Edit: it seems a little more complicated when wanting to use Burst, might need to use the batch… Will get back to this particular implementation after wrapping up another task.

A 10k batch size will be too much. You need to consider devices with 2, 4, 6, 8, or even 16 cores.
Double that to get the thread count.
So for 6 cores you can have up to 12 workers at a given time.
So for 40k you would need to split far smaller than 4k. Batch size values like 128, 256, 512, and 1024 seem reasonable in your case. It's easier to control workers at a less granular level.
Which batch size exactly is something you may want to stress test, using the profiler for example.
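
If it helps, you can also query the actual core and worker thread counts at runtime rather than guessing (a small sketch, assuming a Unity version that exposes JobsUtility.JobWorkerCount):

using Unity.Jobs.LowLevel.Unsafe;
using UnityEngine;

public static class WorkerInfo
{
    // Log how many logical cores and job worker threads this device actually has.
    public static void Log()
    {
        Debug.Log($"Logical cores: {SystemInfo.processorCount}");
        Debug.Log($"Job worker threads: {JobsUtility.JobWorkerCount}");
    }
}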

As a form of clarification, are you taking into consideration the Burst compilation I am hoping to achieve? If I use IJobParallelFor it seems like this counteracts wanting to use Burst, whereas with batching I'd be able to slice up the array as hoped. I'm starting to work on this now, will post back soon after digging into the docs and trying it out.
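
For reference, here is roughly how I'm sketching the conversion so far (untested, and assuming the flat index can be decoded back into x/z; same usings as the original job, i.e. Unity.Burst, Unity.Collections, Unity.Jobs, Unity.Mathematics):

    [BurstCompile]
    public struct NoiseParallelJob : IJobParallelFor
    {
        [ReadOnly] public PseudoRandom PseudoRandom;
        [ReadOnly] public NoiseSettings Settings;
        [WriteOnly] public NativeArray<float3> Verticies;
        [WriteOnly] public NativeArray<float3> Normals;

        public void Execute(int i)
        {
            // Recover the grid coordinates from the flat index.
            int rowLength = Settings.Resolution + 1;
            int x = i / rowLength;
            int z = i % rowLength;

            float invResolution = 1f / Settings.Resolution;
            float xStep = (x * invResolution) + 0.5f;
            float zStep = (z * invResolution) + 0.5f;

            float3 deriv;
            float height = NoiseEcsLib.PerlinFractalSumDeriv(PseudoRandom, new float2(xStep + Settings.Offset, zStep + Settings.Offset), out deriv, Settings);
            Verticies[i] = new float3(xStep - 0.5f, height, zStep - 0.5f);
            Normals[i] = new float3(-deriv.x, 1, -deriv.y);
        }
    }

    // Scheduled with the array length and a batch size, e.g.:
    // var noiseJobHandle = noiseParallelJob.Schedule(localNoiseSettings.GetNumSamples(), 64);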

For Burst it doesn't really matter.
It is about utilising threads most efficiently, whether the thread count differs or the size of the array changes.

Basically you want to avoid a situation where some worker sits idle.
Consider splitting a 30k array into 3 batches of 10k each.
Now, having 2 cores with 4 threads, you will utilise only 3 threads until the job is done.
For 8 cores that is an even bigger waste of potential idle time.
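
In numbers (illustrative only):

int elements = 30000;
int batchSize = 10000;
// Only 3 batches exist, so on a machine with 4 or more worker threads
// the extra threads have nothing left to steal and sit idle.
int batchCount = (elements + batchSize - 1) / batchSize; // = 3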

@Antypodish I understand wanting to spread the workload across workers; that was the motivation for the post. I wasn't aware that the Burst compiler was smart enough to span the batched jobs and was confusing that with needing to use IJobParallelForBatch. It seems like the Burst compiler does what I wanted in conjunction with IJobParallelFor, which was spreading work across available workers based on the NativeArray size input.

I'm still confused, however, as to what the 2nd parameter of Schedule is really doing under the hood. It seems like changing the value from 1 to 1024 made only a small difference, and the difference between 16 and 1024 was almost non-existent. I've tried looking at this diagram, which helped, but I'm still confused: Unity - Manual: Parallel jobs

Can you elaborate on the mental gap I have around the second parameter in relation to the referenced diagram?

Thanks for your help, you always seem to be quick to dive in and help others on these forums :slight_smile:

I’ll pitch in to help. So in that diagram, that second argument would be “2”. That’s why each batch has exactly two elements. What happens is something similar-ish to this:

while (getParallelBatch_Proprietary(out batchStartIndex))
{
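    // (illustrative) each iteration grabs the next available batch of indices for this worker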
    for (int i = batchStartIndex; i < batchStartIndex + batchSize; i++)
    {
        job.Execute(i);
    }
}

Here, getParallelBatch_Proprietary (in reality it is called something like GetWorkStealingRange) is a function that is very cheap but not free. Also, batchSize is your second argument when scheduling the parallel job.

So a batch size of 16 means that getParallelBatch_Proprietary is called 1/16th as often. And given it is already pretty cheap compared to the rest of the job, there comes a point where its performance cost, while still existent, is unmeasurable.

Hmm. I'm confused though: if I set the batchSize to something really high, wouldn't that mean that a single batch would also be really full and would only be executed on a single worker? This is not the case from my observations. Is this where the stealing paradigm comes in? This states that it only steals batches, not individual job Executes.

Let me try to elaborate on my understanding and confusion below. Perhaps I am drawing the construct incorrectly; can you provide a bulleted example like the ones below?

int[] arr = new int[1000];
Schedule(arr.Length, 1000);

  • NativeJob 1
    • Batch 1
      • Execute 1
      • Execute …
      • Execute 1000

Yet this gets executed across multiple workers?

int[] arr = new int[1000];
Schedule(arr.Length, 4);

  • NativeJob 1
    • Batch 1
      • Execute 1
      • Execute 2
      • Execute 3
      • Execute 4
    • Batch 2
    • …
    • Batch 250 (arr.Length / 4)

Which raises the question: how does it know how to split up the native jobs? Still not connecting the dots in my head :confused:

Either your observation is wrong or you found a bug. Care to share code and profiler timeline captures?

Don't worry about the details. There is no need to go down the rabbit hole,
unless you want some micro optimization.
Don't try to make code which relies on the order of splitting the array, or anything like that.
Chances are it will break, as the order per thread is not guaranteed.

Just make it work and test it.
Watch the profiler threads to see what happens. Lots of things become clear.

@DreamingImLatios There is no bug; I'm simply speculating a hypothesis that is wrong because I don't understand how the jobs are getting scheduled and batched in relation to the batch size parameter.

@Antypodish I already did make it work; I'm just trying to understand what the batchSize parameter is doing so I can make the proper changes when needed. It's not needed at the moment, as the differences in results are negligible and I'm happy with the increase in performance since refactoring to use IJobParallelFor over a bulky IJob. Just starting at a batch size of 1 and going up is okay advice, I guess; I'm just trying to understand how the load is distributed. There is nothing clearly visible in the profiler that changes when changing this parameter, not that I can tell anyway.

I guess I'll just accept the magic and be happy my Burst jobs get evenly distributed across workers for now; I just won't understand when I need to increase the batch size, and I'm still a little confused about what the proper use case for IJobParallelForBatch would be.

Edit: just to clarify what is currently happening.

I have 80k float3s that I write to in a Burst job. I converted this to IJobParallelFor and the work gets distributed evenly across the workers (10 in my case, so 80k/10 verts are getting processed per Burst execution). Updating the batch size parameter doesn't notably do anything when looking at the profile analyzer; maybe I need to look again, but it seems at higher batch counts maybe the "stealing" occurs when one worker finishes before another? idk.

I don’t know what you are testing and observing anymore, so how about you try this. Set the batch size to the total number of elements. Then run it. Notice how many threads are being used? Then set the batch size to half the number of elements, then a third, then a fourth, and so on until you match your worker thread count.

Now continue to decrease the batch size using values that don’t divide evenly into your number of threads. Notice what happens to the range of time periods when each of the threads finish.
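
If it helps picture it, here is a rough sketch of that experiment (the job instance and element count are placeholders, not your actual code):

int length = 80000; // total element count, e.g. your vertex array length
foreach (int divisor in new[] { 1, 2, 3, 4, 8, 16 })
{
    int batchSize = length / divisor;
    var handle = noiseParallelJob.Schedule(length, batchSize); // placeholder IJobParallelFor instance
    handle.Complete(); // complete each run so it shows up in isolation in the profiler
}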

Now go back and reread our previous posts. Make sense yet?


Okay, I’m all cleared up now.

Setting the batch size equal to the total number of array elements triggered only 1 job; dividing by 2 as suggested split it into 2 jobs, by 4 into 4, and so on, getting to the point where the work evenly distributes across all available workers.
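
In rough code terms, what I observed looked like this (numbers illustrative for my 80k elements / 10 workers, with job being the parallel job instance):

job.Schedule(80000, 80000); // 1 batch    -> a single worker does everything
job.Schedule(80000, 40000); // 2 batches  -> 2 workers
job.Schedule(80000, 8000);  // 10 batches -> every worker gets one batch
job.Schedule(80000, 512);   // ~157 batches (last one partial) -> workers keep stealing batches until the array is done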

So long story short, I guess when I was doing the original testing I was making a false observation; I don't think I was setting the batch size relative to the proper array size. You and I ran the same experiment, I just set up the scheduling incorrectly, which invalidated my results. I thought I was seeing no difference in the scheduling, but again I must not have been using the right configuration.

A further point of confusion was how the Burst compilation came into things, which Antypodish was trying to tell me doesn't matter, and which I now understand. What I was having a hard time seeing was how the Execute function gets called from a NativeJob, which the sample code you provided helped elaborate. The Burst compilation happens at the NativeJob level, which explains how everything still gets bundled together as expected and hoped for. When testing without Burst I noticed that exactly the same number of jobs were spawned, which reiterated/proved this point.

Makes sense to me now; I suppose taking a day's break and coming back with fresh eyes helped :slight_smile:

Thank you for the help and patience :slight_smile: :slight_smile:

[Attached profiler captures: batchsize-1, batchsize-512, batchsize-512-without-burst, batchsize-5625, batchsize-11250]


Closing remarks: it sure is nice having an easy-to-use job scheduler + Burst :smile: