I am trying to find the optimal number of jobs to use on iOS. I created an array and a varying number of IJob structs, each updating a different part of the array. I found that regardless of the number of jobs I created, updating the array through the job system always takes longer than updating it without the job system (i.e. using just the main thread). Is this expected? Basically, does IJob lead to worse performance on iOS?
I tested different array lengths from 100 to 10000 and updating is just incrementing every array element. I tested 1 to 100 jobs.
Packing and distributing the jobs takes up CPU time as well. If each job doesn’t do much work, this overhead can outweigh the performance benefit of multithreading.
Maybe iOS is doing something weird, but then you should see some logs pop up.
You could try with more intense jobs to see if it helps.
You may want to use IJobParallelFor instead. That will be more efficient when it comes to parallel processing. If you really need a job for indices 0-999 and 1000-1999 and so on you can make use of NativeSlice to essentially “split” the array into chunks without actually making copies of the array (I believe).
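As a rough illustration of the IJobParallelFor suggestion, here is a minimal sketch of the same "increment every element" workload. The names `IncrementParallelJob` and `IncrementParallelExample` are hypothetical, and the batch size of 64 is an assumption to tune, not a recommended value.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

// Hypothetical job: adds 1 to one element per Execute call.
[BurstCompile]
public struct IncrementParallelJob : IJobParallelFor
{
    public NativeArray<float> values;

    public void Execute(int index)
    {
        values[index] += 1f;
    }
}

public static class IncrementParallelExample
{
    public static void Run(int arraySize)
    {
        var values = new NativeArray<float>(arraySize, Allocator.TempJob);
        var job = new IncrementParallelJob { values = values };

        // Second argument is the inner-loop batch count: how many indices
        // a worker processes per batch. 64 is an assumed starting point.
        JobHandle handle = job.Schedule(values.Length, 64);
        handle.Complete();

        values.Dispose();
    }
}
```

With this form the job system splits the index range into batches itself, so no NativeSlice bookkeeping is needed unless explicit chunk boundaries matter.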
Keep in mind that it matters a lot when you call Complete(). For example:
job.Schedule().Complete();
If job is an IJob this is essentially blocking the main thread waiting on the background IJob thread to finish. If however you Schedule one or more jobs in Update and call Complete() in LateUpdate this can essentially mean “free processing” during the time between Update and LateUpdate (or even between two frames’ Update/LateUpdate depending on whether you really need the results right away).
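The Update/LateUpdate split described above could look roughly like this. It is only a sketch: the class name `DeferredCompleteExample`, the nested `IncrementJob`, and the array length of 10000 are all made up for illustration.

```csharp
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

// Hypothetical component: schedules in Update, completes in LateUpdate.
public class DeferredCompleteExample : MonoBehaviour
{
    private NativeArray<float> _values;
    private JobHandle _handle;

    private struct IncrementJob : IJob
    {
        public NativeArray<float> values;

        public void Execute()
        {
            for (var i = 0; i < values.Length; i++)
            {
                values[i] += 1f;
            }
        }
    }

    private void OnEnable()
    {
        // Persistent because the array outlives a single frame.
        _values = new NativeArray<float>(10000, Allocator.Persistent);
    }

    private void Update()
    {
        // Schedule only; the main thread keeps running other Update code.
        _handle = new IncrementJob { values = _values }.Schedule();
    }

    private void LateUpdate()
    {
        // Block here only if the job has not finished yet.
        _handle.Complete();
        // _values is safe to read on the main thread from this point.
    }

    private void OnDisable()
    {
        _handle.Complete();
        _values.Dispose();
    }
}
```

Compared with `job.Schedule().Complete()` on one line, the main thread only waits if the job is still running when LateUpdate is reached.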
Lastly, be sure that Burst and job threads are both enabled under the Jobs menu, and check the platform-specific settings under Project Settings => Burst AOT Settings. For measuring, you should set OptimizeFor to Performance; this overrides the same setting specified via the BurstCompile attribute.
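For reference, the per-job form of that setting is a property on the attribute. This is a sketch that assumes a Burst version recent enough to expose `OptimizeFor`; the job name `BurstIncrementJob` is hypothetical.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

// Per-job optimization hint. As noted above, the Burst AOT Settings value
// takes precedence over this attribute setting in player builds.
[BurstCompile(OptimizeFor = OptimizeFor.Performance)]
public struct BurstIncrementJob : IJob
{
    public NativeArray<float> values;

    public void Execute()
    {
        for (var i = 0; i < values.Length; i++)
        {
            values[i] += 1f;
        }
    }
}
```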
Sorry for not posting my code earlier, but this is what I have. I’m using the Unity Performance Testing Extension to measure time. I’m using IJob instead of IJobParallelFor because with IJob I can precisely control the number of jobs allocated, whereas with IJobParallelFor Unity schedules “an appropriate number of jobs” according to the docs.
I also tried using [BurstCompile], which gave a performance boost.
I run the performance tests from the command line; the results are .xml files, which I parse with a Python script.
internal class BenchUnityJobSystem
{
    private const int WARM_UP = 10;
    private const int ITERATIONS = 10;
    private const int MEASUREMENTS = 100;
    private const int MAX_JOB_COUNT = 100;
    private static int[] s_job_counts = InitJobCounts();
    private static float[] s_result_array;
    private static List<BenchJob> s_jobs;
    private static NativeArray<JobHandle> s_jobHandles;

    private static int[] InitJobCounts()
    {
        var job_counts = new int[MAX_JOB_COUNT + 1];
        for (var i = 0; i <= MAX_JOB_COUNT; i++)
        {
            job_counts[i] = i;
        }
        return job_counts;
    }

    [Test, Performance]
    public static void TestArray100([ValueSource(nameof(s_job_counts))] int jobCount)
    {
        RunExperiments(jobCount, 100);
    }

    [Test, Performance]
    public static void TestArray500([ValueSource(nameof(s_job_counts))] int jobCount)
    {
        RunExperiments(jobCount, 500);
    }

    [Test, Performance]
    public static void TestArray1000([ValueSource(nameof(s_job_counts))] int jobCount)
    {
        RunExperiments(jobCount, 1000);
    }

    [Test, Performance]
    public static void TestArray5000([ValueSource(nameof(s_job_counts))] int jobCount)
    {
        RunExperiments(jobCount, 5000);
    }

    [Test, Performance]
    public static void TestArray10000([ValueSource(nameof(s_job_counts))] int jobCount)
    {
        RunExperiments(jobCount, 10000);
    }

    [Test, Performance]
    public static void TestArray50000([ValueSource(nameof(s_job_counts))] int jobCount)
    {
        RunExperiments(jobCount, 50000);
    }

    [Test, Performance]
    public static void TestArray100000([ValueSource(nameof(s_job_counts))] int jobCount)
    {
        RunExperiments(jobCount, 100000);
    }

    private static void InitializeExperiments(
        int jobCount,
        int arraySize,
        NativeArray<float> result
    )
    {
        s_result_array = new float[arraySize];
        s_jobs = new List<BenchJob>();
        if (jobCount > 0)
        {
            var startIndex = 0;
            var jobSize = arraySize / jobCount;
            var bigJobSize = jobSize + 1; // big job has 1 more array element than average job, to evenly distribute array
            var bigJobCount = arraySize % jobCount;
            for (var i = 0; i < jobCount; i++)
            {
                var job = new BenchJob()
                {
                    slice = new NativeSlice<float>(
                        result,
                        startIndex,
                        i < bigJobCount ? bigJobSize : jobSize
                    )
                };
                startIndex += i < bigJobCount ? bigJobSize : jobSize;
                s_jobs.Add(job);
            }
            Debug.Assert(startIndex == arraySize);
        }
    }

    private static void RunExperiments(int jobCount, int arraySize)
    {
        var result = new NativeArray<float>(arraySize, Allocator.TempJob);
        s_jobHandles = new NativeArray<JobHandle>(jobCount, Allocator.TempJob);
        InitializeExperiments(jobCount, arraySize, result);
        Measure
            .Method(() =>
            {
                if (jobCount != 0)
                {
                    UseJobSystem();
                }
                else
                {
                    NoJobSystemNativeArray(result);
                }
            })
            .WarmupCount(WARM_UP)
            .IterationsPerMeasurement(ITERATIONS)
            .MeasurementCount(MEASUREMENTS)
            .Run();
        result.Dispose();
        s_jobHandles.Dispose();
    }

    public static void UseJobSystem()
    {
        for (var i = 0; i < s_jobs.Count; i++)
        {
            var job = s_jobs[i];
            s_jobHandles[i] = job.Schedule();
        }
        JobHandle.CompleteAll(s_jobHandles);
    }

    private static void NoJobSystemNativeArray(NativeArray<float> result)
    {
        for (var i = 0; i < result.Length; i++)
        {
            var temp = result[i];
            temp += 1;
            result[i] = temp;
        }
    }

    // [BurstCompile]
    public struct BenchJob : IJob
    {
        [NativeDisableContainerSafetyRestriction]
        public NativeSlice<float> slice;

        public void Execute()
        {
            for (var i = 0; i < slice.Length; i++)
            {
                var temp = slice[i];
                temp += 1;
                slice[i] = temp;
            }
        }
    }
}
ALWAYS use BurstCompile! You’ll waste a ton of performance not doing so. In fact, it’s not uncommon to see a far greater gain in performance by using Burst compared to parallelizing a job without Burst!
For example, I do some parallel processing in the editor on meshes with Jobs and this finishes in 20 ms with Burst enabled:
Now I go to Jobs => Burst => Enable Compilation [unchecked]:
Posting screenshots in case you won’t believe me.
More so, when I would run this process the way I had it before I ported it to Jobs and Burst, this would have taken … roughly 5 seconds minimum.
Given the kind of tests you are running, I strongly recommend that you stop wasting your time profiling artificial workloads like this. Write actual game code, profile that, then optimize if necessary! You’ll eventually get a grip on what is better in which scenario, but unless you’re actually computing something meaningful in these jobs you will not gain any insight.
Any performance difference you see in tests like these will very likely not matter, or may even give a false impression of what’s faster once you get to writing actual game code.
The editor is not really a fair comparison; the editor runs Mono, whereas iOS builds run on IL2CPP. Also, the editor has collections checks (which Burst can strip out if you have Burst safety checks off), whereas iOS builds don’t. Burst still helps in builds, sometimes by a lot, but it’s rarely by the same insane factor that it is in the editor.
That said, the original experiment as written is not necessarily a fair comparison either. The job system version allocates a large managed array for every experiment (and then seemingly never uses it?), which takes time by itself and also generates GC pressure, which could easily spike during the test execution.
Also, the jobs access the data via NativeSlices, whereas the main thread accesses the NativeArray directly. I don’t know if this explains the whole difference; there is also scheduling overhead, which we have worked hard to reduce in 22.2. You can also Burst the schedule site in 22.2, which should help with that overhead. Said overhead becomes more apparent the less work each job does, so if you see the jobs get relatively faster as the workload gets bigger, that could be what’s going on.
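One way to narrow down the NativeSlice-versus-NativeArray question is to make the main-thread baseline use the same access path as the jobs. The sketch below is an assumption about how that could be done, not a change to the posted benchmark; `SliceIncrementJob` and `FairBaseline` are hypothetical names.

```csharp
using Unity.Collections;
using Unity.Jobs;

// Hypothetical job with the same body as the benchmark's slice loop.
public struct SliceIncrementJob : IJob
{
    public NativeSlice<float> slice;

    public void Execute()
    {
        for (var i = 0; i < slice.Length; i++)
        {
            slice[i] += 1f;
        }
    }
}

public static class FairBaseline
{
    // Baseline A: plain main-thread loop, but through a NativeSlice,
    // so data access matches what the scheduled jobs do.
    public static void IncrementViaSlice(NativeArray<float> result)
    {
        var slice = new NativeSlice<float>(result);
        for (var i = 0; i < slice.Length; i++)
        {
            slice[i] += 1f;
        }
    }

    // Baseline B: the same job struct executed immediately on the
    // main thread with Run(), so no worker thread is involved.
    public static void IncrementViaRun(NativeArray<float> result)
    {
        var job = new SliceIncrementJob { slice = new NativeSlice<float>(result) };
        job.Run();
    }
}
```

If baseline B is close to the plain loop while the scheduled version is much slower, the gap is more likely scheduling overhead than slice access.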