I use the following non-standard configuration for launching and completing IJobParallelFor jobs:
using System.Collections.Generic;
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

public class MainClass : MonoBehaviour
{
    public int m_Count = 4;
    public List<Subclass> m_Subclasses = new();

    private void Start()
    {
        for (int i = 0; i < m_Count; i++)
        {
            m_Subclasses.Add(new Subclass());
        }
    }

    private void Update()
    {
        foreach (Subclass subclass in m_Subclasses)
        {
            subclass.OnUpdate();
        }
    }

    private void OnDestroy()
    {
        foreach (Subclass subclass in m_Subclasses)
        {
            subclass.OnDestroy();
        }
    }
}

public class Subclass
{
    private int m_Size = 32;
    public List<JobClass> m_JobClasses = new();

    public Subclass()
    {
        for (int i = 0; i < m_Size; i++)
        {
            JobClass jobClass = new JobClass();
            m_JobClasses.Add(jobClass);
            jobClass.Schedule();
        }
    }

    public void OnUpdate()
    {
        // Dispose jobs as soon as they report completion.
        for (int i = m_JobClasses.Count - 1; i >= 0; i--)
        {
            if (m_JobClasses[i].m_JobHandle.IsCompleted)
            {
                m_JobClasses[i].m_JobHandle.Complete();
                m_JobClasses[i].Dispose();
                m_JobClasses.RemoveAt(i);
            }
        }
    }

    public void OnDestroy()
    {
        foreach (JobClass jobClass in m_JobClasses)
        {
            // Ensure the job has finished before disposing its arrays.
            jobClass.m_JobHandle.Complete();
            jobClass.Dispose();
        }
        m_JobClasses.Clear();
    }
}

public class JobClass
{
    private int m_Size = 100000;
    private NativeArray<int> m_InputArray;
    private NativeArray<int> m_OutputArray;
    public JobHandle m_JobHandle;

    public JobClass()
    {
        m_InputArray = new NativeArray<int>(m_Size, Allocator.TempJob);
        m_OutputArray = new NativeArray<int>(m_Size, Allocator.TempJob);
        for (int i = 0; i < m_Size; i++)
        {
            m_InputArray[i] = i;
        }
    }

    public void Dispose()
    {
        m_InputArray.Dispose();
        m_OutputArray.Dispose();
    }

    public void Schedule()
    {
        Job firstJob = new Job()
        {
            m_InputArray = m_InputArray,
            m_OutputArray = m_OutputArray
        };
        JobHandle jobHandle = firstJob.Schedule(m_Size, 1);
        m_JobHandle = jobHandle;
    }

    public struct Job : IJobParallelFor
    {
        [ReadOnly]
        public NativeArray<int> m_InputArray;
        public NativeArray<int> m_OutputArray;

        public void Execute(int index)
        {
            m_OutputArray[index] = m_InputArray[index] * 2;
        }
    }
}
In general, it works. I can even add a second IJobParallelFor that uses the results of the first one. No errors are thrown and the calculation results are as expected. But very often a lot of warnings are issued while it runs.
First this warning:
Internal: JobTempAlloc has allocations that are more than 4 frames old - this is not allowed and likely a leak
And then this warning when I try to Dispose the used NativeArrays:
Internal: deleting an allocation that is older than its permitted lifetime of 4 frames (age = 5)
What could be wrong? All calculations and results are exactly as expected, and sometimes everything runs without any warnings. Is it really because the completion check is in Update()?
You clean up a job's native memory either when OnUpdate sees it completed or, for whatever is left over, when MainClass gets its OnDestroy call. TempJob allocations are only permitted to live for four frames, so any job whose arrays survive longer than that, for example because it performs heavy calculations or sits in the queue for several frames before completing, triggers exactly this warning.
Note that the warning doesn't appear instantly; it is delayed because these checks don't run every frame. If you want to allow allocations that live for more than 4 frames, you need to use the Persistent allocator rather than TempJob.
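As a minimal sketch, the two allocations in the JobClass constructor would become the following (Persistent allocations have no frame-age limit, but you are then fully responsible for disposing them):

```csharp
// Sketch: swap TempJob for Persistent in the JobClass constructor.
// Persistent allocations have no 4-frame age limit, but they must still be
// disposed manually or they will genuinely leak.
m_InputArray = new NativeArray<int>(m_Size, Allocator.Persistent);
m_OutputArray = new NativeArray<int>(m_Size, Allocator.Persistent);
```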
The purpose of Subclass running multiple jobs isn’t exactly clear. This could be done entirely within MainClass, unless you want to re-use Subclass in other “MainClass-like” classes. But even so, you add a certain overhead to running jobs by creating and destroying managed wrapper objects for each.
Furthermore, I wonder what the point is of scheduling multiple ParallelFor jobs of the same type (!) in parallel. There's no gain in that; quite the contrary. This may actually be the reason you get these warnings: if you schedule 32 of those parallel jobs, each trying to run on all available cores, then the remaining 31 parallel-for jobs have to wait for the first to finish, and so on. You are essentially queuing them all up and making them wait on each other, all the while their JobTemp frame counter keeps ticking.
You may want to reconsider your design here, specifically getting rid of the job-managing SubClass and combining whatever algorithm you want to perform into a single parallel job that does everything, rather than choking the job scheduler with many parallel for jobs running in parallel.
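To illustrate the idea, here is a sketch under the assumption that the 32 JobClass instances could share one flat pair of arrays (the names and sizes here are made up for the example, not part of the original code):

```csharp
using Unity.Collections;
using Unity.Jobs;

// Sketch: one parallel job over one flat array, replacing the 32 separate
// JobClass instances. The scheduler splits the index range across all cores
// once, instead of 32 parallel jobs contending with each other.
public struct CombinedJob : IJobParallelFor
{
    [ReadOnly]
    public NativeArray<int> m_InputArray;  // length = 32 * 100000, for example
    public NativeArray<int> m_OutputArray;

    public void Execute(int index)
    {
        m_OutputArray[index] = m_InputArray[index] * 2;
    }
}

// Scheduled once, with a coarser batch size than 1:
// JobHandle handle = new CombinedJob { ... }.Schedule(totalSize, 64);
```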
public void OnUpdate()
{
    for (int i = m_JobClasses.Count - 1; i >= 0; i--)
    {
        if (m_JobClasses[i].m_JobHandle.IsCompleted)
        {
            m_JobClasses[i].m_JobHandle.Complete();
            m_JobClasses[i].Dispose();
            m_JobClasses.RemoveAt(i);
        }
    }
}
This is an example from which I stripped everything superfluous to isolate the problem. The real code is more complicated: JobClass not only performs calculations but also stores their results, and these objects are later accessed from the outside. Since there are many of them, I did not inherit them from MonoBehaviour.
There are not 32 jobs. There are 32 classes that each run 10,000 jobs. MainClass creates Subclasses, and they create JobClasses. Again, this is a simplified version with everything superfluous removed, so it is not obvious from it why this structure is necessary; in short, it is World-Region-Chunk generation like in Minecraft. I already have a working version of this generation using IJobParallelFor, but that version blocks the main thread while generating a large number of chunks, which makes the application hang.
No, I want to try to implement exactly this option. In fact, this is just an attempt to rewrite an already working generator so that it does not block the main thread until all calculations are completed. This can of course be done in different ways, but I wondered how it can be done inside Jobs.
That is just too many jobs. More so if you meant to say 10k jobs per class (320k jobs).
Consider that each job needs to be queued onto a single core, and for parallel jobs each job occupies all cores for the time it is running. I would say you are clearly scheduling too many jobs contesting for a very limited resource (CPU cores), thus you are likely running into a situation where jobs are waiting more than four frames just to start doing their work!
Assuming you run this on a 32 cores CPU (not virtual but actual cores, ie Threadripper) and each job takes 0.1 ms to complete and you run 10k jobs at once then you are looking at a runtime of 32 ms or 2-3 frames for the last jobs to complete. Sometimes jobs take longer than this (I did not account for job scheduling overhead for example) and go over the 4 frames TempJob time.
But if your jobs complete in < 0.1 ms while scheduling 10k of them then each job does too little work to be effective. You’d rather want 10 jobs running 100 ms each … actually, I bet they’ll probably run way faster than this since the data they process is already in the job context, ie in native memory at a single contiguous memory location ready to be eaten by a single core that will heavily optimize data load through prefetching.
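One knob worth mentioning here is the second argument of Schedule, the inner-loop batch count, which the original code sets to 1. A sketch of the same call with a coarser batch size (1024 is an arbitrary starting point, not a recommendation from this thread):

```csharp
// The original call hands out work one index at a time:
//   JobHandle jobHandle = firstJob.Schedule(m_Size, 1);
// A larger inner-loop batch count lets each worker thread grab a contiguous
// slice of indices, reducing scheduling overhead and helping the prefetcher.
// 1024 is an arbitrary value; profile and tune for your workload.
JobHandle jobHandle = firstJob.Schedule(m_Size, 1024);
```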
At the minimum you should switch to Persistent allocators, but I'd recommend rethinking your approach, because scheduling 10k jobs, or even far fewer than 1k jobs at any given time, is not an efficient use of jobs. I remember a thread where a Unity employee advised against scheduling even "hundreds of jobs" all at once, but I was unable to find it.
My current version, which blocks the main thread until all calculations are done, generates 4 regions of 32x32 chunks, each chunk being 16x384x16 voxels. That is 4 x 1,024 x 98,304 = 402,653,184 job iterations. It does this in about 18-20 seconds on an 11700KF. The same calculations without Jobs take 40 seconds, so even if I did everything suboptimally, the performance gain is still 2x. But the problem is that during generation the application freezes, waiting for the generation to end. Now I want to try a variant that does the same without blocking the main thread. I have no idea how this will affect the execution time; I'm just experimenting.
Minecraft does it this way, from the top of my head:
- world loads …
- determine the chunk the local player spawns in
- generate that chunk
- determine chunks visible from the current camera's frustum
- loop over those visible chunks:
  - generate chunk
  - flag chunk as generated somehow
  - check if there is time left for generating more chunks this frame
    - if true: continue
    - if false: break (ie render what you got)
In the next frame chunk generation continues for chunks that are in the camera frustum but have not been flagged as generated or requiring an update. Chunks closest to camera origin (only X/Z is considered) are always generated first. This chunk generation/updating loop continues throughout the entire gameplay.
Chunks that have not been generated do not run any processing / logic. Chunk generation is fully deterministic, otherwise this approach would break.
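The per-frame budget loop described above could be sketched roughly like this (Chunk, VisibleChunksSortedByDistance, GenerateChunk and the 4 ms budget are placeholders I made up, not real APIs or measured values):

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using UnityEngine;

// Sketch of a time-budgeted chunk generation loop: generate as many
// not-yet-generated visible chunks as fit into the budget, then render
// what we have and resume next frame.
public class ChunkStreamer : MonoBehaviour
{
    class Chunk { public bool IsGenerated; }

    const double BudgetMs = 4.0; // per-frame generation budget (assumption)

    void Update()
    {
        Stopwatch timer = Stopwatch.StartNew();
        foreach (Chunk chunk in VisibleChunksSortedByDistance())
        {
            if (chunk.IsGenerated) continue;
            GenerateChunk(chunk);       // the actual voxel work would go here
            chunk.IsGenerated = true;
            if (timer.Elapsed.TotalMilliseconds > BudgetMs)
                break;                  // out of budget; continue next frame
        }
    }

    // Placeholders standing in for the real frustum/distance logic.
    IEnumerable<Chunk> VisibleChunksSortedByDistance() { yield break; }
    void GenerateChunk(Chunk chunk) { }
}
```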
There is no need for thousands of jobs. One parallel job per 16x384x16 chunk should suffice. Alternatively, if you need to calculate many chunks all at once it may be more efficient to use a regular IJob per chunk whenever you are queuing 8+ chunks at once, for example if for whatever reason you want to pre-generate every chunk in the vicinity at load time.
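A sketch of the "one plain IJob per chunk" alternative (the field names and the zero-fill body are assumptions standing in for the real generation logic):

```csharp
using Unity.Collections;
using Unity.Jobs;

// Sketch: each IJob fills one 16x384x16 chunk on a single worker thread.
// With 8+ such jobs queued, all cores stay busy without multiple parallel
// jobs contending for the same workers.
public struct GenerateChunkJob : IJob
{
    public int m_ChunkX, m_ChunkZ;     // chunk coordinates (assumed fields)
    public NativeArray<int> m_Voxels;  // 16 * 384 * 16 entries for this chunk

    public void Execute()
    {
        for (int i = 0; i < m_Voxels.Length; i++)
        {
            m_Voxels[i] = 0; // placeholder for the actual generation logic
        }
    }
}
```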
A performance gain of two is absolutely laughable, if you excuse me saying so. With properly optimized Burst code alone, not even multithreaded, you can easily see performance gains of 10-100 times. Multiply that by the number of cores (minus 10-30% for various multithreading inefficiencies) and that's what you should be aiming for!
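For reference, enabling Burst on the job from the original example only requires adding the attribute (assuming the com.unity.burst package is installed; no other code changes are needed):

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

// The job struct from the original post, with Burst compilation enabled.
[BurstCompile]
public struct Job : IJobParallelFor
{
    [ReadOnly]
    public NativeArray<int> m_InputArray;
    public NativeArray<int> m_OutputArray;

    public void Execute(int index)
    {
        m_OutputArray[index] = m_InputArray[index] * 2;
    }
}
```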
I wonder, why do you need four 32x32 regions? I would guess this is for splitscreen (local) multiplayer; in all other multiplayer cases each client calculates only its own visible chunks.
This has already been implemented: at the start of the game you can generate the required number of chunks in random order, hiding the process behind a loading screen. For testing I simply set the draw distance to the maximum, so all 4 regions are generated at once at startup.
Such performance gains are achievable in pure computation. When generating chunks there are few calculations; the main work is creating arrays, filling them in parallel, and disposing them. As far as I know, those processes cannot be accelerated that much.
It's called a region, and it's standard practice in Minecraft. It makes it easier to find neighboring chunks and voxels.