Is it possible to set core count limits for a particular long running IJobParallelFor job?

In my game I have a number of periodic jobs with extremely long runtimes (3-8 seconds with Burst when running on a single thread w/ highly optimized code). All of the jobs are configured as IJobParallelFor, which works wonderfully at cutting their execution down to under 200 ms on 24 cores…

However, I’m now getting a side effect of very noticeable frame drops as other job systems (physics/animation/HDRP/etc.) end up calling Complete() on the main thread for their jobs, resulting in my whale of a job being flushed to the main thread so theirs can be un-backlogged from the worker queue and finished immediately.

I was wondering if there was any way to manually specify the max number of cores you can dedicate to a particular IJobParallelFor, or perhaps to another variant like IJobParallelForBatch? (I’m not interested in setting the global number of total worker threads with JobsUtility.JobWorkerCount – I just want to know if it’s possible to reserve, like, 1 or 2 of the threads my monster is gobbling up for the poor physics engine.)

There are two ways I can theorize going about this:

A) Manually split my job up into (JobsUtility.JobWorkerCount - 1) separate IJobs and then build a system to track each job individually / reassemble the output data afterwards – which is possible, but is also a lot of extra overhead for something I think should be fairly simple.

or to do this more indirectly by

B) Specifying the minimum job batch size to be (iteration count) / (JobsUtility.JobWorkerCount - 1)… which doesn’t excite me much, as that basically removes the job system’s ability to balance workloads by stealing work from other threads…

I’m also not sure this method would prevent the backlog issue if another (medium-length) job already had its work scheduled/queued up before the built-in systems (physics, etc.) called Complete() on their jobs… It seems like I would just be kicking the can down the road in terms of limiting traffic to prevent the bottleneck… while also potentially doubling my job’s total execution time if just one of the “giant” job chunks happened to get queued after the first batch of (JobsUtility.JobWorkerCount - 2) threads.

It would be interesting to see how close to a compute shader-like syntax they can get this, but I also wonder if that is, or isn’t their goal.

@Nyanpas if you haven’t seen it, check out https://www.ilgpu.net/


@Nyanpas
Um, I’m not sure if by “they/their” you were referring to me or not; I assume not.

I didn’t consider using a compute shader because I’m making a very large web-like network of node-based jobs which need to be able to pass data directly using native arrays, to avoid RAM bloat and needless data copying from RAM to the GPU and back, or from native arrays to managed C# arrays… From my testing, even uploading a 10-15 MB texture to the GPU can cause a ~100 ms stutter from the graphics thread being bottlenecked. Also, for my app the GPU is under about 80-95% load while running the rest of the app (in VR), so I didn’t want to overtax it even more and as a result get my framerate cut in half by missing the frame timing by a few ms (VR plugins are like VSync in that way).

Then again, I don’t know much about compute shaders; I’ve generally just avoided them because of their inability to handle complex if statements well on a pixel-by-pixel basis.

What do you mean by “getting close to a compute shader-like syntax”? Do they have built in limits for how much of the compute resources go to a single compute shader or something like that?

@jasons-novaleaf I’ll check out your link when I have a bit more time. But I doubt that rewriting my longest-running nodes to follow that strategy would be of much benefit to me at this point (mostly for the above memory copying reasons).


Amazenyan. Thank you.

You can abuse IJobParallelFor for this if you really want to, but it’ll break some of the safety stuff. Like (rough sketch after the list):

  • Queue an IJobParallelFor with a 0 … 4 range, with an inner loop batch size of 1
  • Don’t actually use the index passed in via that, except possibly as a kind of thread ID
  • Pass a shared int pointer, which you increment atomically in a loop, and use that as the actual index (or return if the index has passed the desired range).
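
Here’s a minimal sketch of what that could look like, assuming a NativeArray<float> payload; the names (AtomicCounterJob, SharedCounter, etc.) are just illustrative:

```csharp
using System.Threading;
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Jobs;

// Scheduled as: job.Schedule(4, 1) – a dummy 0…4 range with batch size 1,
// so at most 4 workers ever pick this job up.
unsafe struct AtomicCounterJob : IJobParallelFor
{
    [NativeDisableUnsafePtrRestriction]
    public int* SharedCounter; // points at a single int, e.g. from a NativeArray<int>

    public int ItemCount; // the real iteration range

    [NativeDisableParallelForRestriction]
    public NativeArray<float> Data;

    public void Execute(int workerSlot)
    {
        // workerSlot is only useful as a pseudo thread ID; the real index
        // comes from the shared counter.
        while (true)
        {
            int i = Interlocked.Increment(ref *SharedCounter) - 1;
            if (i >= ItemCount)
                return; // the real range is exhausted

            Data[i] *= 2f; // placeholder for the actual per-index work
        }
    }
}
```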

I’m not sure this will work for your use case, but I’ve used something similar for a case where I wanted to do some per-thread setup work before processing any indices, combined with a variable amount of work to be done.


Yeah, I end up designing jobs which do work per row instead of per index most of the time, to improve cache hit rates or because they need to work with sequential chunks of data (like for a rolling average).

@Zuntatos Thanks for the tip about the shared pointer, I had not thought about that and it makes sense.

I use NativeSetThreadIndexAttribute to get job thread IDs, mostly in jobs which need working arrays per core once in a while (segmenting a single native array into “sub” arrays without actually using NativeSlices).
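
For reference, a small sketch of that pattern (PerThreadScratchJob and ScratchPerThread are my own placeholder names):

```csharp
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Jobs;
using Unity.Jobs.LowLevel.Unsafe;

// One NativeArray segmented into a scratch region per potential worker thread;
// [NativeSetThreadIndex] tells each execution which segment is safely its own.
struct PerThreadScratchJob : IJobParallelFor
{
    [NativeSetThreadIndex]
    int m_ThreadIndex; // filled in by the job system before Execute runs

    public int ScratchPerThread; // segment length per thread

    [NativeDisableParallelForRestriction]
    public NativeArray<float> Scratch; // length = JobsUtility.MaxJobThreadCount * ScratchPerThread

    public void Execute(int index)
    {
        int offset = m_ThreadIndex * ScratchPerThread;
        // Scratch[offset .. offset + ScratchPerThread - 1] belongs to this thread only.
        Scratch[offset] += index;
    }
}
```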

Yesterday I tried out my proposed solution B, and it actually works fairly well. I’ve been using it with some extra code to manage the longer jobs (adjusting scheduling to avoid overlaps), and these monolithic jobs now run without frame drops.


Plan B is fine, but you lose the job load balancing. I hope it will be possible to set a per-job worker count one day.

I’d like to +1 this feature request. I really wish there were some tuning options we could provide to the scheduler for jobs. Things like max number of workers to use, or the inverse to say leave at least x workers free, or some priority setting that lets physics, rendering and other basics keep chugging along uninterrupted.

I have potentially-but-not-always longer running procedural generation jobs, and I often have to queue up many (100s) at a time. This floods the job system with a series of tasks, some of which may take nearly no processing (nearly empty terrain chunks, for example) and some of which may take a second or two to crunch through. I tried making a queue so only a small number are ever scheduled each frame, but this ends up taking way too long and leaves workers mostly idle for entire frames, because it’s not known ahead of time which jobs are the quick ones. Ideally, the job system could just be told not to consume all its workers with these tasks, while letting the workers which are devoted to them stay saturated.

You can hack the batch size to force any particular IJobParallelFor to run on at most however many cores you want, as I talked about above.

I’ll expand on the options here in order of their simplicity to implement and their usefulness. Each layer adds a lot more complexity to how careful you need to be when scheduling.

0 – Super easy system wide change:

You can also set the total number of worker threads; although this affects the available core count for the entire job queue system, not just per job, and is more intended to allow background tasks like screen-capture software to run without hiccups. For that, see: https://docs.unity3d.com/ScriptReference/Unity.Jobs.LowLevel.Unsafe.JobsUtility.JobWorkerCount.html
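
For example (a sketch; this is the global knob, nothing per-job about it):

```csharp
using Unity.Jobs.LowLevel.Unsafe;
using UnityEngine;

public class WorkerCountSetup : MonoBehaviour
{
    void Awake()
    {
        // Leave two hardware threads free for everything outside the job
        // system. This affects every job, not just the long-running ones.
        JobsUtility.JobWorkerCount = Mathf.Max(1, JobsUtility.JobWorkerMaximumCount - 2);
    }
}
```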

1 – Super Easy Per Job)

Basically, for each long-running IJobParallelFor set
batchSize = ceil(itemsToProcess / maxWorkers), where maxWorkers is the total number of worker threads you want to allow (one or two fewer than what the machine supports is what I’ve found works best for me).
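
As a sketch (myMonolithicJob and data are placeholders for your own job and its input):

```csharp
using Unity.Jobs;
using Unity.Jobs.LowLevel.Unsafe;

// With N items and a batch size of ceil(N / maxWorkers), the job splits into
// at most maxWorkers batches, so at most that many workers can ever grab one.
int itemsToProcess = data.Length;
int maxWorkers = JobsUtility.JobWorkerCount - 2; // leave ~2 workers free
if (maxWorkers < 1) maxWorkers = 1;

int batchSize = (itemsToProcess + maxWorkers - 1) / maxWorkers; // integer ceil

JobHandle handle = myMonolithicJob.Schedule(itemsToProcess, batchSize);
```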

2 – Easy but can get complex depending on job count(s) – and the number of parallel tasks)

Layered on top of that, if you have a problem with multiple IJobParallelFors running at the same time and not leaving a worker thread open, you can often get creative with their JobHandles to ensure that only one IJobParallelFor is running at a time. I don’t know how much work that would be for you since you’re running several hundred jobs in series; but the processing load I’m using for music analysis is comparable in its complexity, and I’ve managed to make it work most of the time in the two years since I started this thread.
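
One way to wire that up – a sketch, where MonolithScheduler is my own naming, not a Unity type:

```csharp
using Unity.Jobs;

// Funnel every "monolithic" job through one JobHandle chain so at most one
// of them is in flight at a time, even when they don't share any data.
public class MonolithScheduler
{
    JobHandle _chain; // a default handle counts as already complete

    public JobHandle Schedule<T>(T job, int length, int batchSize)
        where T : struct, IJobParallelFor
    {
        // Each new monolith waits on the previous one; passing an already
        // completed handle as a dependency is harmless.
        _chain = job.Schedule(length, batchSize, _chain);
        return _chain;
    }
}
```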

Where this doesn’t always work:

For me, the one type of frame hiccup I haven’t been able to fully resolve is the occasional job in the middle of a long chain having its predecessors finish, and then some background job which needs to complete this frame – e.g. for physics, animation rigging, a particle system, or rendering – ends up calling Complete() on its job. That nearly unavoidable inevitability causes the simple job to wait behind whatever monolithic IJobParallelFor happened to enter the runtime queue just before the built-in system’s scheduled task got to call Complete(). It results in very unpredictable, long frame time spikes as the IJobParallelFor is forced onto the main thread before the new job can be completed.

3 – Total job size optimization – scheduling larger IJobParallelFor into smaller batches)

The only way I’ve found of fixing this is to break the IJobParallelFors down into even smaller chunks, so when it does happen the hiccup only needs to wait for 3-4 random ~10 ms main thread stalls instead of one massive 300 ms wait (i.e. several of the smaller jobs end up being pulled onto the main thread instead of one massive IJobParallelFor running on 22 cores for 300 ms). This is also a good way to deal with jobs which have massive runtime RAM use.
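
Roughly like this (MySliceJob, totalItems and innerBatchSize are placeholders, not Unity APIs):

```csharp
using Unity.Jobs;

// Instead of one job over the whole range, schedule several chained slices.
// A forced Complete() on the main thread then only waits on a small slice.
JobHandle handle = default;
const int sliceLength = 4096; // illustrative; tune per workload

for (int start = 0; start < totalItems; start += sliceLength)
{
    int count = System.Math.Min(sliceLength, totalItems - start);
    var slice = new MySliceJob { StartIndex = start /* plus the shared data */ };
    handle = slice.Schedule(count, innerBatchSize, handle);
}
```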

Thanks, I’ll try messing with this, but unfortunately the jobs are not all queued at once. They are queued in batches intermittently as the player moves through the world, and there are probably several concurrent batches at times. I presume what you mean by getting creative with JobHandles is making each IJobParallelFor dependent on the previous one (assuming the JobHandle is not complete)? That may work. I’ll give it a shot.

Sure would be easier if Unity just added the feature though. :stuck_out_tongue:

Exactly. Good luck!

My entire audio analysis is set up as a tree structure, with very clearly defined data flow routes and a fair amount of code dedicated to addressing this problem. My main strategy thus far has been to give each “node” in the tree one of three demand levels – monolithic (anything which runs for more than a couple hundred ms on a single thread), normal (more than ~10 ms on a single thread), and tiny (anything less, which I don’t care about since it will take less than a few ms). The tree basically just tries to make sure that no “normal” or “monolithic” nodes are scheduled to run while any other “monolithic” node is scheduled to run. The only two tools the job system provides to manage this are 1) when you schedule the tasks, and 2) what job handles you give them as dependencies, so essentially everything that needs to run after a “monolithic” node gets that monolith’s final job handle added as an extra dependency (even if it doesn’t actually need the monolith’s data). I don’t think there are any downsides to passing in already-completed handles as dependencies for new jobs, which removes a lot of the complexity in scheduling complex systems.