Re-ordering work inside a job without creating new jobs

I’m working on a few things (data crunching / parallel memory-range access), and I’m trying to find a way to work with the cores/threads inside the job without creating new jobs (I could also cap it at max threads <= 32?). Although creating job on top of job was fast in my previous iteration (that was a bit-sort job), I think I can do better with less code! I’ve left out the actual data part to focus on this part. I feel this might be a lot more efficient than coming out of the job only to recreate it (no core will read/write the same data range). I know that race conditions are the challenging part, but I can only see that happening if two threads find the same next core at the same time (highly unlikely, but still possible!)

Anyway, constructive thoughts are welcome, and let me know if there is a better way to do this. I mean, is there some way to lock a variable in a job? i.e. bool LookingForNextCore = true / don’t look for the next core while true :slight_smile:

using Unity.Jobs;

public unsafe struct NativeCoreJob : IJobParallelFor
{
    public uint maxCore;                        // Bit of the highest core, i.e. 1u << (coreCount - 1)
    public uint coresActive;                    // Initially set to maxCore, so work starts on one core

    public void Execute(int i)
    {
        // Bit position of this core (of max 32 cores) - starting point is always the max core, activated below
        uint thisCore = 1u << i;


    CoreStart:

        // While some cores are active, but this core's bit is not set in coresActive, spin-wait
        while (coresActive != 0 && (thisCore & coresActive) != thisCore) { }

        // If no cores are active, assume all work has ended across cores, and return
        if (coresActive == 0)
        {
            return;
        }

        /* Do work */

        bool noMoreWork = true;

        // If there is no more work on this core, clear this core's bit and goto CoreStart
        if (noMoreWork)
        {
            coresActive &= ~thisCore;
            goto CoreStart;
        }

        // If there is work to split, find the next free (inactive) core
        uint nextCore = maxCore;
        while (nextCore != 0 && (nextCore & coresActive) != 0)
        {
            nextCore >>= 1;
        }

        // If no cores are free, queue the work for this core instead
        if (nextCore == 0)
        {
            /* Queue right range for this core */
        }
        // Otherwise, queue the work for the next core, then set its coresActive bit to 1
        else
        {
            /* Queue right range for next core */
            coresActive |= nextCore;
        }

        goto CoreStart;
    }
}

Really?

I refuse to comment. :stuck_out_tongue_closed_eyes:

So you already have a parallel job. Give it something to do in parallel! I honestly have no idea what you are trying to accomplish here. You cannot force jobs onto cores or whatever. This is best left to the job scheduler.

I’m not defining which core does the work; I know that’s not possible (my wording is just how I’m trying to describe it). But say I have 8 indexes, each working on a different range of memory: where do threads come into this, or are multiple threads going to be working on each index regardless? Presumably they would be smart enough to figure out that each index is working on a different memory range, but I’m not sure! I’ve done plenty of the usual parallel jobs; I’m just trying something a little different. Since I’m splitting up memory ranges, my previous method was to extract this information and schedule a new job. It might be inefficient with each index checking to see if it can do anything each cycle (until active).

If I can get this to work, I’ll test it on my previous working parallel bit-sort algorithm, and also on a parallel compression WIP. As for the labels, they are readable to me as they are bright yellow; I only used them here to make the code more English, before making it less English with while(true){ {break} or {continue}}. :slight_smile:

I’d strongly suggest reading the source code for IJobFor and IJobParallelForBatch. You’ll then have a better understanding of how indices get distributed across threads. Basically, each thread grabs a set of indices using JobsUtility.GetWorkStealingRange(), does the work, and then repeats until there are no indices left.
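For anyone following along, the pattern inside Unity’s parallel-for producer loop looks roughly like this (a simplified sketch from memory, not the exact Unity source; GetWorkStealingRange and JobRanges are the real JobsUtility APIs):

using Unity.Jobs;
using Unity.Jobs.LowLevel.Unsafe;

// Simplified sketch of how a parallel-for producer distributes indices.
// Each worker thread repeatedly steals a range of indices until none remain.
static void ProducerLoopSketch<T>(ref T jobData, ref JobRanges ranges, int jobIndex)
    where T : struct, IJobParallelFor
{
    while (JobsUtility.GetWorkStealingRange(ref ranges, jobIndex, out int begin, out int end))
    {
        for (int i = begin; i < end; i++)
            jobData.Execute(i);
    }
}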

When you schedule a parallel job, the job system schedules a job struct for each worker thread. However, not all threads need to be active for the job to be processed or exhaust all indices. And the threads themselves pick up the job structs. If you have a few single-threaded jobs at the same time, sometimes one worker thread will pick up two instances of the same job struct, with the second one immediately returning because the indices have already been exhausted.

Thank you, this is definitely something you know your stuff on! I’m always on my own optimisation adventure/misadventure. While I probably confused things by talking about cores, I actually want each thread to grab one index, and I hope that threads are shared automatically among cores. The memory range used per thread (at least in this BitSort case) probably cannot really be processed in parallel, only segments of the whole range per thread. Also, the first/only thread will work on the whole range, likely split per cycle.

Of course this can be done by completing the job and then starting a new one (which I did before), but I’m trying to avoid that. Maybe rescheduling is slower, I’m not sure, but it’s also tidier all in one place with less code. Therefore I schedule with a length of 32 (about right, although 64 would be possible with a ulong), and each Execute(threadIndex) remains active until all threadIndexes are completed within the job. The active threads are stored in the bits of the uint.

The tricky part is how to communicate that across the indices. I decided that having a searchLock might work together with Interlocked. The time between searching and releasing will be negligible, so it’s unlikely that threads will end up queuing together.
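A minimal sketch of what that searchLock could look like, assuming the flag lives in unmanaged memory shared through a pointer (the struct and field names here are hypothetical). A plain field would not work, since each worker thread operates on its own copy of the job struct:

using System.Threading;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Jobs;

public unsafe struct SearchLockSketch : IJobParallelFor
{
    // Hypothetical: points at a single int (0 = free, 1 = held) allocated
    // with UnsafeUtility.Malloc and shared by all copies of this struct.
    [NativeDisableUnsafePtrRestriction] public int* searchLock;

    public void Execute(int i)
    {
        // Spin until we atomically flip the lock from 0 to 1.
        while (Interlocked.CompareExchange(ref *searchLock, 1, 0) != 0) { }

        /* search for the next free core here */

        // Release: publish 0 so another thread can enter.
        Interlocked.Exchange(ref *searchLock, 0);
    }
}

The CompareExchange call only lets one thread through at a time: whoever flips the flag from 0 to 1 wins, and everyone else spins until the winner writes 0 back.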

Also, I’m not sure what distributes the threads, and whether that changes while I’m still within the same Execute(threadIndex). I wouldn’t want 10 threads running on one core; I’m not sure what handles that part, the OS depending on load? I did a little research, and apparently Thread.Sleep(1) stops the CPU from being overused while waiting.

I suppose I’ll see what happens in the profiler when I get to test it! Anyway, your thoughts on any of the below will be helpful. :slight_smile:

*Small error there: I used ‘is ZERO’ rather than == ZERO. Although ‘is’ might be more English for == :wink:

Unity creates a fixed number of threads for you at the start of the application. Unless you use a very specific API, it won’t spin up additional threads. It definitely does not create new threads for each job.

This means that the number of threads running, and the number of indices you specify are completely independent. A given thread might grab 5 indices, or it might grab 0. But if your innerloop batch count is 1, then each thread will at the very least only grab one index at a time, and do that work before it tries to grab another. Thus indices are distributed on a first-come-first-serve basis, where threads can help themselves to seconds or thirds even before the other threads are ready to jump in.
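As a concrete illustration of the batch count (the job here is hypothetical, written only to show the Schedule parameters):

using Unity.Collections;
using Unity.Jobs;

// Hypothetical job used only to illustrate the batch count parameter.
struct FillJob : IJobParallelFor
{
    public NativeArray<int> data;
    public void Execute(int i) => data[i] = i;
}

static class BatchCountExample
{
    public static void Run()
    {
        var data = new NativeArray<int>(1024, Allocator.TempJob);

        // innerloopBatchCount = 1: each thread steals one index at a time.
        // A larger value (e.g. 64) lets a thread grab 64 indices per steal,
        // trading granularity for less per-steal scheduling overhead.
        new FillJob { data = data }.Schedule(data.Length, 1).Complete();

        data.Dispose();
    }
}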

Unity will also sleep threads when jobs aren’t running. And sometimes the OS may suspend a thread from running temporarily if there are other awake threads across all processes demanding execution. This means it is possible that in a parallel job (especially one with not a lot of work), one thread might consume all the indices before any of the other threads even have a chance at seeing the job. That’s not necessarily a bad thing. You want your code to work under these and similar circumstances.

As for thread distribution across cores, the OS generally decides that, and it will try its best to load-balance things. A thread isn’t tied to one core, and the OS may move it around to different cores periodically. If you have a heavy single-threaded process and look at a resource monitor, you would likely notice that every once in a while the core with high utilization suddenly goes nearly idle and a different core spikes instead.

My thought is that you are probably going to deadlock Unity if you go about it this way.

If you have so much work where parallelizing helps, you should be able to accomplish it by scheduling multiple jobs in succession without completing them in-between. If you have trouble figuring out how to do that, I can certainly provide you some tips, but you may need to fully describe the steps of your algorithm so that I understand why you are completing jobs to begin with.

If you don’t have enough work where the overhead of multiple jobs is problematic, odds are you are better off with a single-threaded job. Single-threaded jobs can take advantage of cache efficiencies for small workloads much better than parallel jobs can. So in terms of throughput, they are usually more optimal.
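For comparison, the single-threaded version of a sort is just an IJob; a minimal sketch, assuming Unity.Collections’ NativeSortExtension for the actual sort:

using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

// A single-threaded sort job: one worker thread walks the whole array, so
// the data streams through a single core's caches with no cross-thread traffic.
[BurstCompile]
struct SingleThreadedSortJob : IJob
{
    public NativeArray<int> data;

    public void Execute() => data.Sort(); // NativeSortExtension from Unity.Collections
}

Scheduled with new SingleThreadedSortJob { data = arr }.Schedule(), it still runs off the main thread, just on one worker.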

Thank you. That makes sense with my original idea of maxThreads being the max processor count. Whether this will work as intended, I don’t know until I try it. At least I’ll succeed in finding out what doesn’t work if that’s the case! To simplify how this concept works (at least in my head; I’m not sure how other bit-sort algorithms work), here’s a graphic. The idea is more reading than writing. On the initial sort I also create a bitmask of redundant bits that don’t need sorting, to define the next bit to sort. It won’t work exactly as below, it’s a little more complex (i.e. what happens if a range is all 0s or 1s), but it gives the idea. And when the remaining range to sort is negligible, normal sort methods may make more sense for <16 items, at least queuing onto the same thread to finish until expiry. Thread 1 will always be working on the left-most range until expiry.

Array of 16 x byte (for simplicity purposes)
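To pin the idea down, here is one way a single bit-partition pass over a range could look (my own sketch of the concept, not the code from this thread): two cursors sweep inward, swapping elements so that everything with the current bit clear ends up to the left of everything with it set.

// One MSD-style bit-partition pass: after the call, [left, split) holds
// elements with `bit` clear and [split, right] holds elements with it set.
// A sketch of the concept only, not the thread author's implementation.
static unsafe int PartitionByBit(byte* data, int left, int right, byte bit)
{
    while (left < right)
    {
        // Advance past elements already on the correct side.
        while (left < right && (data[left] & bit) == 0) left++;
        while (left < right && (data[right] & bit) != 0) right--;

        if (left < right)
        {
            // Swap a misplaced pair.
            byte tmp = data[left];
            data[left] = data[right];
            data[right] = tmp;
        }
    }
    // `left` is now at the boundary; return the index of the first set-bit element.
    return (data[left] & bit) == 0 ? left + 1 : left;
}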

In my original version, I created memoryRanges (bit, left*, right*) which were loaded back into a new job. I just thought: if this continues without creating memoryRanges, and instead each thread/index has its own memoryRange access in an array (which can be updated and then activated on idle threads by a thread/index sharing a new range), plus its own queue if no threads are free, this may prove to be more streamlined (maybe!). It’s all about fewer ops/writes.

Actually, found my original thread you were commenting on before. Although I had good performance, my code then was a little convoluted! Burst inspiration - Unity Engine - Unity Discussions

@DreamingImLatios Here’s a thought. If I went with the original option of coming out of a job and then rescheduling jobs, aren’t the CPU core registers going to be cleared and set up again each time?

First off, you should never “come out of a job, and then reschedule jobs”. The main thread should schedule all your jobs in one go, setting up dependencies with JobHandles. You should never have to wait for the jobs to complete until the very end, when you need the final sorted results on the main thread (assuming you even need that; you might instead use those results in another job).
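For example, chaining everything up front with JobHandles (the job type here is hypothetical, shown only to illustrate the dependency pattern):

using Unity.Collections;
using Unity.Jobs;

// Hypothetical single pass of a sort pipeline, used only to show chaining.
struct SortPassJob : IJob
{
    public NativeArray<int> data;
    public void Execute() { /* one pass over data */ }
}

static class ChainingExample
{
    public static void Run(NativeArray<int> data)
    {
        // Schedule everything in one go; each handle feeds the next as a
        // dependency, so the main thread never completes jobs in between.
        JobHandle pass1 = new SortPassJob { data = data }.Schedule();
        JobHandle pass2 = new SortPassJob { data = data }.Schedule(pass1);
        JobHandle pass3 = new SortPassJob { data = data }.Schedule(pass2);

        pass3.Complete(); // wait only when the final result is needed
    }
}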

Second, populating the registers is like a few nanoseconds tops. And you’d be doing that once per job, which is not measurable. If you instead were referring to the caches, then something to note is that caches are always full. What changes is what they are full with. And the CPU itself does the swapping somewhat lazily. That means if you have one job that iterates over data, and then you have another job that runs right afterwards that also iterates over the same data, then the data may still be in cache (assuming the same CPU core does all that work, which the OS and Unity control). And in general, the data will probably be in the shared L3 cache. Nothing about jobs “clears” anything. The exception being Allocator.Temp acquired blocks and buckets (but not their contents).

Suppose I’m on about the faster L1/L2. If I’m working on the same memory, and essentially a thread might still be working on, for example, 50% of the same range on the next iteration, and then 50% of a new range on the one after, then if I schedule again with a dependency, will each iteration of the job be cleared from each core’s cache, i.e. break/reload from one dependency to the next? I’m not sure how smart the compiler can be, unless it says: OK, you have a dependency from JobTypeA to JobTypeA, I’ll combine those as if they are one while the iterations are separated.

When there is a bulk of larger memory ranges to sort in one go, it probably wouldn’t make much difference, but when it’s fragmented into many smaller ranges sorted/reiterated more frequently (focusing always from the left until completion), I just wonder if there is anything to what I’m saying. Not sure! I have read a lot on cache hits, localisation of data, simplified patterns, vectors, and inner loops. That doesn’t give me a skill level, just inspiration, and hopefully I learn something, or one day make something useful! :slight_smile:

Nothing will ever be cleared, but at any point, a different CPU core might pick up the thread and have to reload everything into L1/L2 again. Combining them into fewer jobs would decrease how often that happens, but that can be tough to do if each job is a parallel job.

The compiler doesn’t really know anything about caches nor the job dependency chain. Cache logic is handled almost entirely in hardware, with only a handful of instructions that can hint it to do things differently (which the hardware can choose to ignore entirely). But such instructions are rarely used (it is hard to not make things worse with them) and the compiler will never generate them without explicit intrinsics in your code.

Indeed, this is one of the main reasons why single-threaded code can often be far more efficient than multi-threaded code. There’s no silver bullet here. Only tradeoffs based on the needs of your use case. Do you have a very large array you need to have sorted as soon as possible? Then perhaps go parallel. Do you have a smaller array to sort, or perhaps a bunch of other work that also needs to happen around the same time? Then single-threaded may be faster.

I’ve said this before to you and I will say it again, having an explicit practical goal will make this stuff come easier to you. Even if your goal is temporary and made up on the spot, having a concrete target you are aiming to reach will allow you to test your reasonings and truly find out what works and what doesn’t. And if you can concretely explain your target to me, I can usually provide more focused tips.

Haha, thank you. The story of my life is trying to do too many things (or types of hobby) at the same time, rather than focusing on one goal and completing it, or becoming refined at understanding one thing completely. When I do manage that, though, it feels like an accomplishment. I do find that when I come back to things much later, they make more sense, like with the convoluted code in my original bitsort. I mean, it seemed to work very well, working on one small part at a time, but I fear the judgement I’d get for my method of doing it!

As for this iteration, I’ve not referred back to the previous one, only the idea: blank slate. The current sweep method of this iteration, which I worked on yesterday (without all the other stuff I need to do; forget the multi-threading for now), seems tidier, simpler and more logical to me.

I do wish I understood the Burst Inspector better, so I could tell whether something is more or less efficient. There seems to be much more going on than when I looked at it ages ago (actual Burst boilerplate on the job creation?). If you have a link to some reference someone has made on understanding it fully, that would be useful! I’m intrigued by the ‘while’ being above the do/while loop here.

Actually, as I recall, in my previous iteration I was shifting the bit of interest down to the least significant position and comparing it to 1 or 0, a lot of extra work for myself! So this has got to be way faster here.
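i.e. roughly the difference between these two tests (a trivial sketch; the variable names are made up):

uint value = 0b_1010_0110;
int bitIndex = 5;
uint bitMask = 1u << bitIndex; // precomputed once per sort pass

// Previous approach: shift the bit down to the least significant position, then compare.
bool setViaShift = ((value >> bitIndex) & 1u) == 1u;

// Current approach: test the bit in place with the precomputed mask.
bool setViaMask = (value & bitMask) != 0;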

I guess my point was to at least have something short-term to guide you. If you have ever seen Sebastian Lague’s videos, long term he follows all sorts of tangents and wild paths of exploration, often revisiting previous things. It may seem very unfocused and highly explorative. Yet at any given moment in any one of his videos, you’ll notice he has a clearly-defined goal, even a temporary one. I believe that’s one of the key factors that makes his explorations so successful, even if he makes missteps along the way sometimes. I only point this out because I can tell you really are trying to grasp some of these fundamental concepts, and based on what you have told me, I really think a slight shift in mentality could really help you. But I’m just some stranger on the internet. Do whatever makes sense to you. :smile:

I don’t see that from the screenshot.

Lots of good resource links on this page: About Burst | Burst | 1.8.17

Also, I’m sure I’ve linked this to you before, but I frequently cover and explain Burst-inspector details in this series: Latios-Framework-Documentation/Optimization Adventures/Part 1 - Find Pairs 1.md at main · Dreaming381/Latios-Framework-Documentation · GitHub

I suspect most people who use the Burst Inspector fluently have transferred skills from playing with https://godbolt.org/ in other languages. It may be worth looking up C++ blogs that reference this tool to help build up an understanding. Personally, I learned computer architecture via FPGAs (and some experiments involving caches) and then assembly via PIC microcontrollers, before transferring that knowledge to x86. That does make it a little hard for me to recommend resources.

But hopefully this all helps. If you have any more questions, don’t hesitate to ask me or reach out to me in a PM!
