Fetching DynamicBuffer Off Thread and Iterating in Parallel

Hi All,

I’m working with DynamicBuffer and having trouble piecing together a solution to get write access to a DynamicBuffer off thread and iterate over its contents in parallel.

For example, the following approach allows me to iterate and modify the DynamicBuffer in parallel but causes the calling system to wait on the main thread for write access to the DynamicBuffer<Grid_Humidity>.

Note: All of the grid layers are on a singleton Entity with a DynamicBuffer for each layer.

SelectionSystem.OnUpdate()

DynamicBuffer<Grid_Selection> selectionGrid = GetBuffer<Grid_Selection>(GetSingletonEntity<Grid_Selection>());
DynamicBuffer<Grid_Humidity> humidityGrid = GetBuffer<Grid_Humidity>(GetSingletonEntity<Grid_Humidity>());

Dependency = new AddHumidityWhereSelectedJob()
{
    AddAmount = (int)math.round(ADD_HUMIDITY_PER_SECOND * Time.DeltaTime),
    SelectionGrid = selectionGrid.AsNativeArray(),
    HumidityGrid = humidityGrid.AsNativeArray()
}.ScheduleParallel(humidityGrid.Length, 64, Dependency);

AddHumidityWhereSelectedJob

[BurstCompile]
private struct AddHumidityWhereSelectedJob : IJobFor
{
    [ReadOnly] public int AddAmount;
    [ReadOnly] public NativeArray<Grid_Selection> SelectionGrid;
    public NativeArray<Grid_Humidity> HumidityGrid;

    public void Execute(int index)
    {
        Grid_Humidity humidity = HumidityGrid[index];
        //TODO: Batch 4 at a time and profile
        humidity.Value = (ushort)math.mad(SelectionGrid[index].Value, AddAmount, humidity.Value);
        HumidityGrid[index] = humidity;
    }
}

This works well.

  • The changes are written back to the DynamicBuffer on my grid entity

  • The work is spread across my worker threads.

However, the GetBuffer<…>() calls in SelectionSystem.OnUpdate stall the main thread until write access can be gained (when other jobs stop writing to the buffers). This seems to be confirmed in the profiler. I understand the need to wait for write access but I’d like to wait off of the main thread so that the main thread can continue to progress through other systems.

An Existing Read Only Solution
In other systems where I only need read access to the grid I’ve solved this by scheduling a sequence of jobs that will

  • Copy the DynamicBuffer contents into a NativeArray via an IJobEntityBatch
  • Perform work that reads from the grid data (NativeArray)
  • Dispose the temporary NativeArray

This is great because it allows the system to schedule the three jobs and then get out of the way on the main thread while the jobs wait for data access and dependencies to be resolved.

Question
Knowing that

  • We can’t spawn jobs from within jobs
  • I want to spend the bare minimum time on the main thread
  • I need to write to one of the DynamicBuffers

How can I safely gain a reference to a DynamicBuffer off thread and then modify its contents in parallel?

I almost need some combination of IJobEntityBatch and IJobFor.ScheduleParallel()
or a way to pass a NativeArray reference to a job that won’t be initialized until the previous job completes.

Any thoughts or suggestions would be very much appreciated. There are a ton of tools at our disposal and I’m still getting a feel for how they can come together in useful ways.

Thanks!

The one downside of dynamic buffers is that you can’t write to a single instance in parallel without a sync and without disabling some safety checks. With that said, the combination you want is BufferFromEntity and [NativeDisableParallelForRestriction].

Thanks DreamingImLatios. I hadn’t thought to try out passing BufferFromEntity into the job.
What’s interesting is that the two approaches swap main thread and worker thread costs by roughly 10x in opposite directions.

Original GetBuffer Approach
Main thread: ~0.4ms
Worker threads: ~0.03ms

BufferFromEntity Approach
Main thread: ~0.03ms
Worker threads: ~0.3ms

This is what the new implementation looks like. I’m assuming this is what you were suggesting.

SelectionSystem.OnUpdate()

Entity gridEntity = GetSingletonEntity<Grid_Selection>();
GridSize gridSize = GetComponent<GridSize>(gridEntity);

BufferFromEntity<Grid_Humidity> humidityGridFromEntity = GetBufferFromEntity<Grid_Humidity>();
BufferFromEntity<Grid_Selection> selectionGridFromEntity = GetBufferFromEntity<Grid_Selection>(true);

Dependency = new AddHumidityWhereSelectedJob()
{
    AddAmount = (int)math.round(ADD_HUMIDITY_PER_SECOND * Time.DeltaTime),
    TargetEntity = gridEntity,
    SelectionGridLookup = selectionGridFromEntity,
    HumidityGridLookup = humidityGridFromEntity
}.ScheduleParallel(gridSize.Value.x * gridSize.Value.y, 64, Dependency);

AddHumidityWhereSelectedJob

[BurstCompile]
private struct AddHumidityWhereSelectedJob : IJobFor
{
    [ReadOnly] public int AddAmount;
    [ReadOnly] public Entity TargetEntity;
    [ReadOnly] public BufferFromEntity<Grid_Selection> SelectionGridLookup;
    [NativeDisableParallelForRestriction] public BufferFromEntity<Grid_Humidity> HumidityGridLookup;

    public void Execute(int index)
    {
        DynamicBuffer<Grid_Selection> SelectionGrid = SelectionGridLookup[TargetEntity];
        DynamicBuffer<Grid_Humidity> HumidityGrid = HumidityGridLookup[TargetEntity];

        Grid_Humidity humidity = HumidityGrid[index];
        humidity.Value = (ushort)math.mad(SelectionGrid[index].Value, AddAmount, humidity.Value);
        HumidityGrid[index] = humidity;
    }
}

The real unfortunate part is having to pay that BufferFromEntity lookup cost on every iteration.
The per thread init feature can’t come soon enough!

I wonder if it would be faster to copy the buffer to a native array, make my modifications, then copy the result back in three separate jobs. The trouble would be making sure that another job doesn’t come in and modify the DynamicBuffer between making the copy and writing back the results. Maybe I can figure out a way to hold the write reference through the whole run…I’ll give that a shot.

PS: Big fan of your optimization adventures posts. Thanks for everything you contribute.

There’s a few things to try:

  1. Depending on how many worker threads you have and the other kinds of jobs you have running, it might be worth it to use a single IJob and do the buffer fetching once at the beginning of the job.
  2. If parallelism is too helpful to give up, then try IJobParallelForBatch. It is kind of the middle ground between IJob and IJobFor.
  3. With the exception of IJobFor, where it wouldn’t make any difference, it is usually faster to get the dynamic buffer as a NativeArray before indexing it. The reason is that a dynamic buffer can be stored in one of two locations and there’s a branch in the indexer to resolve that. Getting it as a NativeArray caches the result of that branch so you have more direct index lookups.

I’m glad you enjoy it! :smile:


Thanks for the ideas :slight_smile:
I’ve done a ton of testing today and I’ll share some results tomorrow after I clean up the numbers and examples.

The overall takeaway though has been that for simple calculations (like in my example above) running a single thread job can go a long way. At ~260k iterations (512x512) a single thread IJobEntityBatch was by far the fastest approach so far. This exploration has been good to understand where the trade-offs are and the strengths of different approaches.

Re: Ideas 1) and 2) those are great suggestions. I’ll give them a shot. There are so many ways to mix and match the tools available I completely overlooked those approaches!

Re 3): Agreed, although I’ve found the overhead of DynamicBuffer.AsNativeArray() isn’t zero. In my previous post I tried calling AsNativeArray on every iteration instead of indexing the buffers directly, and it was slower. The cost of AsNativeArray is not worth it for the 1-2 index lookups I perform per iteration. No surprise there but worth mentioning.


Did I say I needed one day…should have said one week.

I did a bunch of tests and came to the following conclusions.

  • For smaller grids (Ex: 128x128) of simple calculations it’s not really worth the effort to go parallel. Just get off main thread and do your work on one thread.
  • For larger grids (Ex: 512x512) of simple calculations parallel is worth setting up.

So, with that in mind the following approaches were all comparable in terms of small grid performance:
Entities.ForEach and Schedule() - ~0.35ms on 128x128

  • Build a ForEach query with the buffer(s) you need
  • In the lambda method get a NativeArray reference to the Dynamic Buffers
  • Schedule the query
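
For anyone following along, a minimal sketch of those three steps could look like the following. It assumes Entities 0.17; the field types of Grid_Selection and Grid_Humidity, the system name, and the value of ADD_HUMIDITY_PER_SECOND are guesses for illustration, not my exact code.

```csharp
using Unity.Collections;
using Unity.Entities;
using Unity.Mathematics;

// Assumed shapes of the buffer element types (only the Value field is known from the posts above).
public struct Grid_Selection : IBufferElementData { public byte Value; }
public struct Grid_Humidity : IBufferElementData { public ushort Value; }

// Hypothetical system name.
public partial class AddHumidityForEachSystem : SystemBase
{
    const float ADD_HUMIDITY_PER_SECOND = 100f; // hypothetical value

    protected override void OnUpdate()
    {
        int addAmount = (int)math.round(ADD_HUMIDITY_PER_SECOND * Time.DeltaTime);

        Entities
            // By-value DynamicBuffer = read/write access, "in" = read-only access.
            .ForEach((DynamicBuffer<Grid_Humidity> humidityBuffer,
                      in DynamicBuffer<Grid_Selection> selectionBuffer) =>
            {
                // Resolve each buffer to a NativeArray once, then index the arrays.
                NativeArray<Grid_Humidity> humidity = humidityBuffer.AsNativeArray();
                NativeArray<Grid_Selection> selection = selectionBuffer.AsNativeArray();

                for (int i = 0; i < humidity.Length; i++)
                {
                    Grid_Humidity cell = humidity[i];
                    cell.Value = (ushort)math.mad(selection[i].Value, addAmount, cell.Value);
                    humidity[i] = cell;
                }
            })
            .Schedule(); // single worker thread, off the main thread
    }
}
```

Since the buffers are only requested through the query, the system schedules the work without calling GetBuffer on the main thread.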

IJob and Schedule() - ~0.33ms on 128x128

  • Create an IJob that takes in BufferFromEntity<T> instances for each DynamicBuffer<T> you need
  • In the job, on Execute() fetch the DynamicBuffer instances as native arrays and iterate through the buffer(s) doing your work
  • Schedule the job
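
A rough sketch of the IJob version, again assuming Entities 0.17. The job and system names, the element field types, and the ADD_HUMIDITY_PER_SECOND value are placeholders for illustration.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Entities;
using Unity.Jobs;
using Unity.Mathematics;

// Assumed shapes of the buffer element types.
public struct Grid_Selection : IBufferElementData { public byte Value; }
public struct Grid_Humidity : IBufferElementData { public ushort Value; }

[BurstCompile]
public struct AddHumiditySingleThreadJob : IJob // hypothetical name
{
    public int AddAmount;
    public Entity TargetEntity;
    [ReadOnly] public BufferFromEntity<Grid_Selection> SelectionGridLookup;
    public BufferFromEntity<Grid_Humidity> HumidityGridLookup;

    public void Execute()
    {
        // The lookups happen once per job run rather than once per element.
        NativeArray<Grid_Selection> selection = SelectionGridLookup[TargetEntity].AsNativeArray();
        NativeArray<Grid_Humidity> humidity = HumidityGridLookup[TargetEntity].AsNativeArray();

        for (int i = 0; i < humidity.Length; i++)
        {
            Grid_Humidity cell = humidity[i];
            cell.Value = (ushort)math.mad(selection[i].Value, AddAmount, cell.Value);
            humidity[i] = cell;
        }
    }
}

public partial class AddHumidityIJobSystem : SystemBase // hypothetical name
{
    const float ADD_HUMIDITY_PER_SECOND = 100f; // hypothetical value

    protected override void OnUpdate()
    {
        Dependency = new AddHumiditySingleThreadJob
        {
            AddAmount = (int)math.round(ADD_HUMIDITY_PER_SECOND * Time.DeltaTime),
            TargetEntity = GetSingletonEntity<Grid_Selection>(),
            SelectionGridLookup = GetBufferFromEntity<Grid_Selection>(true), // read-only
            HumidityGridLookup = GetBufferFromEntity<Grid_Humidity>()        // read/write
        }.Schedule(Dependency);
    }
}
```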

IJobEntityBatch and Schedule() - ~0.35ms on 128x128

  • Create a query that fetches the buffer(s) you want
  • Create an IJobEntityBatch that takes in a BufferTypeHandle for each buffer you need.
  • In the job, on Execute() use batchInChunk.GetBufferAccessor(someBufferTypeHandle) to get reference to your dynamic buffer(s). Convert those references to NativeArrays and iterate through the buffer(s) doing your work.
  • Schedule the job submitting the query you created in step 1
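
Sketched out, the IJobEntityBatch variant might look like this under Entities 0.17. As before, the names, the element field types, and the rate constant are assumptions made for the example.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Entities;
using Unity.Mathematics;

// Assumed shapes of the buffer element types.
public struct Grid_Selection : IBufferElementData { public byte Value; }
public struct Grid_Humidity : IBufferElementData { public ushort Value; }

[BurstCompile]
public struct AddHumidityBatchJob : IJobEntityBatch // hypothetical name
{
    public int AddAmount;
    [ReadOnly] public BufferTypeHandle<Grid_Selection> SelectionHandle;
    public BufferTypeHandle<Grid_Humidity> HumidityHandle;

    public void Execute(ArchetypeChunk batchInChunk, int batchIndex)
    {
        BufferAccessor<Grid_Selection> selections = batchInChunk.GetBufferAccessor(SelectionHandle);
        BufferAccessor<Grid_Humidity> humidities = batchInChunk.GetBufferAccessor(HumidityHandle);

        // With a singleton grid entity this outer loop runs exactly once.
        for (int e = 0; e < batchInChunk.Count; e++)
        {
            NativeArray<Grid_Selection> selection = selections[e].AsNativeArray();
            NativeArray<Grid_Humidity> humidity = humidities[e].AsNativeArray();

            for (int i = 0; i < humidity.Length; i++)
            {
                Grid_Humidity cell = humidity[i];
                cell.Value = (ushort)math.mad(selection[i].Value, AddAmount, cell.Value);
                humidity[i] = cell;
            }
        }
    }
}

public partial class AddHumidityBatchSystem : SystemBase // hypothetical name
{
    const float ADD_HUMIDITY_PER_SECOND = 100f; // hypothetical value
    EntityQuery gridQuery;

    protected override void OnCreate()
    {
        gridQuery = GetEntityQuery(
            ComponentType.ReadOnly<Grid_Selection>(),
            ComponentType.ReadWrite<Grid_Humidity>());
    }

    protected override void OnUpdate()
    {
        Dependency = new AddHumidityBatchJob
        {
            AddAmount = (int)math.round(ADD_HUMIDITY_PER_SECOND * Time.DeltaTime),
            SelectionHandle = GetBufferTypeHandle<Grid_Selection>(true),
            HumidityHandle = GetBufferTypeHandle<Grid_Humidity>()
        }.Schedule(gridQuery, Dependency);
    }
}
```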

For larger grids where parallel work is helpful I have two approaches that seem to work well.
IJobParallelForBatch and ScheduleParallel() - ~0.24ms on 512x512

  • Same approach as the IJob approach above but implemented with an IJobParallelForBatch.
  • You will need to know the size of your DynamicBuffer(s) when scheduling (before you have a reference to the buffers)
  • There are opportunities to tune the batch size in case your buffer(s) are occasionally small enough for the BufferFromEntity lookup to be too costly per thread.
  • Of course if your buffers are consistently small then you should use one of the single thread approaches above!
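
Here is a sketch of the parallel batch version, assuming Entities 0.17 and the Jobs package. It is scheduled with ScheduleBatch, which is the extension method I know of for IJobParallelForBatch; the method name may differ between package versions. The batch size of 4096, the type shapes, and the names are illustrative assumptions.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Entities;
using Unity.Jobs;
using Unity.Mathematics;

// Assumed shapes of the grid types.
public struct Grid_Selection : IBufferElementData { public byte Value; }
public struct Grid_Humidity : IBufferElementData { public ushort Value; }
public struct GridSize : IComponentData { public int2 Value; }

[BurstCompile]
public struct AddHumidityParallelBatchJob : IJobParallelForBatch // hypothetical name
{
    public int AddAmount;
    public Entity TargetEntity;
    [ReadOnly] public BufferFromEntity<Grid_Selection> SelectionGridLookup;
    // Disables the safety restriction so several batches may write to the same buffer.
    // Each batch must only touch its own index range.
    [NativeDisableParallelForRestriction] public BufferFromEntity<Grid_Humidity> HumidityGridLookup;

    public void Execute(int startIndex, int count)
    {
        // One lookup and one AsNativeArray() per batch instead of per element.
        NativeArray<Grid_Selection> selection = SelectionGridLookup[TargetEntity].AsNativeArray();
        NativeArray<Grid_Humidity> humidity = HumidityGridLookup[TargetEntity].AsNativeArray();

        for (int i = startIndex; i < startIndex + count; i++)
        {
            Grid_Humidity cell = humidity[i];
            cell.Value = (ushort)math.mad(selection[i].Value, AddAmount, cell.Value);
            humidity[i] = cell;
        }
    }
}

public partial class AddHumidityParallelSystem : SystemBase // hypothetical name
{
    const float ADD_HUMIDITY_PER_SECOND = 100f; // hypothetical value

    protected override void OnUpdate()
    {
        Entity gridEntity = GetSingletonEntity<Grid_Selection>();
        // The buffer length is not available yet, so the cell count comes from GridSize.
        GridSize gridSize = GetComponent<GridSize>(gridEntity);

        Dependency = new AddHumidityParallelBatchJob
        {
            AddAmount = (int)math.round(ADD_HUMIDITY_PER_SECOND * Time.DeltaTime),
            TargetEntity = gridEntity,
            SelectionGridLookup = GetBufferFromEntity<Grid_Selection>(true),
            HumidityGridLookup = GetBufferFromEntity<Grid_Humidity>()
        }.ScheduleBatch(gridSize.Value.x * gridSize.Value.y, 4096, Dependency);
    }
}
```

A larger batch size means fewer BufferFromEntity lookups overall, at the cost of coarser load balancing across worker threads.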

Copy Write Back - ~0.19ms on 512x512
(maybe there’s a better name?)
This is a three step process where you set up three jobs that depend on one another

  1. Copy the dynamic buffer(s) to a NativeArray (TempJob)
  2. Do the work in an IJobFor passing in the native arrays and writing changes back to them
  3. Copy the native array back to the dynamic buffer
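
The three jobs can be sketched roughly as below, assuming Entities 0.17. To keep it short, the work job just adds a fixed amount to every cell as a stand-in for the real calculation; the job and system names, the type shapes, and the constant are assumptions.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Entities;
using Unity.Jobs;
using Unity.Mathematics;

// Assumed shapes of the grid types.
public struct Grid_Humidity : IBufferElementData { public ushort Value; }
public struct GridSize : IComponentData { public int2 Value; }

[BurstCompile]
public struct CopyToArrayJob : IJob // step 1
{
    public Entity TargetEntity;
    [ReadOnly] public BufferFromEntity<Grid_Humidity> HumidityGridLookup;
    public NativeArray<Grid_Humidity> Destination;

    public void Execute() => Destination.CopyFrom(HumidityGridLookup[TargetEntity].AsNativeArray());
}

[BurstCompile]
public struct AddAmountJob : IJobFor // step 2 (simplified stand-in for the real work)
{
    public int AddAmount;
    public NativeArray<Grid_Humidity> Humidity;

    public void Execute(int index)
    {
        Grid_Humidity cell = Humidity[index];
        cell.Value = (ushort)(cell.Value + AddAmount);
        Humidity[index] = cell;
    }
}

[BurstCompile]
public struct CopyBackJob : IJob // step 3
{
    public Entity TargetEntity;
    public BufferFromEntity<Grid_Humidity> HumidityGridLookup;
    [ReadOnly] public NativeArray<Grid_Humidity> Source;

    // Assumes the buffer length still equals the temporary array length.
    public void Execute() => HumidityGridLookup[TargetEntity].AsNativeArray().CopyFrom(Source);
}

public partial class CopyWriteBackSystem : SystemBase // hypothetical name
{
    const float ADD_HUMIDITY_PER_SECOND = 100f; // hypothetical value

    protected override void OnUpdate()
    {
        Entity gridEntity = GetSingletonEntity<Grid_Humidity>();
        GridSize gridSize = GetComponent<GridSize>(gridEntity);
        int cellCount = gridSize.Value.x * gridSize.Value.y;

        var temp = new NativeArray<Grid_Humidity>(
            cellCount, Allocator.TempJob, NativeArrayOptions.UninitializedMemory);

        JobHandle copyIn = new CopyToArrayJob
        {
            TargetEntity = gridEntity,
            HumidityGridLookup = GetBufferFromEntity<Grid_Humidity>(true),
            Destination = temp
        }.Schedule(Dependency);

        JobHandle work = new AddAmountJob
        {
            AddAmount = (int)math.round(ADD_HUMIDITY_PER_SECOND * Time.DeltaTime),
            Humidity = temp
        }.ScheduleParallel(cellCount, 64, copyIn);

        JobHandle copyBack = new CopyBackJob
        {
            TargetEntity = gridEntity,
            HumidityGridLookup = GetBufferFromEntity<Grid_Humidity>(),
            Source = temp
        }.Schedule(work);

        // Free the temporary array once the last job in the chain has finished.
        Dependency = temp.Dispose(copyBack);
    }
}
```

Because the whole chain ends up in Dependency and the system requests write access to Grid_Humidity, my understanding is that jobs scheduled by later systems against that buffer will wait for the copy back to finish.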

The Copy Write Back approach was surprisingly fast. I didn’t expect it to keep up with IJobParallelForBatch, and in fact it is quite a bit faster for much larger grids at 2048x2048:

  • IJobParallelForBatch - ~3.6ms
  • Copy Write Back - ~2.6ms

If anyone thinks there’s something suspicious going on here or my conclusions are off base feel free to let me know and I’ll investigate! I’m also happy to share implementations in case it helps someone else. Just ask!

Thanks @DreamingImLatios for the suggestions and help. This has been a great learning experience.

The fact that your final approach involving copying to a temporary array is faster than iterating on the data directly suggests that the code gen for IJobParallelForBatch with ScheduleParallel is suboptimal. It would be interesting to investigate why as you likely could see those times cut in half or more.

@mbaker1 Did you test with safety checks enabled? Did you test in a build? (That should really be the only valid test.)

.AsNativeArray() is literally getting the pointer
[screenshot: source of the pointer fetch]
and just creating one NativeArray struct (without allocating memory, as it just directly assigns the buffer pointer)
[screenshot: source of the NativeArray construction]

(I’ve shown the actual code without all the safety stuff, which is removed from builds.) Sorry, but I don’t believe that getting a pointer, creating a single struct with a couple of int fields, and then using that in another job will ever be slower than copying the whole buffer into a newly allocated array, using it somewhere else, and copying it back. Do your tests count the allocation of the new arrays for copying back and forth in the last case?
It would probably be better to see the final test code, to be sure it’s a fair comparison.


Good point and exactly why I was surprised. With fresh eyes today I did notice that I forgot to use .AsNativeArray() in my IJobParallelForBatch. Fixing that oversight does in fact cut the time by ~50%.

New times on a 2048x2048 grid:
IJobParallelForBatch - ~1.6ms
Copy Write Back - ~2.6ms

That’s more like it!


Some strange observations.

Inconsistently Slow Copying DynamicBuffer to NativeArray Job
Occasionally, my copy write back approach takes far longer to copy one of the buffers to a NativeArray (both in editor and standalone). I haven’t pinned down whether this changes between compiles or run instances but it’s strange.
When this issue does come up every frame will have this slower copy to native array phase. Then other run instances (or builds) will be a more reasonable ~0.3ms.

Slow Copy to NativeArray Example

Fast Copy to NativeArray Example

Periodic Slow IJobParallelForBatch in Standalone Build
About every 10 frames in a standalone build I’ll see the IJobParallelForBatch take 12ms-27ms and only use two worker threads. No idea what’s happening there but something to investigate another day. I’m kind of curious whether the scheduler accounts for efficiency cores vs performance cores when allocating work.
(Tested on an M1Pro so maybe that’s a factor)


AsNativeArray() Overhead - almost nothing.
@eizenhorn I think there’s a misunderstanding here. You’re right that the overhead is minimal and that AsNativeArray() should generally be used. But it’s not zero, and where I was seeing a slowdown from AsNativeArray() was in one of my first, very naive implementations where the AsNativeArray() call was being made on every iteration.

Example - Every Execute call in IJobFor:
GridBufferFromEntity[TargetEntity].AsNativeArray()
This was slower than just using the Dynamic buffer directly when you’re only reading/writing one element.

Of course, knowing what I know now, it’s obviously not a good idea to fetch through a BufferFromEntity instance on every iteration, so the point is moot.

I’m curious which Unity and Entities versions you use, as I’d expect Burst-compiled jobs to be light green in the profiler.


Unity 2020.1.9f1
Entities 0.17.0-preview.42
DOTS Editor 0.12.0-preview.6
Jobs Version 0.8.0-preview.23
Burst 1.4.1

Those are the latest versions that work well together as far as I could tell but let me know if that’s changed. It wasn’t super clear what versions worked well together.

Well, the latest 2020.3 LTS works well (the only thing is that NativeList needs a small fix in the package source: change the allocator type), and we’re on 2021.2 with a game in release, where everything also works fine (with the NativeList fix mentioned above, and excluding subscenes; we don’t use them, so I can’t comment on their stability).

Did you upgrade the other packages as well or just the editor?

I think we’ll probably stick with our versions for now since we’re not hitting any major limitations and are a long ways from launch. No need to get into locally modifying package source at this point!

Good to know what upgrades are possible if needed though. Thanks!