Question [Compute shaders][multithread] Trying to update a ComputeBuffer off-thread

Hi all, sorry if posting to the wrong forum, I couldn’t find a more appropriate one. Also it’s a big’un. I promise there’ll be a joke or two.

Bit of context :
My app displays very large point-cloud data that is selected by users at runtime. I’m rewriting the rendering of these point clouds to do as much of the work as possible off the main thread.

I have an approach that works pretty well, where I maintain a very large compute buffer with “slots” that hold the compressed data of nodes currently in the frustum, then have one call to Graphics.RenderPrimitives(…); draw the whole pointcloud in one drawcall. The vertex shader decompresses the point data, and a geom shader emits a quad. It’s great.

Using Unity 2021.3.7f1 at the moment.

The problem at hand :
The bottleneck of this approach is of course copying arbitrary data to the large compute buffer, in small batches that go to different offsets.
All the documented ways to get data on the GPU that I found have some dependency or other on the main thread, which I really wish I could sidestep, as the display algorithm already deals with data being “on its way” in a threadsafe way.

The things I tried :

  • bigBuffer.SetData(bytesReadOffThread, offset, size); Works great, chokes the main thread

  • Pairs of bigBuffer.BeginWrite<>(); and bigBuffer.EndWrite<>();

      • If I write the data on the main thread: works great, chokes the main thread a little less.

      • If I hand the returned NativeArray to another thread to copy the data, then call EndWrite on the main thread: blazing fast (well, fast)! But I can only have one copy operation per frame :cry: (can’t call BeginWrite multiple times).

  • Updating to Unity 2022.3 (that… didn’t go well (link)) and shifting to GraphicsBuffers so I can use a combination of GraphicsBuffer.LockForWrite and Graphics.CopyBuffer(). ლ(ಠ益ಠლ)

  • current approach and bug below

The bug I’m encountering :
My current approach is a hybrid one where I have a multitude of in-flight BeginWrite<> operations to smaller compute buffers, which I call “source”, and then use a very small compute shader to copy those to the main one. Here is the compute shader (insanely simple). The small buffers are part of a pool that I access in a thread-safe way.

#pragma kernel Copy

RWStructuredBuffer<int> destination;
StructuredBuffer<int> source;
int dataOffset;
[numthreads(64,1,1)]
void Copy (uint3 id : SV_DispatchThreadID)
{
    const uint idx = id.x;
    destination[dataOffset + idx] = source[idx];
}

This works reasonably well until, at some point, the big compute buffer gets wrong data. Specifically, it seems to receive data that was written to the same “source” small buffer on later frames. I was convinced that a computeShader.Dispatch() call would always finish on the current frame, but it seems dispatches can span large timeframes. I can’t for the life of me find a way to enforce coherence, or at least get notified when coherence is OK.
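For readers following along, here is a minimal sketch of what the C# side of such a copy could look like. The class and member names (`SlotCopier`, `QueueCopy`, `GroupCount`) are hypothetical and are not taken from the original project; only the kernel name and the shader property names come from the compute shader above. Note that the kernel has no bounds check, so if `count` is not a multiple of 64 the extra threads of the last group also run and would write into the neighbouring slot unless a guard is added to the kernel.

```csharp
using UnityEngine;

// Hypothetical helper: names are illustrative, not from the original project.
public sealed class SlotCopier
{
    // Must match [numthreads(64,1,1)] in the Copy kernel.
    const int ThreadsPerGroup = 64;

    readonly ComputeShader copyShader;
    readonly ComputeBuffer bigBuffer;
    readonly int kernel;

    public SlotCopier(ComputeShader copyShader, ComputeBuffer bigBuffer)
    {
        this.copyShader = copyShader;
        this.bigBuffer = bigBuffer;
        kernel = copyShader.FindKernel("Copy");
    }

    // Number of thread groups needed to cover 'count' elements, rounded up.
    public static int GroupCount(int count) =>
        (count + ThreadsPerGroup - 1) / ThreadsPerGroup;

    // Call from the main thread. Dispatch only queues the work; as observed in
    // this thread, the GPU may execute it noticeably later, so 'source' should
    // not be rewritten until the copy is known to have run.
    public void QueueCopy(ComputeBuffer source, int destinationOffset, int count)
    {
        copyShader.SetBuffer(kernel, "destination", bigBuffer);
        copyShader.SetBuffer(kernel, "source", source);
        copyShader.SetInt("dataOffset", destinationOffset);
        copyShader.Dispatch(kernel, GroupCount(count), 1, 1);
    }
}
```

Here `destinationOffset` and `count` are expressed in `int` elements, matching the `StructuredBuffer<int>` declarations in the kernel.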

Mitigation :

  • a call to source.GetData() seems to enforce coherence, at the cost of speed (a lot)

  • Putting the source buffers on “timeout” (about a second) for some time seems to avoid the problem but it is very brittle, and it means having a lot of them hogging resources. For context, “bigBuffer” is 2GB, and sources are 200KB (for now, but I need to tweak that at some point).

  • using a brand-new source buffer for every iteration. This works, but it seems like a waste of resources (I haven’t profiled yet, but all these new() make me uncomfortable). Also, I need to put the buffers on timeout for an undetermined amount of time between Dispatch() and Release(), otherwise the bigBuffer receives all zeroes. (╯°□°)╯︵ ┻━┻

Question(s) :
I’m willing to revisit my approach, so is there an API I didn’t find that would

  • let me update my bigBuffer from off the main-thread ?
  • enforce coherence on a ComputeBuffer after a bunch of computeShader.Dispatch() calls (some kind of fence maybe? I can’t find the proper docs, as this (link) is pretty terse and I need an example)

Here are things I thought of but haven’t tried yet :

  • messing with the internalPtrs (somehow) so that the source buffers actually are “views” into the big one

  • writing a dll that does the whole thing, bypassing Unity’s thread locks where necessary

  • using something like Graphics.CopyBuffer (but GraphicsBuffers don’t seem to have any fast/off-thread way to upload data)

  • asking the forums ಥ﹏ಥ

  • edit : typos
    Obligatory trail of messed-up syntax that appeared out of nowhere while editing this behemoth!
    ಠ_ಠ

Another idea that I have is to have multiple bigBuffers, and start an AsyncGPUReadbackRequest after dispatching copies. That way, every time a batch of copies is sent, it only stops that one buffer from being updated for some time. It may even help me handle the sparsity that I get because nodes have wildly different sizes…
I’ll post an update if that works.

Have you tried using a GraphicsFence to ensure that the large buffer is not accessed until the small copies are complete?

Hi and thanks, I’ve thought of using GraphicsFence, but as said, I find the docs lacking in my case. I couldn’t find any example of them used with computeShaders. It seems they are used for when you build command buffers. Are you aware of a way to use them in conjunction with computeshaders?

I’m currently profiling to see if I’m not doing something stupid, but the frame time is dominated by a Graphics.Semaphore, which doesn’t bode well for a fence.

You can add a compute shader dispatch to a command buffer like so:
Unity - Scripting API: Rendering.CommandBuffer.DispatchCompute.

Thank you I’ll try that !

OK so, as I said, the docs are pretty terse around GraphicsFence, but I seem to have something that should work, if it were not for this infuriating bug:
[Screenshot attachment: upload_2023-9-1_14-4-48.png]

No matter how I try, the GraphicsFence is always of type Async, but it seems that no Windows platform supports async compute (I had to actually test them all, as this is documented nowhere I could find…). Therefore, fence.passed always throws.

There is NO way I’m the only person trying to use GraphicsFence ?

So, since no pipeline seems to support async compute, GraphicsFences basically can’t be used? That can’t be right?

There must be an issue with your setup. This is possibly because you are trying to use GraphicsFenceType.CPUSynchronisation even though the documentation explicitly says it is not supported. I was able to make a very simple async command buffer execution as follows:

CommandBuffer comBuff = new CommandBuffer();
comBuff.SetExecutionFlags(CommandBufferExecutionFlags.AsyncCompute);

comBuff.DispatchCompute(computeShader, kernelOne, 256, 1, 1);
comBuff.CreateAsyncGraphicsFence();
comBuff.DispatchCompute(computeShader, kernelTwo, 256, 1, 1);
   
Graphics.ExecuteCommandBufferAsync(comBuff, 0);

//to read the output of the kernels
uint[] validationArray = new uint[size];

yourComputeBuffer.GetData(validationArray);
foreach (uint g in validationArray)
      Debug.Log(g);

Hi and thanks @b0nes123 ,
Unfortunately that only works because computeBuffer.GetData(…) forces synchronization behind the scenes. The whole point of my question is to avoid locking the main thread, and GetData does exactly that. Moreover, I don’t actually need the data read back on the CPU, I just need to know it’s been copied on the GPU. It seems there is no way to asynchronously wait on the GraphicsFence returned by comBuff.CreateAsyncGraphicsFence(). What I tried was:

CommandBuffer comBuff = new CommandBuffer();
comBuff.SetExecutionFlags(CommandBufferExecutionFlags.AsyncCompute);
comBuff.DispatchCompute(computeShader, kernelOne, 256, 1, 1);
var fence = comBuff.CreateAsyncGraphicsFence();
comBuff.DispatchCompute(computeShader, kernelTwo, 256, 1, 1);
 
Graphics.ExecuteCommandBufferAsync(comBuff, 0);
// on subsequent frames
if(fence.passed){ //<= exception because SystemInfo.supportsAsyncCompute is false
   // ... release locks, etc...
}

This fails on all the backends I tried (OpenGL Core, DX11, DX12 and Vulkan) because all of them have SystemInfo.supportsAsyncCompute set to false.
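One way to avoid the exception while experimenting is to check the platform capabilities before polling the fence. The sketch below assumes, as described above, that the exception is tied to `SystemInfo.supportsAsyncCompute` being false; the `FenceSupport` class and its members are hypothetical names introduced for illustration.

```csharp
using UnityEngine;
using UnityEngine.Rendering;

// Hypothetical helper: guards GraphicsFence.passed behind capability checks.
public static class FenceSupport
{
    // True only when polling an async-queue fence from the CPU is expected to work
    // on the current graphics backend.
    public static bool CanPollAsyncFence =>
        SystemInfo.supportsGraphicsFence && SystemInfo.supportsAsyncCompute;

    // Returns false (instead of letting fence.passed throw) when the platform
    // reports no support; 'passed' is only meaningful when the method returns true.
    public static bool TryHasPassed(GraphicsFence fence, out bool passed)
    {
        passed = false;
        if (!CanPollAsyncFence)
            return false;

        passed = fence.passed;
        return true;
    }
}
```

When `TryHasPassed` returns false, the caller needs some other completion signal, since the fence cannot be polled on that backend.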

However, I have had good results by using two AsyncGPUReadbackRequest(s) (one for the source, one for the sub-part of the dest). These seem to actually work asynchronously!
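A minimal sketch of the readback idea, under the assumption (consistent with the results reported here, but not something the docs were found to guarantee) that a readback queued after the copy dispatch only completes once that dispatch has executed. `CopyCompletion` and `NotifyWhenCopied` are hypothetical names, and the sketch uses a single small readback on the destination rather than the two requests mentioned above.

```csharp
using System;
using UnityEngine;
using UnityEngine.Rendering;

// Hypothetical helper: uses a tiny async readback as a "copy finished" signal.
public static class CopyCompletion
{
    // Queue this right after dispatching the copy into 'destination'.
    // 'byteOffset' is the offset in bytes of the copied region; 'onCopied' is
    // invoked once the readback has completed without error.
    public static void NotifyWhenCopied(ComputeBuffer destination, int byteOffset, Action onCopied)
    {
        // Read back a single int: the data itself is not needed, only the completion.
        const int probeSizeInBytes = sizeof(int);

        AsyncGPUReadback.Request(destination, probeSizeInBytes, byteOffset, request =>
        {
            if (request.hasError)
            {
                // Keep the source buffer locked rather than risk reusing it too early.
                Debug.LogWarning("Readback failed; source buffer not released.");
                return;
            }
            onCopied();
        });
    }
}
```

With the copy kernel shown earlier, `byteOffset` would be `dataOffset * sizeof(int)`, and `onCopied` would return the pooled source buffer so it can be written again.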

Still way too many hoops to jump through, but it works reasonably well. Now if I could somehow declare a Sub-ComputeBuffer and write to that, I would avoid the whole problem…