Help optimizing compute shader

Looking how to utilize as much parallelization processing as possible for the following task:

A screen-sized texture is completely black except for blue pixels which are in close proximity to geometry edges (this is a game scene). So at a given time, a small percentage of the texture is blue. This texture represents a mask that only colors pixels near an edge. Each blue pixel can access four unique subsample values of the depth buffer. For each subsample, I need to calculate the difference between its own depth value and the depth value of all of its surrounding subsamples (up to two or three pixels away, not sure yet), check if the difference is greater than a threshold value, and if so, store the distance between those two subsamples. Finally, find the minimum distance stored. This generates an SDF value which describes how close that subsample is to an ‘edge.’ The image illustrates one of these groups of comparisons. This has to be performed again for every subsample.

This is obviously a very large task which is why I am trying to utilize the parallel processing capabilities of a compute shader. I want to disperse the calculations to as many threads as possible, in order to avoid a small number of threads performing very large loops but I am challenged to find a way to organize the calculations so that only nearby subsamples are compared with each other.

I originally thought of having the blue pixels send out groups of values into an appendbuffer (where the element type would be an array), and then each thread could look-up the array in that appendbuffer that corresponds to its index (ie appendbuffer[(globalID.y * screenWidth) + globalID.x)]) and find the minimum distance for the values in that array. The trouble comes after this step though. I would need to then find the minimum of THOSE minimums to find the absolute minimum, and there doesn’t seem to be an obvious way to keep arrays for the same subsample stored adjacently in the appendbuffer. Therefore there is not an obvious logic to ensure that the correct minimums are compared with each other. Adding values to an append buffer is never done in a predictable order. This approach might also be suboptimal in terms of cache coherency

I am not a graphics programmer (or any programmer for that matter) by profession. Hoping someone with a lot of experience can weigh in here and give a good recommendation. I am measuring the compute shader’s performance in RenderDoc

Only skimmed over it but sounds like something that could be solved with group shared memory and group syncs (memory barriers).

run a compute shader where each thread operates on an output pixel
in each thread

  • gather all the distances to surrounding pixels and store the minimum in group shared memory
  • GroupMemoryBarrierWithGroupSync
  • have the first thread compute the minimum of the minimums and store it in a small output buffer of size (width / NUM_THREADS_X, height / NUM_THREADS_Y) at (SV_GroupThreadID.x, SV_GroupThreadID.y)

Rinse and repeat using the small buffer as input, storing the result in an even smaller output buffer.

Repeat until you have only one value left.

You’d have to profile it to find the best group size. I think, the maximum is 1024 threads per group but that means, the first thread would be doing a lot of work. There is also a limit to group shared memory, of course.

Don’t forget that you need an UAV barrier between the dispatch calls in DX12 and OpenGL. I’m not sure if Unity handles that automatically for you.

PS: Theoretically, you could also use more than one group sync to reduce the amount of work for the first thread. E.g. you could use half of the threads to calculate the minimum of two values, sync, use a quarter of the threads to calculate the minimum of the previous results etc.

By the way, this is called reduction. You’ll find a lot of information if you google that. E.g. first link I found was this (but haven’t read it yet)

1 Like

Thanks for your reply. I definitely agree that using reduction techniques in the overall approach is very important. However, the suggested approach still relies on a smaller number of threads than I would like, and relies on loops more than I would like. I was hoping for a way to utilize more parallelization by assigning more individual calculations to a unique thread. With your approach, there is also another limitation - neighboring pixels that need to be checked/compared may belong to a different threadgroup than the pixel they are being compared to.

My ideal approach would be to have every blue pixel send its subsample values and a handful of comparator subsample values into an appendbuffer whose element is a float array. Then a subsequent compute shader could have each thread do operations on the small float array that corresponds to that thread’s id (which would be the lookup index in the appendbuffer). That way I could split up the calculations into small pieces and sent them to unique threads, while still making sure that I still keep the appropriate groups of values together.

Do you know if there is a way to have one compute shader append to a buffer, and have a subsequent compute shader read that buffer? I know that fragment shaders can read an appendbuffer without any intermediary CPU communication, ie using DrawProceduralIndirect(). But I have never seen a case of a compute shader reading from an append buffer… Thats what I am currently needing help with

Then I misunderstood something. I thought you’d only need to check neighboring pixels in the very first step when you are reading from the depth buffer. It wouldn’t be a problem to do that since you are reading from a texture, not group shared memory in the first step. But as I said, I only skimmed over it, sorry.

Don’t understand why you’d like to use an append buffer. Can’t you just use a 2D buffer/texture? This would be faster because it doesn’t need to lock. Don’t you know the number of output values upfront?

Not from the top of my head, sorry. But I’d be surprised if that wasn’t possible.

You are correct on the first point. Thanks for pointing that out. The reason that I want to use an appendbuffer (and it looks like I have found a way to read the appendbuffer in a subsequent compute shader!) is twofold:

  1. currently the depth values are being read at “blue pixels” where edges are detected in the scene’s geometry. This is going to be different for any given frame, it is not predictable. There is going to be a relatively small number of pixels that are “edge pixels” so if I do all my calculations on threads that correspond to those pixels, I am using a very limited number of threads. And there is a LOT of calculations to perform on the data in each pixel (ie, each subsample being compared to all surrounding 64 other subsamples, then finding the minimum of all results, this whole process is done for every subsample). This means a lot of loops per thread. I want to spread all the calculations across as many threads as possible to avoid loops since parallel processing is faster.

2). By sending small chunks of values to each element in an appendbuffer (ie, sending groups of 4 nearby subsamples) I can now allocate small portions of data across as many threads as I want. So if I had 80 “edge pixels” and this corresponded to something like 50,000 groups of 4x4 comparisons, I can send that data to an appendbuffer containing 50,000 elements, and then have 50,000 threads perform these calculations instead of 80 (even if I used the whole thread group the white pixels are found in, I would still be limited to like 5,000 or something). So with my calculations spread across many more threads, I have much fewer loops to wait for per thread.

I will absolutely use the resource you linked that discusses reduction. That is definitely going to help with performance because I will not be able to avoid loops entirely

lets say I have my thread group size set to 16x16. Now consider a 16x16 block of pixels that contains 10 “edge pixels”. Each edge pixel has 4 subsamples. And we want to do our subsample comparisons across all subsamples in the surrounding 3x3 region of pixels. 3x3 is 9 pixels, each one has 4 subsamples, so that means 36 comparisons. So if each subsample requires 36 comparisons, and there are 4 subsamples in each edge pixel, I am looking at (10 x 4 x 36) = 1,440 comparisons. So if my thread group is 16x16, I have much fewer threads than comparisons. If I had a larger thread group size, then I could potentially have many more “edge pixels” as well. So that doesn’t solve the problem. Allocating small sets of data to arrays in an append buffer truly lets me spread all the comparisons out over as many threads as I want. I have not tested this yet, so let me know if you see an obvious flaw. I always appreciate people helping me think about my problems!

So you are doing
if (subsample > 0) result = min(absolute_min, subsample);
?

This is worse than having just the min. The cost of min is 1 cycle, the cost of dynamic branching is about 5 (ballpark numbers, depend on hardware). All threads run in lock step, so if one thread takes the branch, all threads take the branch.

Check out this video starting at 10:06 to see what I am doing.

I think branching is unavoidable because the “mask” is required to limit the heavy calculations to a smaller set of values instead of performing them on every pixel in the screen