Compute shader thread dispatching

Hey

I am confused about what numbers I should use for the threads in my compute shader.

I want to process a buffer of 4096 float2 elements.

So what should I be using as my thread counts and Dispatch calls in the C# code? I asked on Stack Overflow and got no information.

Do I just do:

[1,1,1] in C# and [4096,1,1] in the compute shader?

I can't find a good explanation of this stuff; it's a bit of random guesswork at the moment. Can anyone explain how to pick good number combinations?

Thanks


Unless your algorithm needs threads in a group to inter-communicate via groupshared memory, you shouldn't be using such large thread group sizes. 128 or 256 are good numbers.

This is a good read on the subject:

This is also a good link to always check when in doubt about the various IDs used in compute shaders:


I will give it a read, though some of it seems to be over my head a bit. Is that article from GPUOpen only applicable to AMD?

No, NVidia hardware works the same. The only difference is that the “warp”/“wavefront” on NVidia GPUs has 32 threads instead of 64 (but you should still use multiples of 64 even on NV hardware, because they have a “dual dispatch” system).

It's no problem if you can't take it all in at once, but it should clear up some points.

In short: the thread group size defines how many threads will be made to work as a “group”. The primary reason for grouping threads is to use groupshared memory: it’s a fast writeable memory that is visible to all threads in the same group and is what truly sets compute shaders apart from pixel and vertex shaders.

However, GPUs have a limited number of registers (the “VGPRs”, used to store variables during shader execution), and the larger your thread group is, the more registers each group will consume, which can actually harm parallelism depending on what algorithm you're running.
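To make groupshared memory concrete, here is a minimal sketch of a parallel sum in HLSL (the kernel and buffer names are made up for illustration): each group of 128 threads loads its slice of the input into groupshared memory, synchronizes, and then reduces it to one value.

#pragma kernel Reduce

StructuredBuffer<float> Input;     // hypothetical input buffer
RWStructuredBuffer<float> Output;  // hypothetical output buffer, one float per group

// Fast on-chip memory shared by all 128 threads of one group.
groupshared float cache[128];

[numthreads(128, 1, 1)]
void Reduce(uint3 gtid : SV_GroupThreadID,
            uint3 dtid : SV_DispatchThreadID,
            uint3 gid  : SV_GroupID)
{
    cache[gtid.x] = Input[dtid.x];
    GroupMemoryBarrierWithGroupSync(); // wait until every thread in the group has written

    // Tree reduction: sum 128 values in 7 halving steps.
    for (uint s = 64; s > 0; s >>= 1)
    {
        if (gtid.x < s)
            cache[gtid.x] += cache[gtid.x + s];
        GroupMemoryBarrierWithGroupSync();
    }

    if (gtid.x == 0)
        Output[gid.x] = cache[0]; // one partial sum per group
}

None of this inter-thread cooperation is possible in a vertex or pixel shader, which is the point being made above.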

1 Like

Does this mean I should not have groups/threads of less than 64? So for example [4,1,1] and [4,1,1] would actually be less performant, because 4 x 4 = 16, which is less than 64 and thus not a multiple?

So I should at least aim for values that multiply to a number that is 64 or greater and is a multiple of 64? Such as at least [8,1,1] [8,1,1] as an absolute minimum?

Should I assume that the work groups I choose in the C# script are the same as how many threads run per frame? Or do work groups also work in parallel, as well as the threads?

It's also not clear to me when I would use the second or third dimensions of the groupings.

You can have groups smaller than 64, but the GPU will use 32/64 threads regardless (and just discard the work of the excess threads), causing processing power to be wasted. For example, a (4, 1, 1) kernel will only use 12.5% of a warp on Nvidia GPUs (4 of 32 lanes), wasting the other 87.5%.

The 2nd and 3rd dimensions are just for convenience when working on 2D and 3D workloads, since you get a 2D/3D identifier per thread neatly calculated for you. The total number of threads per group is what matters (X × Y × Z).
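For example, here is a sketch of a 2D workload (the kernel name, texture name, and size are made up): each thread gets a ready-made 2D pixel coordinate, with no manual index math.

#pragma kernel Gradient

RWTexture2D<float4> Result; // hypothetical 512x512 texture

[numthreads(8, 8, 1)]
void Gradient(uint3 id : SV_DispatchThreadID)
{
    // id.xy is already a 2D pixel coordinate.
    Result[id.xy] = float4(id.x / 512.0, id.y / 512.0, 0, 1);
}

// C# side: shader.Dispatch(kernel, 512 / 8, 512 / 8, 1); = 64 x 64 groups.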

Remember that the numbers you pass to Dispatch() are the number of groups, not threads. If you want to process 4096 items and your kernel group size is (128, 1, 1), you need to call Dispatch(32, 1, 1), since 4096 / 128 = 32.
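Putting that together for the original question, a sketch in HLSL (the buffer name and the doubling operation are placeholders):

#pragma kernel CSMain

RWStructuredBuffer<float2> Values; // the 4096 float2 elements

[numthreads(128, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // id.x runs from 0 to 4095 across the whole dispatch.
    Values[id.x] *= 2.0;
}

// C# side: 4096 items / 128 threads per group = 32 groups.
//   shader.Dispatch(kernel, 32, 1, 1);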


It's not obvious to me, however, what difference it makes if I did 128 * 32 or 64 * 64 to process 4096.

The smaller it is, the more opportunities the GPU has to parallelize it.

The smaller the threads per group, you mean?

64 threads should do well on NVidia, AMD, and Intel hardware. Using fewer than that means underutilizing the hardware.

Example:

Your CS has numthreads(4,1,1) and you call Dispatch(1024,1,1). You're using an NVidia GPU with 512 shader cores (just an example). Each core will be handed two groups to process. Each core always works in batches of 32 threads, but your kernel group size is only 4x1x1, so only 4 threads per batch will do any meaningful work, the other 28 being wasted. All cores will have to run twice to process all your work.

If instead you used numthreads(64,1,1), you would only need to dispatch 64 groups (64 × 64 = 4096 threads), and the GPU would be able to do the same work using only 64 cores, each running the two 32-thread batches of its group with no lanes wasted. The work would finish just as fast, and the other 448 cores would remain available to run other tasks.

Keep in mind that the “threads” in a warp/wavefront aren’t like threads on a CPU. They are actually SIMD lanes (https://www.sciencedirect.com/topics/computer-science/single-instruction-multiple-data). This means that each operation is performed simultaneously, in lock-step, on all 32/64 lanes at once. Imagine that each variable in your shader is actually an array under the hood, and operations like add, multiply, etc. operate on all items in the array at once. This is normally hidden/abstracted away from developers (we write our shaders as if they operate on a single element at a time, be it a vertex, a pixel, or a CS thread), but understanding it is vital to extracting good performance out of a GPU.
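One practical consequence of this lock-step execution, sketched below with made-up stand-in functions: when lanes of the same warp take different branches, the hardware runs both branches for the whole warp and masks out the inactive lanes.

#pragma kernel CSMain

RWStructuredBuffer<float> Out; // hypothetical output buffer

float ExpensiveA(uint i) { return sin((float)i); } // stand-in for heavy work
float ExpensiveB(uint i) { return cos((float)i); } // stand-in for heavy work

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Adjacent lanes diverge on every other thread, so this costs roughly
    // ExpensiveA + ExpensiveB per warp, not just one of them.
    float v;
    if (id.x % 2 == 0)
        v = ExpensiveA(id.x);
    else
        v = ExpensiveB(id.x);
    Out[id.x] = v;
}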


A few quick rules to help you and future ComputeShader users wrap your heads around it.

  • After calling computeShader.Dispatch(0,1,1,1); your CPU does not wait for the GPU. The dispatch is queued and runs asynchronously; the main thread only stalls when you read the results back, e.g. with ComputeBuffer.GetData.

  • This is critical information that confuses many new users, especially those who have experience with threads but none with GPU threads. Let me say that again… the main thread keeps running after Dispatch and only blocks when you ask for the data back!! I wish someone had told me this from the beginning!

  • Each thread has an ID and acts like one iteration of a loop!! You need to manage that ID in a way that corresponds with traditional looping. (See Figure 1.0 below.)

Figure 1.0

for(int y = 0; y < 20; y++)
{
    for(int x = 0; x < 20; x++)
    {
        ComplexMathProblem(x,y);
    }
}

Iteration Management
Figure 1.0 depicts a standard x/y loop. I will explain how to emulate this on a GPU below.

computeShader.Dispatch(0,1,1,1); dispatches 1 x 1 x 1 thread groups, i.e. a single group (the first argument is the kernel index).
These are the group dimensions in x, y, and z.
Have a look at Microsoft's explanation of this layout:

Figure 1.1: Microsoft's diagram of the thread group / thread ID layout (from the numthreads documentation).

To use a compute shader effectively, we need an intimate understanding of what our loop needs to do, because we have to convert that iteration into a different kind of iteration. Figure 1.2 explains.

Figure 1.2

[numthreads(20, 20, 1)]
void CSMain(uint3 id : SV_GroupThreadID)
{
    int x = id.x;
    int y = id.y;
    ComplexMathProblem(x, y);
}

Figure 1.2 performs the exact same looping operation as Figure 1.0, but without the for loop.
Everything that occurs inside the for loop of Figure 1.0 occurs inside CSMain here, with no for loop at all.
Each thread executes the contents of CSMain in parallel, in no guaranteed order. However, the order doesn't really matter. Each thread has an ID, and all of them will execute. All you need to do is look at each thread as one iteration of your loop.

For workloads too large for a single dispatch, you can make the request multiple times, or you may have to wait longer for the GPU to return the data. This could result in frame drops, so it's important to optimize your code and do as much on the GPU as possible.

Now notice that I am using SV_GroupThreadID.
With [numthreads(20, 20, 1)] I have asked the GPU for 20 threads in x and 20 threads in y per group.
This is a total of 400 threads.
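One caveat worth adding here: SV_GroupThreadID restarts at (0, 0, 0) in every group, so it only works as a global loop index above because we dispatched a single group. If you dispatch more than one group, use SV_DispatchThreadID (group ID × group size + thread ID within the group) instead. A sketch, reusing the ComplexMathProblem placeholder from Figure 1.0:

[numthreads(20, 20, 1)]
void CSMain(uint3 gtid : SV_GroupThreadID,    // 0..19 within each group
            uint3 dtid : SV_DispatchThreadID) // global ID across all groups
{
    // With Dispatch(0, 2, 1, 1), dtid.x covers 0..39 while gtid.x repeats 0..19.
    ComplexMathProblem(dtid.x, dtid.y);
}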

I hope this helps. I am still learning too; we all are! It's a science after all!


This is great, but why then do we have to set it in C# as well with the Dispatch call?

https://docs.unity3d.com/ScriptReference/ComputeShader.Dispatch.html

Surely this is the GPU's job, so why do we have to pass the number of thread groups to Dispatch when we already declared numthreads in the shader?


AFAIK you want to transfer as little data from the GPU back to the CPU as possible. In the Dispatch call you ask your GPU for a given number of groups, but the CPU has no info from the compute shader (or buffer) other than what it receives via ComputeBuffer.GetData (which is terribly slow, btw), and numthreads in the shader has to be a constant known at compile time. I agree it could be handled better to avoid the duplication, but maybe it is not possible.
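One pattern that softens the duplication, as a sketch (the names are made up): define the group size once in the shader and derive the group count from it. On the C# side, Unity can even query the compiled group size via ComputeShader.GetKernelThreadGroupSizes, so the number doesn't have to be hard-coded twice.

#define GROUP_SIZE 64

#pragma kernel CSMain

RWStructuredBuffer<float2> Data; // hypothetical buffer
uint Count;                      // hypothetical element count, set from C#

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    if (id.x >= Count) return; // guard the rounded-up final group
    Data[id.x] *= 2.0;
}

// C# side (shown as a comment, since numthreads itself can't be set from C#):
//   shader.GetKernelThreadGroupSizes(kernel, out uint gx, out _, out _);
//   shader.Dispatch(kernel, Mathf.CeilToInt(count / (float)gx), 1, 1);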