A good example is this Ray Tracing tutorial from Gamasutra. I implemented some ray marching recently but without using any kernels. I simply do all the operations in the shader. My question is: isn’t the shader already intrinsically parallelized by the GPU, since it’s computing everything in the fragment function?
What would I gain (and how much) by using a compute shader instead of a simpler fragment shader?
Thanks
I have the same question. Any results?
I haven’t delved too deeply myself but my understanding is that there are some things you can’t do in a fragment shader.
It’s mainly around the lack of access to buffers or any ability to either precalculate stuff or store intermediate calculations for later reuse. A fragment shader is dumb and only knows about its own single pixel. With a compute shader there are many possibilities for optimization that wouldn’t be possible in a fragment shader.
Search GitHub for “raymarch” (or “sdf” etc) and “compute shader” - there are a few projects that might give you some ideas.
Actually you can bind buffers to the pixel/frag stage to read or write. You can even compute stuff in a compute shader and then bind that result to the frag to read from without transferring data around.
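To make the "compute once, then read it from the frag" idea concrete, here is a rough CPU-side sketch in Python. It is not shader code: plain lists stand in for GPU buffers, and the names `compute_pass`, `fragment_pass` and `sphere_sdf` are made up for illustration. It only shows the data flow, where one pass fills a coarse buffer and the per-pixel pass just looks values up.

```python
# CPU-side illustration only: Python lists stand in for GPU buffers.
import math

def sphere_sdf(x, y, z, radius=1.0):
    """Signed distance from point (x, y, z) to a sphere at the origin."""
    return math.sqrt(x * x + y * y + z * z) - radius

def compute_pass(grid_size, extent=2.0):
    """'Compute shader' stand-in: fill a coarse buffer with SDF samples once."""
    step = 2.0 * extent / (grid_size - 1)
    buffer = []
    for j in range(grid_size):
        for i in range(grid_size):
            x = -extent + i * step
            y = -extent + j * step
            buffer.append(sphere_sdf(x, y, 0.0))
    return buffer

def fragment_pass(buffer, grid_size, width, height):
    """'Fragment shader' stand-in: each pixel only reads the precomputed buffer."""
    image = []
    for py in range(height):
        for px in range(width):
            # Nearest-neighbour lookup into the coarse buffer.
            i = px * grid_size // width
            j = py * grid_size // height
            inside = buffer[j * grid_size + i] < 0.0
            image.append(1 if inside else 0)
    return image

GRID = 9  # hypothetical coarse resolution, independent of the image size
sdf_buffer = compute_pass(GRID)
image = fragment_pass(sdf_buffer, GRID, 18, 18)
```

Note that the buffer resolution (9x9) is chosen separately from the image resolution (18x18), which is the kind of control a plain per-pixel pass doesn’t give you.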
Some benefits of compute off the top of my head:
- Control over the computation resolution and hardware resource distribution instead of it simply being the pixels the triangle falls on.
- Asynchronous or simple pre-computing of a result.
- Can easily implement reduction-style computations, where the output of one compute pass is a smaller number of elements that are then processed by another compute kernel, and so on… leading to more optimal calculations.
- Can compute arbitrary data that may not relate to a specific pixel, such as vertex data (the vertex pass only knows about its current vertex, and geo/tess can only see up to 6 adjacent vertices (3 in Unity) and would be more wasteful in many circumstances) or any other computation that would benefit from highly parallel processing.
- Is specifically designed for input/output to arbitrary buffers and allows for further optimizing through the use of thread/work groups and group-shared memory.
And I’m sure there’s much more.
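As a rough illustration of the reduction point in the list above, here is a CPU-side Python sketch, not real shader code. Each call to `reduce_pass` stands in for one compute dispatch, a small list stands in for group-shared memory, and `GROUP_SIZE`, `reduce_pass` and `reduce_all` are hypothetical names. On a GPU the groups (and the additions within each step) would run in parallel; here they simply run in a loop.

```python
# CPU-side illustration only: each reduce_pass call stands in for one compute dispatch.
GROUP_SIZE = 4  # hypothetical thread-group size

def reduce_pass(values, group_size=GROUP_SIZE):
    """One 'dispatch': each group sums its slice and writes a single output element."""
    out = []
    for start in range(0, len(values), group_size):
        # Copy the group's slice into a stand-in for group-shared memory.
        shared = list(values[start:start + group_size])
        # Tree reduction inside the group: the stride doubles each step,
        # so the number of active additions roughly halves.
        stride = 1
        while stride < len(shared):
            for k in range(0, len(shared) - stride, 2 * stride):
                shared[k] += shared[k + stride]
            stride *= 2
        out.append(shared[0])
    return out

def reduce_all(values):
    """Chain dispatches until one element is left (assumes a non-empty input)."""
    passes = 0
    while len(values) > 1:
        values = reduce_pass(values)
        passes += 1
    return values[0], passes

total, passes = reduce_all(list(range(1, 65)))
print(total, passes)  # 2080 3
```

With 64 inputs and groups of 4, the element count goes 64, 16, 4, 1, so three passes are enough. Doing the same thing in a fragment shader would mean rendering to progressively smaller targets.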