Performance penalty of branching shader code?

Hi,

I’m trying to understand the penalty to the frame as a whole if some parts of the screen render faster than others based on branching code in the same shader.

This is about a screen space reflections algorithm, but it could be anything. Assume that everything on the screen is rendered with the same shader, but some areas of the image start to use heavier features that involve iteration, like if there’s a puddle.

I understand that pixels are processed in parallel in cells of X * X pixels, and that any slow pixel in a cell will force fast pixels to wait for the slow pixel.

But what about the cells themselves? If the puddles only appear in e.g. 5% of the cells, does this force the remaining 95% of the cells in the image to wait for the puddles? Or is the bottleneck purely about fast and slow pixels inside the same cell?

Again, it’s important to remember that this is all the same shader, but some of the pixels will activate an “if (puddleStrength > 0)” branch and go into iterative SSR processing just on those pixels.

Any input would be much appreciated!

Each X*X group of pixels is its own “warp” (Nvidia’s term) or “wavefront” (AMD’s term), which can be scheduled with its own set of work for those pixels (threads) to do. With most rendering, once a warp is done, it is free to be scheduled with other work; it does not need to wait on the rest of the warps rendering the screen to finish. (This can be false when you get into compute shaders or shared memory with locks/atomics, which can force all the warps running that kernel to wait for the atomic operations or memory sync to finish.)

You can kind of think of these groups as the actual “cores” of the GPU, and each core has a bunch of SIMD threads inside. So once a core is done and not being held back by something more explicit, it can go on to do other work.

Perfect answer, thank you!

One follow-up, in case you have time, because the GPU’s branch prediction might still create a problem.

The example is still doing some expensive SSR on a pixel if it turns out to be reflective, and only then. But it’s only when I’m shading the pixel that I know whether it turned out to be reflective (let’s say that reflectivity is procedural or sampled from a texture). So how do I prevent running SSR on every single pixel?

We have to imagine an “if (puddleStrength > 0)” statement. As I understand, GPUs will tend to run both branches of an if statement, meaning that I’m actually doing this for every pixel all the time. It sounds like the GPU has to be told before the pixel is shaded whether this branch will be activated, like with a constant or a keyword.

Keywords are off the table (too big a hammer), but I might be able to set a CBuffer value on a per-material basis if I can calculate from the material settings that there’s no way procedural puddles will show, and then turn SSR off wholesale on that material with those puddle settings.

Is my reasoning sound here, that unless I can tell the GPU before the frame is rendered whether the branch will be used, I should count on every branch running all the time, with the “wrong” branches just thrown away by the GPU?

Thanks,

Per

You are correct that it is faster if you can set some sort of constant/kernel-level value before the fragment program is run so that the whole branch can be skipped; otherwise both paths are always being computed on all the threads. But also, you can force a branch to be a dynamic branch on most modern hardware by using the UNITY_BRANCH macro (which expands to HLSL’s [branch] attribute) before a branch in the code. There is a bit of an instruction penalty for having to handle the branch dynamically, but if it means saving a lot more work then it’s worth it.
The compiler may also automatically do this for you if it thinks the work is large enough, but it’s generally best to dictate it yourself so you can be sure.
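For example, here’s a minimal sketch of how that might look in your puddle case (the _PuddleStrength property and the ComputeSSR function are just placeholder names, not from any specific package):

CBUFFER_START(UnityPerMaterial)
    float _PuddleStrength; // set per material, e.g. from the material settings
CBUFFER_END

// ...

// UNITY_BRANCH expands to HLSL’s [branch] attribute, asking the compiler
// to emit a real dynamic branch instead of evaluating both paths.
UNITY_BRANCH
if (_PuddleStrength > 0)
{
    // Expensive iterative SSR only runs for pixels that take the branch.
    color.rgb += ComputeSSR(uv); // placeholder for the heavy SSR work
}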

The CBuffer route could also help with performance, as long as it’s not causing your draw calls to increase significantly. But having less data to pass into the kernel is also a helpful thing. So it’s one of those scenarios where profiling is the best way to see the true answer.

Hi,

I did not know about UNITY_BRANCH, that’s dynamite. Yes, in this example, the difference is enormous, something like 100x more processing. When a surface is reflective, I need to suddenly do ray marching, as well as sample the environment (refl. probes). And when it’s not reflective at all, none of this is done.

I will study up on that keyword, thanks a million!

Depending on how expensive the work is, and if on average only a fraction of the quads in your screen need to have the work done on them, it might be worth it to have a classification step to build a list with only the blocks which need processing.

This is why modern games usually do things like SSR and even deferred rendering using compute shaders instead of fragment shaders: you could have a classification kernel read your g-buffer in groups of 8x8 pixels, check if any pixel should have SSR on it, then append the tile coordinates/index to a buffer. Then your SSR kernel uses that buffer to figure out which tile each group is going to calculate the SSR for (it’s a bit more complex than that because you’d also need to increment the number of tiles in a counter to use as indirect dispatch arguments, and you’d need to use groupshared memory in the classification step to “coalesce” the results of all threads in each group).

That way, there’s no branch at all: a tile without any reflective pixels is simply not processed.
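A rough sketch of what that classification kernel could look like (the _GBufferReflectivity texture and _TileList buffer are made-up names for illustration; a real version would also copy the buffer’s hidden counter into the indirect dispatch arguments, e.g. with ComputeBuffer.CopyCount in Unity):

// Classification pass: one 8x8 thread group per screen tile.
Texture2D<float> _GBufferReflectivity;   // hypothetical g-buffer channel
AppendStructuredBuffer<uint2> _TileList; // coordinates of tiles needing SSR

groupshared uint gs_anyReflective;

[numthreads(8, 8, 1)]
void ClassifyTiles(uint3 id : SV_DispatchThreadID,
                   uint3 groupId : SV_GroupID,
                   uint groupIndex : SV_GroupIndex)
{
    if (groupIndex == 0)
        gs_anyReflective = 0;
    GroupMemoryBarrierWithGroupSync();

    // Each thread checks one pixel and coalesces its result into
    // groupshared memory.
    if (_GBufferReflectivity[id.xy] > 0)
        InterlockedOr(gs_anyReflective, 1u);
    GroupMemoryBarrierWithGroupSync();

    // One thread per group appends the tile if any of its pixels needs SSR.
    if (groupIndex == 0 && gs_anyReflective != 0)
        _TileList.Append(groupId.xy);
}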

About this, won’t there still be an if-statement somewhere in a fragment shader where I choose whether to do the SSR? I could definitely come up with ways to know ahead of time on a cell-by-cell basis whether any reflections will happen, maybe as an earlier render pass that puts it into a render texture or other buffer that saves the result. But if the SSR itself is conditional, it sounds like it might be done for every pixel anyway on Metal/Vulkan/OpenGL, since the [branch] keyword is only for HLSL.

Or are we more considering that the SSR is pre-computed for all pixels where it would matter and only conditionally applied in the fragment shader, i.e. reading it from a buffer and putting the value into the fragment output? Because then indeed, the fragment shader’s work is now reduced to a ternary.

The idea is that you would use compute to gather all the regions the SSR will be computed for, and then run an SSR shader/compute on that region collection. There would still be a conditional in the compute gather portion, but this would be a non-dynamic one (flat) that is just either assigning a value to the buffer or not and moving on. So you avoid the dynamic branch overhead and also aren’t computing complex logic in these branches, because the only logic it’s doing is deciding if a region should be added to collection or not.

GPUs operate best when the logic/throughput is uniform. So reducing the branching/decision making to the simplest possible pass can be very beneficial.

But again this can very much be a case-by-case basis, because you’re adding an extra pass over the data: first the full screen, and then the collection pass. Whereas right now you’re just doing all the work in the full-screen pass, which ultimately may be more performant in some camera situations and less in others.

Hi,

Truth be told, I’m mixing some discussions that shouldn’t be mixed. The SSR I currently have running is for bodies of water. This naturally happens at the transparent stage, so I’m simply sampling the underlying buffer for refraction and reflection in screen space directly in the fragment shader. It works very well, but is only possible because I’m after the opaque stage. And every pixel is SSR, so there’s nothing to optimize.

Where I created confusion is that I was thinking to extend my SSR algorithm to more generally be used for puddles, but that actually wouldn’t work, because I can’t sample the opaque buffer while I’m still populating it. I could certainly adapt my algorithm into a post process and do it on the frame, but I have to ask if I’m really a good enough shader programmer for this kind of optimization.

I’m going to profile Kronnect’s Shiny SSRR tool, because it has per-mesh controls. This means that I could optimize by breaking up my meshes based on where puddles are capable of appearing. And then SSR processing is naturally focused, probably much like you propose.

Then I could use my SSR for water, and Kronnect’s for everything else. Kronnect’s can’t be used for water, because it doesn’t have access to water wave normals (Kronnect confirmed this was impossible), so this was how I got into making my own SSR in the first place.

Lots to think about. Thanks for your input!!

This guy on Stack Overflow has a point about if-else statements that might be a saving grace:

https://stackoverflow.com/a/45735032

His point is that it’s branch divergence that’s a problem, but if all pixels inside the same wavefront take the shorter branch, then the wavefront as a whole finishes quickly.

His logic is that all pixels in the wavefront are literally executing the same code, i.e. the program counter and fetched instruction are exactly the same for each pixel. So as I understand him, if one pixel takes a branch, all the other pixels wait for that branch to finish before moving on to the next instruction.

If that’s the case, then indeed the SSR is self-optimizing down to the wavefront level, but then you pay for SSR for all pixels if a single pixel takes the branch.

Is this true or too simplistic?

This is true, but there’s more to it: GPU occupancy, which is a measure of how many of the GPU’s shader units are actually in use at the same time. The shader units use a shared pool of registers. The larger and more complex a shader is, the more registers it uses, which reduces the number of units that can execute at the same time.

When you have branches, the GPU must allocate enough registers for the worst-case scenario, potentially reducing occupancy. Still, this is something you’ll need to measure in order to see whether it’s worth it or not.

I also just remembered an interesting trick: you can abuse a depth buffer to filter specific shaders to specific pixels by using the “equal” depth comparison, because the GPU will perform the depth test before running any fragment shading. So you can have a full-screen classifier pass which writes a value to a depth buffer pretty quickly, then have your SSR pass depth-test to draw only on pixels with a specific depth value, having the hardware depth testing do the heavy branching for you.
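In ShaderLab terms, the SSR pass could be set up something like this (a sketch, assuming the classifier pass already wrote a known tag depth for the reflective pixels and the SSR pass draws a full-screen quad at that exact depth):

Pass
{
    ZTest Equal // hardware rejects every pixel whose tag depth doesn’t match
    ZWrite Off

    // ... the SSR fragment program only runs on the surviving pixels
}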

This has some extra information:
https://vksegfault.github.io/posts/gentle-intro-gpu-inner-workings/

That’s a brilliant idea, I’m going to do that. I’m already using a similar technique for rendering a water-level map into a render texture in a highly limited render pass, so that ground shaders know when they’re under water and can darken the ground, render caustics, etc.

And a killer idea to let the renderer do the depth test. I’m going to noodle with that!

The first time I saw that was in UE5’s Nanite; that’s how they render the materials as screen-space passes, using depth values to tag which pixels use which materials.

Using the graphics hardware depth testing will probably reduce the number of actual shader groups being dispatched to the bare minimum since GPUs are very optimized at skipping fragment shading that way.

If you guys are still here, I’ve been studying up on SIMT/D execution (Single Instruction Multiple Threads / Data), and it’s raising some questions about some best practices that are widely suggested but seem like they aren’t necessarily true.

I’m following these videos:

So the conclusion is that the same instruction is executed in lock-step across all 32 or 64 threads. If you go into a branch, the threads have to stay in lock-step, meaning that all threads that are going to take branch A are enabled in the execution mask, and for the branch B code, another execution mask is set.

So basically, we’re not really executing both sides of the branch at once, as is often suggested. If we have 10 threads on branch A and 22 threads on branch B, then the 22 B-threads are patiently waiting for A to finish, and then the 10 A-threads are patiently holding for B to finish.

What I get from this is that if no threads in a warp takes the A branch, we’re skipping over the A branch completely, disregarding the minor infrastructure of making the comparison and jumping the program counter.
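To make that concrete for myself, here’s a toy sketch of how I understand the execution mask serializes a divergent branch in a 32-thread warp (ExpensiveA and CheapB are just placeholders):

if (perPixelValue > 0)      // say 10 threads are true, 22 are false
{
    resultA = ExpensiveA(); // runs once, with only the 10 threads enabled
}
else
{
    resultB = CheapB();     // runs once afterwards, with the other 22 enabled
}
// reconvergence point: all 32 threads are active again
// and if zero threads were true, the ExpensiveA() block is skipped entirely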

And then here’s the question:

Is conditional compilation like #ifdef _WATER_ENABLED / #endif really such a massive improvement?

It seems to me that if every warp will take the same branch based on a CBuffer value, it’s basically the same performance as conditional compilation, disregarding the execution of the comparison. So my hypothesis is that this is roughly the same performance as an #ifdef feature flag:

(Properties)
_waterEnabled("Water Enabled", Float) = 1

(Code)
if (_waterEnabled > 0.5) {
    // Expensive water code
}

Hypothesis is that this isn’t vastly worse than:

#ifdef _WATER_ENABLED
    // Expensive water code
#endif

This interests me because I’m starting to have a ton of conditional compilation, which is producing a lot of shader variants as well as reducing runtime flexibility.

So what’s your opinion on using keyword toggles only for things that really are major and static features, while not feeling too bad about leaving many global features as if/else statements?

I guess one problem I can see is that if this feature is enabled/disabled on a per-material basis, the GPU might not be able to tell the difference, and put pixels together in a warp where some have it on and some have it off.

But it sounds like it would hold for global features, and might even be attractive to have an if {} statement inside an #ifdef directive.

Because if all I do is an #ifdef _WATER_ENABLED, this enables the water calculation for all materials that are even just supposed to potentially support water, and now this calculation runs all the time, even if my global runtime setting is for a bone-dry landscape.

But if I also have an if (_globalWaterStrength > 0.5) { } around the water code, then none of that code will execute for any materials.

Seems advantageous to not purely rely on keywords to toggle features, but to also give the GPU a chance to opt out at the last minute if we know that no effective work will be done by the expensive water code.
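In other words, something like this combined pattern (reusing the hypothetical names from above):

#ifdef _WATER_ENABLED                 // compile time: material could support water
    if (_globalWaterStrength > 0.5)   // run time: skip the work when globally dry
    {
        // Expensive water code
    }
#endif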

Even if no thread takes the branch, the GPU still needs to reserve the registers for the branch (potentially reducing occupancy), and still needs to execute the instructions and read the variable needed to check the branch condition. Having the code not exist at all via compile time flags is always going to be faster.

One extreme example that shows this are the ubershaders in the Dolphin GameCube/Wii emulator:

They are gigantic shaders implementing all available features using branching, used while the specialized shaders are being compiled in the background, and are much heavier than the specialized shaders.

The impact of branching on performance is also going to vary depending on the GPU, so it may work just fine on your machine but run much worse than the specialized shader on another (especially on Intel GPUs).

Are we 100% sure registers are actually being allocated for a branch? There will of course be wasted GPU cycles on checking the condition, setting an execution mask and advancing the PC, but as I understand all the SIMT descriptions, the execution is serialized, with a branch not taken by any thread in a warp simply being skipped over because no thread needs to run it.

I’ll study up on the Wii Ubershaders, because if the penalty comes from branching, that’s obviously a superb real-world test of the hypothesis. I’m just wondering if the penalty couldn’t come from other things, for example instruction cache misses because of the larger shader code?

These Ubershaders are definitely a spot-on comparison. I’ll check it out now.

Thanks!

OK, done studying up. It is a very, very good example. Please forgive me for continuing to probe this; I’m not just sitting on my hands, I only want to drill down to find out if this is still true under the exact conditions I’m gunning for. I truly appreciate your input.

What I can’t tell from the Ubershader discussion is whether the feature toggles they’re emulating with branches are actually global, or if they’re per-material. That would make a huge difference to performance, because if it’s global, every thread in every warp is taking the same path.

But if an Ubershader is representing a lot of pixels whose different underlying feature toggles are right now being emulated with branching inside a single Ubershader, you could imagine half the threads in a warp taking the feature branch while the other half of the threads wait for the warp to reconverge. And the GPU wouldn’t be able to be smart about partitioning the warp, because to the GPU, this is all the same shader (until the specialized variants are done compiling).

The very narrow case I’m trying to solve is when a branch will be taken the same way by every pixel, either from a global setting like “if (_waterStrength > 0)” or “if (_waterHeight > 0)”, where the same branch is taken by every thread in every warp.

So where Ubershaders could have a mix of these from pixel to pixel, a global setting or a universally arrived-at decision would not have a mix of branches from pixel to pixel.

So do you feel that the hypothesis holds for this narrow case? And this is again ignoring the cost of the branch condition, pushing the execution mask on the SIMT stack, and PC advance.

Yes, you can check how many registers a shader needs by using Nvidia’s and AMD’s GPU debugging tools (Nsight Graphics and the Radeon GPU Profiler, respectively).

(On PC the GPU driver may or may not peek at the constant buffer value used by the branch and generate variants of the shader without the branch, but I only heard rumors of that and never tested to see if that can actually happen.)

Also, “global” and “per material” are the same thing as far as the GPU is concerned: it’s all a value read from a constant buffer. That’s called a “uniform branch”, where the value used to decide the branch does not depend on thread-varying data (like UV coordinates, which can vary per pixel, or vertex data, which can vary per vertex).
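For example (with hypothetical names; _WaterStrength is read from a constant buffer, _PuddleTex is sampled per pixel):

// Uniform branch: the condition comes from a constant buffer value,
// so every thread in a warp takes the same path.
if (_WaterStrength > 0)
{
    // ...
}

// Thread-varying branch: the condition depends on per-pixel data,
// so threads within one warp can diverge.
float puddleMask = tex2D(_PuddleTex, uv).r;
if (puddleMask > 0)
{
    // ...
}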