Does calling clip() early in a fragment shader affect performance at all on modern GPUs?
I am writing a PBR shader with Alpha Clip Threshold support, and I'm wondering whether I should hoist clip() before the PBR lighting call, given that at least some of the pixels are going to be visible, so the computation can't be skipped entirely anyway.
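For context, this is roughly what I mean by hoisting it; SampleAlbedoAlpha, ComputePBRLighting, and _Cutoff are just placeholders for my own functions and material properties:

```
// Sketch only - function and property names are placeholders for my actual shader.
half4 frag(Varyings input) : SV_Target
{
    half4 albedoAlpha = SampleAlbedoAlpha(input.uv);

    // Hoisted version: clip before any lighting work is done,
    // instead of only calling clip() at the very end of the shader.
    clip(albedoAlpha.a - _Cutoff);

    half3 color = ComputePBRLighting(input, albedoAlpha.rgb);
    return half4(color, albedoAlpha.a);
}
```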
My understanding is yes… on modern desktop hardware at least.
The key, as I understand it, is that all of the pixels in a warp/wavefront need to discard for the GPU to skip any computation. That is, every pixel in a group of 32 or 64 pixels (depending on the hardware) has to discard before the GPU can skip running the rest of the shader for that group. If even a single pixel in the group is not discarded, the whole group runs the full shader.
The bigger question is whether clip() has a measurable performance impact even when it never actually discards anything. It certainly has some impact; if nothing else, the comparison is an instruction or two. But beyond that I'm not sure.
For shaders that write to the depth buffer, there is an impact on rendering the object itself, plus an additional cost for every subsequent tri that tests against depth. Using clip() or discard forces the GPU to finish running the fragment shader before writing to the depth buffer, where otherwise the depth write can happen before or during fragment shading, and it forces the depth buffer to be decompressed to some extent, depending on the hardware. GPUs often store the depth buffer as just the plane equations for the tris, or use some other lossless compression technique, most of which have to be disabled once alpha testing comes into the picture. That makes writing to and reading from the depth buffer slower.
However, for shaders that don't write to the depth buffer, none of that should come into play.
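To be concrete about which case is which in Unity terms, it comes down to the ZWrite state on the pass; the tags and blend mode below are only illustrative:

```
// Illustrative ShaderLab render states only, not a complete shader.

// Case 1: alpha tested geometry that writes depth.
// clip()/discard here delays the depth write and hurts depth buffer compression.
Tags { "Queue"="AlphaTest" "RenderType"="TransparentCutout" }
ZWrite On

// Case 2: no depth write (transparents, UI, etc.).
// clip()/discard here avoids all of those depth-related costs.
Tags { "Queue"="Transparent" "RenderType"="Transparent" }
ZWrite Off
Blend SrcAlpha OneMinusSrcAlpha
```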
Thx. Looks like I should test it out, because my alpha clip threshold input could be either simple pattern based or noise based, which would result in quite different amounts of skipped computation if this holds.
A follow-up question: I am rendering some text using the SDF technique. Would clip() (alpha test) in general be slightly more efficient than transparency (alpha blend)?
Basically, I have already calculated the alpha value in the fragment shader, so I can either use it for clip() or output it as alpha. It seems to me that clip() would be better, but both Unity Text and TextMesh Pro use alpha blending (probably for other reasons, such as mobile GPUs…).
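For what it's worth, these are the two variants I'm comparing, in simplified form; the texture sampling, smoothstep width, and cutoff value are just placeholders:

```
// Simplified SDF text fragment shader - names and values are placeholders.
half4 frag(Varyings input) : SV_Target
{
    half dist = tex2D(_MainTex, input.uv).a;   // SDF stored in the alpha channel
    half alpha = smoothstep(0.5 - _Softness, 0.5 + _Softness, dist);

    // Variant A: alpha test via clip().
    // clip(alpha - 0.5);
    // return half4(_Color.rgb, 1.0);

    // Variant B: alpha blend - what Unity Text and TextMesh Pro do.
    return half4(_Color.rgb, alpha);
}
```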
Assuming you’re rendering solid color text, and not using a normal mapped & lit shader, there are no calculations to skip once you’ve calculated the alpha value. And that's assuming the GPU would even skip any calculations; you’d only potentially be skipping them on desktop in warp/wavefront-sized groups of pixels where not a single pixel is visible. Otherwise, a clip() by itself is going to be either identical to, or significantly more expensive than, a transparent blend, depending on the platform and whether you’re using ZWrite On or Off.
I forgot to add a key detail: my follow-up question is about UI (screen space), which means ZWrite is Off.
My question comes down to this: since each character is rendered as a single quad, a large part of the quad will be alpha 0.
I gather you are saying that alpha blending is so efficient on modern GPUs when alpha = 0 (if it even does the blend at all) that I shouldn't bother with clip(), because at best it gives no performance gain over blending. Correct?
Hmm … according to the official Nvidia docs (from 2014), they suggest using clip() even on alpha blended objects to skip the blend and help in ROP-limited scenes. GPUs change a lot every few years, so it might not still be relevant, but I guess it wouldn’t hurt to try.
And Intel, which is a large part of my target platform, also seems to recommend this:
Use discard (or other kill pixel operations) where output will not contribute to the final color in the render target. Blending can be skipped where the output of the algorithm is an alpha of 0, or adding inputs to shaders that are 0s that negate output.
…
Minimize per-sample operations, and when shading in per-sample, maximize the number of cases where any kill pixel operation is used (for example, discard) to get the best surface compression.
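If I'm reading that right, the suggested pattern is roughly this (a sketch; the zero threshold is taken from their wording, and ComputeFinalColor is a placeholder for the actual shading):

```
// Alpha blended pass (ZWrite Off). Kill pixels that can't contribute to the
// final color so the blend/ROP work can be skipped for them.
half4 frag(Varyings input) : SV_Target
{
    half4 col = ComputeFinalColor(input);   // placeholder

    if (col.a <= 0.0)
        discard;

    return col;
}
```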
Okay, so clip() / discard is bad for performance for a lot of reasons … if you’re writing to depth or stencil. But if you’re writing to depth or stencil with an alpha-tested shader, you pretty much have to use discard, so what can you do.
It seems that as long as you’re not writing to depth, AMD, Intel, and Nvidia all say “do it”. I haven’t tried to look at what ARM, Apple, or Qualcomm suggest. In the past I’ve had PowerVR & ARM folks tell me to avoid clip() like the plague no matter what, but that may have been under the assumption that no one uses clip() unless they're also writing to depth.