Some more advanced shader questions

Hi there! I have a few questions regarding shaders and I hope some of you with more experience can help me out. So without further ado, here they are:

1. Branching
When using branching, if each 2x2 pixel block takes the same path, we are good.

  • What happens if that 2x2 block is discarded?
  • What happens if it is overlapped by another opaque or alpha-cut mesh (like some grass)?
    Do we lose that optimization? Are those pixels shaded anyway even if overlapped?

2. Branching on high-ish-end mobile
In my tests, branching on mobile with uniform floats always seems to have a major impact on performance, so I have always avoided using branching. Unity seems to do the same in BIRP/URP. Tested on a Xiaomi Mi Mix 2 (Snapdragon 835). Any ideas?

3. 2x2
Regarding branching, I got the 2x2 figure from Jason Booth’s Medium article “Branching on a GPU”. Some articles say 8x8, some 32x32, some 64x64. Is this related to the architecture? Until today I had only heard about 32 and 64.

4. Interpolators
In my shaders I often do many calculations per vertex, but the way the shaders are set up, I use quite a few interpolators, maybe 5 to 7. Let’s consider some simple math, like calculating a scale and offset for UVs and then passing the result to the pixel shader. Can the interpolator be more expensive than the actual code?

5. Samplers
Let’s say I have 6 textures used for blending and the texture limit is not exceeded. What would you suggest and why in terms of performance:
a. Use 6 samplers, each tex with its own
b. Use 2 samplers for each Albedo, Normal, Mask pair

6. Noise texture vs math
Classic question. Using an uncompressed 256 noise texture sampled in world space vs a 3D noise function (approx. 80 math instructions when checked with Unity’s compile-and-show-code), I get the same fps on mobile in a relatively complex scene. Some posts say sampling the texture is cheaper, but I don’t see any difference in fps. Is sampling a texture that expensive, or is the simple math used for noise that cheap?

7. Texture sampling size
Is sampling a 32x32 pixel area from a 32x32 texture the same, performance-wise, as sampling a 32x32 pixel area from a 4K texture?

8. Alpha Test
Why is clip() so expensive that HDRP has an option to bypass it in the Forward or Deferred pass and only perform it, together with the early-Z optimization, in the Depth pass? The difference is big if it is not performed in the extra passes; I see up to a 10 fps increase when the bypass is enabled. Or is it all down to the early-Z optimization?

999. Should I care that much?
In most cases what I have noticed is that shaders are just super fast and GPUs just too complex. I can throw a bunch of features in there and in most cases I don’t see much difference in performance. Mostly I have optimized shaders by ear: trying to avoid too much texture sampling, using other textures for UV manipulation, trying to move as much as possible to the vertex shader, and generally being careful with what I do and how I combine different features to get the most out of my shaders. What is your approach, and how do you debug performance?

Thanks. More will come for sure :slight_smile:

Those 2x2 blocks are called pixel quads. Taking different branches within the same pixel quad isn’t the end of the world. It just means all 4 fragments pay the combined cost of all of the taken code paths.

The same rules apply. The shader code is still running before the discard. Think of discard as a special case branch. If all fragments end up hitting the same discard, great. If not, then those pixel quads are a bit more expensive as it’s running both the discard code path and the other code path for all 4 fragments.
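A tiny sketch of “discard is a special-case branch” (a hypothetical alpha-tested fragment shader; _Cutoff and the lighting line are just illustrative):

half4 frag (v2f i) : SV_Target
{
    half4 col = tex2D(_MainTex, i.uv);
    if (col.a < _Cutoff)
        discard; // this lane stops contributing output here...
    // ...but if any of its 3 quad-mates survive, the rest of the shader still
    // executes for the whole quad, so the cost doesn't just vanish
    col.rgb *= saturate(dot(normalize(i.worldNormal), _WorldSpaceLightPos0.xyz));
    return col;
}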

If a single pixel is visible when rendered, the 4 pixel fragments of the pixel quad run. No matter what. Doesn’t matter if it’s another fully opaque surface or one using alpha testing / discard.

Mostly fine, especially when doing branches on material properties. But mobile devices aren’t as powerful as desktop GPUs, and branching isn’t entirely free, so the costs are more obvious.

The “same branch on all fragments in the pixel quad” rule is more accurately “in the warp/wave”. GPUs are SIMD processors that run a certain number of threads in parallel. How many depends on the architecture. AMD’s latest RDNA 2.0, for example, can actually switch between 32 and 64 threads. So a 64-thread wave covering an 8x8 group of pixels all needs to run the same branch, or it pays the cost of all code paths taken.

On desktop & consoles this is absolutely true. If you can recalculate a value in the fragment shader and pass less data it’s almost always a win. Mobile … is supposed to be similar but it’s less obviously true. I’ve seen massive perf improvements doing per vertex calculations and passing more data between the vertex and fragment on relatively recent mobile devices (Quest).
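As a rough sketch of the trade-off from the interpolator question (property names like _Tiling are placeholders): passing a precomputed UV costs one interpolator, while passing the raw UV and redoing the math per fragment costs a single mad instruction.

struct v2f
{
    float4 pos : SV_POSITION;
    float2 uv  : TEXCOORD0; // option A: carries uv * scale + offset from the vertex shader
};

v2f vert (appdata_base v)
{
    v2f o;
    o.pos = UnityObjectToClipPos(v.vertex);
    o.uv  = v.texcoord.xy * _Tiling.xy + _Tiling.zw; // option A: done once per vertex
    return o;
}

half4 frag (v2f i) : SV_Target
{
    // option B would pass v.texcoord.xy unmodified and do the mad here instead
    return tex2D(_MainTex, i.uv);
}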

Use the one that’s easier for you to write. On Nvidia there can be a benefit to having unique samplers. On older AMD supposedly there can be a benefit to reusing one sampler for everything.

But it’s complicated, and generally I’d suggest not worrying about it unless you run out of samplers. On mobile it matters even less since that usually doesn’t even support separate sampler states.
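For reference, the two options look roughly like this with Unity’s texture/sampler macros (texture names are placeholders):

// a. every texture declares and uses its own sampler state
UNITY_DECLARE_TEX2D(_Albedo);
UNITY_DECLARE_TEX2D(_Normal);

// b. one texture declares a sampler and the others reuse it
UNITY_DECLARE_TEX2D(_Albedo2);
UNITY_DECLARE_TEX2D_NOSAMPLER(_Normal2);

// in the fragment shader:
half4 a  = UNITY_SAMPLE_TEX2D(_Albedo, uv);
half4 n  = UNITY_SAMPLE_TEX2D(_Normal, uv);
half4 a2 = UNITY_SAMPLE_TEX2D(_Albedo2, uv);
half4 n2 = UNITY_SAMPLE_TEX2D_SAMPLER(_Normal2, _Albedo2, uv); // borrows _Albedo2's sampler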

Texture sampling can be a lot cheaper, especially if you’re looking to use a high quality noise. The old frac(sin(dot())) noise is stupendously cheap (but also really not very good). Perlin noise can be expensive to calculate if you use a lot of noise octaves, cheap-ish if you limit it to only 2 or 3. If you want blue noise you’ll want to use a texture.
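For context, that cheap hash-style noise is the classic one-liner (a widely shared snippet, not tied to any particular Unity include):

// pseudo-random value in [0, 1) from a 2D position; cheap, but low quality
float hashNoise (float2 p)
{
    return frac(sin(dot(p, float2(12.9898, 78.233))) * 43758.5453);
}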

Should be about the same.

Early Z reduces the cost of over shading. Over shading is when an object’s pixels are rendered but do not appear in the final image because something else renders in front of them later. Grass and foliage are common cases where over shading can be a large part of the cost. A depth prepass can fill in the depth buffer with a very inexpensive fragment shader, after which there is much less over shading cost. Then, when the real shader renders later, even for “cheap” deferred shaders, there are fewer fragments being rendered that don’t appear in the final image. The 2x2 thing still exists though, so it’s not 100% perfect.
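Conceptually, the alpha-tested depth prepass fragment shader boils down to something like this (a simplification of what HDRP/URP actually ship; _Cutoff is the usual alpha test property):

half4 DepthOnlyFrag (v2f i) : SV_Target
{
    half alpha = tex2D(_MainTex, i.uv).a;
    clip(alpha - _Cutoff); // the only real work: decide whether this fragment writes depth
    return 0;              // color writes are masked off in this pass anyway
}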

Both of those things are very much “best practices” … for GPUs that mostly don’t exist anymore, at least on desktop. Dependent texture reads (sampling a texture using the results of another texture) were a big no-no for the first generation or two of GPUs with programmable shader support. It basically stopped being a concern starting with later-era DirectX 9 class GPUs nearly 15 years ago.

But they’re also not a bad thing to be mindful of for mobile, and good questions to ask in general.


Thanks Ben for the detailed answers!

9. Branching at Unity
I’m wondering why Unity almost never uses branches in their shaders. If I check BIRP, I can only find 2-3 UNITY_BRANCH usages, mainly for cubemap blending and shadows. There are about 4 UNITY_BRANCH usages in URP, and some more in HDRP. Why is Unity so conservative when it comes to branches?

10. Branching in HDRP
HDRP is supposed to run on modern hardware, so I was expecting some more branching, but instead they use tons of plain ifs. Ifs without a branch will run both paths, right?

Unity’s BIRP shaders were originally written some 7 years ago and needed to support a very different landscape of GPUs. Unity didn’t even have full DirectX 11 support when many of the shaders were written, so Direct3D 9 and OpenGLES 2.0 were the assumed targets. Neither of those has very robust (or, in some cases, any) branching support.

Today’s Direct3D 11 & 12 class desktop GPUs and OpenGLES 3.1+ mobile GPUs are far better at branching, though it’s still not always great on mobile, hence the quite conservative usage in URP (which still has to support mobile and the Nintendo Switch). HDRP assumes Direct3D 11.1 or better, hence the far more frequent usage.

Just because there’s no UNITY_BRANCH or [branch] (which is what that macro is) in the shader doesn’t mean there aren’t branches. Any if statement or for loop can be a branch if the shader compiler decides to make it one. And indeed any for loop that doesn’t have a fixed count will be a branch on a modern GPU; on older graphics APIs those would have been compiler warnings, or even errors on certain platforms! URP’s light loop for example is a fixed light count on low end mobile, but dynamic for other platforms. An if statement will be a branch if the compiler thinks it’ll be more efficient as one; the [branch] attribute is just telling the shader compiler you really, really want this part to be a dynamic branch no matter what.

AFAIK that only works for Direct3D. GLSL has no option to let you force a branch or not, so it’s always up to the compiler. I’m not sure how Metal or Vulkan handle things. I believe with Vulkan you have to be explicit about whether you want a branch or not, but Unity generates shaders for that target by converting the output from the Direct3D shader compiler into SPIR-V (the shader format Vulkan uses), so you’ll likely get whatever decisions that compiler made in Vulkan. Metal may be the same, but I don’t know.

And as a final mind-funk … using functions like step() or inline conditionals like foo > 1.0 ? bar : baz can still compile into branches if the compiler decides to.
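A small sketch of what’s being described (_SnowAmount and _SnowColor are hypothetical material properties):

// [branch] asks the HLSL compiler for a real dynamic branch; without it the
// compiler is free to pick a branch or to flatten the if and run both sides
[branch]
if (_SnowAmount > 0.0)
{
    albedo.rgb = lerp(albedo.rgb, _SnowColor.rgb, _SnowAmount);
}

// and conversely, the compiler may still turn this into a branch if it wants to:
half mask = albedo.g > 0.5 ? 1.0 : 0.0;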

Alright, this explains a lot. I always thought you needed to explicitly use UNITY_BRANCH for the if to become an actual branch. I was wondering why it behaved the same way whether I added it or not, and this explains it. It was just a quick test I did and I didn’t actually check the compiled code. Thanks again for the explanations!

11. Mipmaps
If no mipmaps are used for a texture, is using tex2Dbias with a bias of 0 more optimized, or does it not really matter:

finalColor = tex2Dbias( _Albedo, float4( uv_Albedo, 0, 0.0) );

This likely depends on the hardware. This is definitely something I’ve seen recommended frequently in the past, but I’ve never actually been able to measure a perf difference when using a texture without mips in the use cases I’ve had.

Certainly for some hardware I would assume it could have a perf advantage as it is explicitly not having to calculate the derivatives. But some hardware may be smart enough not to do that for textures without mipmaps already.


The compiled code is definitely different; all I could find is this:

sample_l ignores address derivatives, so filtering behavior is purely isotropic. Because derivatives are ignored, anisotropic filtering behaves as isotropic filtering.
https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/sample-l--sm4---asm-

12. Sampling per vertex vs per pixel
Is sampling per vertex and then passing the value to the pixel shader more optimized? Or is sampling the same cost regardless? What happens to the cache if sampled per vertex?

Oh yeah, compiled code will be different. But that doesn’t guarantee a measurable performance difference.

Though I should note I misread your question, and I suspect you may have miswritten it. tex2Dbias() still uses derivatives and is equivalent to sample_b. A tex2Dbias() with a bias of 0 isn’t actually any different than tex2D(), and I would expect some compilers to treat them identically. sample_l is equivalent to tex2Dlod(). Also, counter-intuitively, it is plausible for tex2Dbias and tex2Dlod to be slower than tex2D in some use cases / hardware, because both pass more information from the shader to the texture units; so if you’re memory bandwidth limited rather than sampler time limited, the “faster” option of tex2Dlod() could be slower.
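In HLSL terms the distinction looks like this:

half4 a = tex2D    (_Albedo, uv);                    // implicit derivatives, hardware picks the mip
half4 b = tex2Dbias(_Albedo, float4(uv, 0, 0.0));    // still uses derivatives (sample_b); a bias of 0 changes nothing
half4 c = tex2Dlod (_Albedo, float4(uv, 0, 0));      // no derivatives, explicit mip level (sample_l)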

Again, this will depend on the hardware. This was a common optimization in the past, and is still potentially one for mobile. But it will absolutely be thrashing the cache, as the texture sample positions are almost guaranteed not to be contiguous, unless you’ve taken care with the order of the vertices and the UV positions you’re sampling from (and mesh optimization doesn’t change things too much / the GPU’s execution order of the vertices works in your favor). Personally I don’t think of per-vertex vs. per-pixel texture sampling from a performance point of view, but rather in terms of what the end goal is, since the end visual results will be very different. But on modern desktop GPUs it’s almost always cheaper to pass less information from the vertex shader to the fragment shader, even if it means recalculating a lot of data in the fragment shader. In my experience mobile still seems to see performance benefits from passing data from the vertex to the fragment. It also depends on the density of the mesh: if there are more vertices than pixels, doing stuff per-pixel will always be cheaper, but maybe also think about using mesh LODs at that point.
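If you do sample per vertex, note that it has to go through tex2Dlod, since a vertex shader has no derivatives to pick a mip level from. A minimal sketch (assuming a hypothetical _NoiseTex and a noise interpolator in v2f):

v2f vert (appdata_full v)
{
    v2f o;
    o.pos   = UnityObjectToClipPos(v.vertex);
    o.uv    = v.texcoord.xy;
    o.noise = tex2Dlod(_NoiseTex, float4(v.texcoord.xy, 0, 0)).r; // sampled once per vertex,
    return o;                                                     // then interpolated across the triangle
}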


More vertices than pixels in Unity? In a standard Full HD screen scenario? No, Unity can’t handle that :smile:

Thanks for the insights! I used tex2Dbias thinking it works the same as tex2Dlod.

The fact is I’m trying to understand some of the low level stuff, but in most cases it boils down to hardware and compilers. In my case, I have shaders with a dozen features always on, 3 main textures, 3 detail textures, 4 global texture arrays, 1 emissive texture, 2 3D noise textures, plus all the internal Unity textures. And if instead of all these I just use a constant color for the albedo, with no other features, I end up with the same frame rate on a consumer GPU.

It is more a quest to understand how things work and where things can be improved.

By what measurement?

Unity’s fps display is showing the CPU framerate. If your framerate doesn’t change between drastically different shaders, it’s probably because you’re CPU limited and not GPU limited for what you’re rendering. You’d need to use some kind of GPU profiling to see the difference.

I’m usually using straight-up fps measurements (the same as I do with “instructions” using the Unity shader compiler), and since it is HDRP it is for sure CPU bound because my scenes are quite simple. But since I develop for the store and my scenarios don’t count anyway, plus my customers don’t report performance issues due to shaders, I guess it is fine :slight_smile:

13. Is the GPU automatically culling triangles that have “zero” size?

Basically this, or some distance-based size fading:

I can see a huge difference in RenderDoc, but I assume it is because there are fewer pixels to shade.
I also see performance differences on mobile when distance size fade is used.

In a recent article on the Unity blog, they say this:
https://blog.unity.com/games/experience-the-new-unity-terrain-demo-scenes-for-hdrp-and-urp
One thing worth noting, however, is that the LOD Group component is not compatible with Terrain details, though you can still use Prefabs for details and tweak the cull distance in the Shader Graph shader or via the Detail Distance setting on the Terrain.

The culling distance in SG is basically just a simple distance size fade:
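For reference, such a distance size fade in a vertex shader might look something like this (a sketch, with _FadeStart and _FadeEnd as assumed material properties):

float3 worldPos = mul(unity_ObjectToWorld, float4(v.vertex.xyz, 1.0)).xyz;
float3 pivot    = mul(unity_ObjectToWorld, float4(0, 0, 0, 1)).xyz; // object origin
float  dist     = distance(_WorldSpaceCameraPos, pivot);
// 1 while close, 0 beyond _FadeEnd; fully faded vertices collapse onto the pivot
float  fade     = saturate((_FadeEnd - dist) / (_FadeEnd - _FadeStart));
worldPos        = lerp(pivot, worldPos, fade);
o.pos           = mul(UNITY_MATRIX_VP, float4(worldPos, 1.0));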

Yes.

But that’s not something RenderDoc is capable of showing you.

RenderDoc can show you how long things are taking, but not necessarily why. Even using something like Nsight or other bespoke low level profiling tools, which give much more granular information about what the GPU is doing, won’t really be able to show you this.

The problem is that, at the level any of these tools will let you see, there’s not really a difference between a triangle that’s so thin or small that no fragments are rendered and a triangle that is infinitely small. The rasterization hardware is a black box that takes the vertex positions, and the only output is how many fragments run afterwards. How that hardware chooses which fragments, and what optimizations it has to speed up those calculations beyond the basics of vanilla rasterization, probably falls under the realm of industry secrets.

RenderDoc can tell you how many vertices a mesh has, how many triangles, and how many fragments of a mesh end up in the final image. But it can’t tell you how many of those were actually executed with any accuracy. Even the fragment execution count is a guess, as it can’t fully differentiate between fragments that were skipped via early depth stencil rejection and fragments simply not included in the final image due to late depth stencil rejection; it can only make an educated guess based on the render state and high level hardware capabilities. Even the vertex execution count is a guess, as it assumes it’s the number of vertices passed to the GPU, which isn’t strictly accurate since some GPUs will process some vertices multiple times!

But the TLDR of this, when I’ve asked “people who know” (i.e. people who have themselves worked on the hardware directly, or who can ask those people and get the answer), is: yes, infinitely small triangles are absolutely skipped on all commonly used GPUs, and skipped in a way that’s faster than triangles that are just too thin or small to be visible at the current resolution. Similarly, vertices with a NaN position will cause triangles that use that vertex to be skipped on all commonly used GPUs.
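The two “kill this triangle in the vertex shader” tricks mentioned above look roughly like this (shouldCull standing in for whatever per-vertex test you use):

if (shouldCull)
{
    // collapse to a zero-area triangle: every vertex of it lands on the same clip-space point
    o.pos = float4(0, 0, 0, 1);
    // or output a NaN position, which also makes the rasterizer drop the triangle:
    // o.pos = asfloat(0x7fc00000).xxxx;
}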

This is good to know! I always assumed scaling the vertices to 0 from the shader is the same as rendering the object. I even added a note telling people to use a culling system.

Speaking of RenderDoc, I have a shader that uses a branch to skip sampling a few textures. When checking the timings on 2 objects, one sampling the textures and one not, there is a big difference in timings, but RenderDoc always shows the textures in the Pixel Shader. So I assume it just shows all the bound textures, all the time.

Thanks again for the detailed explanations!


This might be getting pedantic, but scaling the vertices to zero from the shader is the same as rendering the object. You’re still paying the full cost of rendering it on the CPU, and the full cost of calculating the vertices (ignoring the possibility of branching to a fast path on “hidden” vertices). The only thing you’re really saving is the cost of shading them, and potentially a minor reduction in initial rasterization. That can be a non-trivial saving depending on at what size on screen you start to cull them, but don’t discount the rest of the costs you’re still paying.

Because the pixel shader still has those textures bound. Branches don’t change that fact. Can’t change that fact. They can only change whether or not the textures actually get sampled.

So on Unity shaders and branching:

As bgolus pointed out, most of these shaders were written for very old hardware, and URP is essentially a port of those shaders in many ways. And they still have to support low end hardware like the Quest, which is particularly sensitive to GPU cost. So Unity has to be very conservative and keep their shaders fairly simple.

That said, another consideration is the new SRP Batcher. The SRP Batcher can handle multiple materials running the same shader variant, but each new shader variant causes a new batch. So depending on your feature set, it might be more efficient to use branches than variants for small feature changes. This creates a bit of a dilemma for me with Better Lit Shader, which currently favors variants for everything. For instance, if the user uses brightness/contrast on some materials but not others, that currently creates new variants, which breaks batching. So is paying ~7 cycles for a branch a better option there? It kind of depends on how the shader is used across your scene. If you have 20 such options as branches, that might completely kill performance on low end hardware. But making hundreds of batches instead of one might also hurt performance in another way.
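To make the trade-off concrete, here’s the same feature written as a variant vs. as a branch (contrast purely as an example; the keyword and property names are illustrative):

// variant: zero runtime cost when disabled, but each keyword combination is a
// new shader variant, and therefore a new SRP batch
#if defined(_CONTRAST_ON)
    color.rgb = (color.rgb - 0.5) * _Contrast + 0.5;
#endif

// branch: one variant shared by every material, at the price of a few cycles per fragment
[branch]
if (_ContrastEnabled > 0.5)
    color.rgb = (color.rgb - 0.5) * _Contrast + 0.5;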

Another thing I run into is constant buffer size, and I haven’t done any measurements here to know if it’s really a problem or not (I suspect it’s not that big of a deal). Better Lit Shader has hundreds of features, and you can’t #if #endif anything in the CBuffer because it will break batching, so every material has to upload a massive constant buffer. I could see that making the Set Pass calls a bit more expensive, but again I don’t know by how much.
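The constraint being described is roughly this: for the SRP Batcher, every variant of the shader has to declare the same UnityPerMaterial layout, so feature properties stay in it even when the feature itself is compiled out (names illustrative):

CBUFFER_START(UnityPerMaterial)
    float4 _BaseColor;
    float  _Contrast;    // must stay declared even when the contrast feature is off,
    float  _Brightness;  // otherwise variants disagree on the layout and batching breaks
    // ...hundreds more for a shader with hundreds of features
CBUFFER_END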

Another thing I need to time soon is just how bad extra interpolators are on the Quest. I’m currently working on a Quest project, and Better Shaders (along with all of Unity’s shaders and Shader Graph shaders) assumes you need certain things like texcoord0 or the tangent. However, many of my shaders for this project use a MatCap-style lighting system and do everything in world space, so they don’t need either of these.

I noticed branching around textures can be tricky. For testing I use PIX lately, as it can show if a texture is used or not. I use your derivatives trick from Medium. All good and nice in the pixel shader, but it always samples the textures in the vertex shader (or PIX is just showing it, or the GPU decided it is not worth branching even if [branch] is used; possible?). If I use a shared sampler it seems to work, but that can cause other issues when the sampler is not used. I think you have a workaround for that too. So after doing a lot of tests in PIX and RenderDoc, I never found any benefit or a good workflow for using branches, so so far I don’t use them. I probably don’t have a single feature that doesn’t depend on some textures.

As for variants, I keep everything to a minimum in my store shaders so the SRP Batcher can do its work. On a high end GPU, having all the features enabled is fine in my tests. If users want simpler shaders, they can also disable them in Amplify. I always use a big shader function with tons of options you can toggle.

Not sure about the CBuffer; I have a ton of properties and I’m not sure how those impact performance. I use a few interpolators and they seem to improve performance a bit on mobile.

PS: I’m still waiting for the day I can use Better Shaders in ASE :stuck_out_tongue: