How to measure shader performance?

Hi everyone, I was trying to optimize a shader I was working on and I hit a wall: to put it simply, I couldn’t tell if the optimized code was actually better than the non-optimized version.

For instance, is there any difference between using pow(base, 2) and base*base? For some reason I’m assuming there’s additional overhead in using the pow function. How about two lerps versus a few arithmetic operations including three divisions?

How can I measure the performance of a shader across very small changes?


Well, I guess I would set up a simple scene with a piece of geometry and the shader applied to it, then measure the framerate; if there’s too little rendering to tell the difference, draw the same geometry many times per frame (duplicate the object). If you have Unity Pro, I guess you can use GPU profiling?

Also you could look at the compiled shader code and see how many instructions are being used.

Generally divisions are slow/bad… try to convert them to multiplications where possible, e.g. 10/5 could become 10*(1.0/5), which you can then change to 10*0.2. Some of the built-in language functions, though, do what they do as optimally as possible, so it’s good to use them instead of trying to reproduce their functionality yourself… UNLESS you know better and can use the fact that you’re informed about what needs to happen, what result you want, and what’s relevant, to take shortcuts or make optimizations the compiler wouldn’t be smart enough to do. In your case I would use base*base instead of pow(); I would be surprised if pow() was faster. But testing it and seeing a real number to represent your changes is best.

Usually, reading textures is the slowest part of most shaders; the more reads you do, the slower it gets. Other instructions tend to be fairly fast, provided you’re not on a mobile-style GPU, which can get slow quite quickly if the shader is doing a lot.

Curious about your response. You said reading textures is slow. I am going to test something on mobile, but maybe you have an answer already anyway. I am thinking about using a vertex shader to blend textures or select textures by vertex color… For example, instead of 3 materials for 3 textured objects for 3 draw calls: 1 material with a vertex shader and 1 combined textured object that uses vertex color to select the texture.

Would reducing the draw calls this way be beneficial on mobile? Does having 3 textures in a shader for the vertex shader hurt anything other than memory?

Your biggest gains come from doing as much as you can in the vertex shader; this is always fast, even division. Branches are slow and should be avoided wherever possible in both frag and vert. The fragment shader should be as simple as you can possibly get it. Avoid pow() and other expensive math operations; better to use a lookup texture if you really have to do stuff like that.


Thanks for the responses. I tried the whole multiple-objects test, but it’s really hard to tell from the fluctuations; maybe I don’t have enough objects in the scene, or the test itself is irrelevant.

I already avoid using branches; what I usually do instead is “lerping” or “stepping”, but it seems like a waste of calculation to me.

From experience on iPad 1, working with multiple textures in the same shader is a big no-no; something like 4 textures already drops fps significantly.

Any other thoughts on the subject would be highly appreciated.

You can use AMD’s ShaderAnalyzer, which will provide you with some useful data about a shader’s performance.
Note that a simple float a = (b > c) ? d : e; is NOT branching, and thus is as fast as, or maybe even faster than, “stepping”.
Texture reads are tricky. If you just sample a texture with the UVs you get from the vertex shader, then it’s basically free, because it can be pre-fetched by the hardware. On the other hand, if you must do some computation in the fragment shader to get the UV coordinates, then it’s a dependent texture read, which can halt the execution of the shader until the data arrives… the latency can be even >100 cycles, depending on the texture format and filtering.
That means that these days it’s much faster to just do a pow() than a dependent, cache-thrashing texture read, at least on desktop.

Depends how many texture units the device has, too. The old iPad 1 might’ve been limited to 2 or 4? … Anyway, yes, reading textures is a slowdown, but you can read more textures overall if you read more than one within a shader pass; possibly 30-50% more reading is possible, it seems. I would think vertex blending of multiple textures should be faster than several draw calls.

read + read + read + write
is of course faster than
read + write + read + write + read + write

How does a pow() relate to or replace a texture read or uv manipulation? Or are you just comparing an expensive function with a texture read?

Pretty sure that depends on the platform/gpu.

It most likely is a branch on some hardware, or if you’re lucky, it gets turned into a step.

I’m comparing the cost of a once expensive function with a cache-thrashing, bottleneck-introducing evil read of a precomputed pow lookup texture.

I’m 100% sure that on ALL hardware that even supports dynamic branching (and I’m fairly certain older cards were fine with it too), it gets turned into a super cheap conditional assignment in one way or another. Ironically enough, your step actually compiles into the same conditional assignment.

For one platform, a solution:

Empty scene, 1 mesh, the shader full screen (perhaps rotating), measure Unity’s average framerate over 15 seconds. A really heavy shader slows down the framerate quite simply, and you can measure changes of 1 percent that way. It takes 2 minutes to construct a shader benchmark scene.

Sorry for necroposting; I think it’s better than opening a duplicate thread.

Still haven’t found an answer: how do you measure the actual performance of each statement? How do you decide, while writing a shader, which way of doing something would be faster?
What’s the final unit performance is measured in? Is it the number of instructions, or what?

I’ve been a shader programmer for a while, but I still need to use testing to tell what is faster.

How can I tell which would be faster? And how much faster, exactly?
For example, is a function call faster, slower, or the same as calling a macro?
How much slower is pow() than two, three, four, five… multiplications?
How much slower would a texture read (with “indirect” UVs) be compared to pow()? And compared to pow() plus a multiplication? pow() plus a divide?
Are two “clr *= someVar;” statements the same speed as a single “clr *= var1 * var2;”?

If I need to change only rgb components in my frag shader, which would be faster:

// clr is fixed4
clr.rgb *= i.color.rgb;
return clr;

or passing “color.rgb” and “color.a” as two separate fixed3 and fixed variables from vertex shader and doing stuff like this:

// clr is fixed3
// there's also fixed alpha
clr *= i.color.rgb;
return fixed4(clr, alpha);

In short, I still don’t get how you can guess the performance of each separate piece of code before you’ve actually written the entire shader and tested it as a whole.

Half of your questions can be answered by looking at the compiled shader. For DX11 platforms, Unity unfortunately does not provide it in a readable format. What I use, though, is the above-mentioned AMD Shader Analyzer, which lets you see the actual instructions behind the shader and even roughly how they perform on some (rather old) AMD cards.
The rest is just experience and general knowledge. Functions and macros perform identically because in the compiled shader there are no functions or macros, just straight-up code. Two multiplications are also the same no matter what syntax you use.
In your last example, if you’re not doing anything else with the values, I’d say passing it directly as a vector of 4 is a tiny bit faster because you avoid a mov instruction at the end. But if you really don’t do anything else in the pixel shader, then it does not matter at all, because 99% of the time will be spent elsewhere in the pipeline… rasterizer, interpolators, ROPs… or even just waiting for memory. Bandwidth is often the most limiting factor on desktop GPUs, after all.

It is a lot more difficult to figure out the rest. If you look at the compiled code, you’ll see pow(x, y) is compiled into log2, mul, and exp2 instructions. pow(c, y), where c is a constant, needs only a mul + exp2; pow(x, c) depends on the value of c and often gets turned into multiplications instead where possible.
Now, if you’re asking how much more expensive an exp2 instruction is than a multiplication, then there’s no satisfactory answer. It’s different between AMD, Nvidia, mobile GPUs, desktop GPUs, new cards, old cards… Even the compiler does not know, in the case of HLSL. It might have a vague idea, though… like, it turns pow(x, 512) into 9 multiplications, but pow(x, 1024) into a log, mul, and exp… Either way, you can see that it’s often safer to rely on the compiler to do low-level optimizations for you.

Texture reads are a whole other beast. When a shader core “sends a read request”, it often does not just idly wait for the data to come back; it switches to other code that is ready to process, possibly even in a completely separate pixel shader invocation. Generally you should have a lot more arithmetic instructions than texture fetches for that reason. The delay until the data comes back further depends on the texture format and filtering used, and most importantly on whether it has been fetched before and is currently in cache. Spatially coherent reads are cache-friendly reads: if neighboring pixels fetch texels that are also close to each other, it’s much faster than if they sample all over the texture randomly. Since cache memory is limited, that also means reading from very small textures very often is relatively cheap, but most likely not cheap enough to make a lookup texture for a single pow function worth it.

My point with all this is that measuring every separate piece is not nearly enough. Hell, even testing the entire shader on some random data isn’t accurate. You need to profile the shader in the actual, real scenario to get a good picture of the overall performance.

As a disclaimer, I specialize in desktop graphics, so whatever I just said might or might not be drastically different on mobile.


Do this: just repeat the command you want to test loads of times to get a decent benchmark of speed.

Someone liked this post recently so I decided to clarify this a little with things I learned from @bgolus not long ago and that is: it’s no longer cut and dried with GCN+ architectures, since there’s a number of potential bottlenecks. So these days it’s more - use the right tool for the job, unless it’s a big job in which case hire the guy with the neverending story avatar and don’t look back.


To clarify a little, it’s no longer “cut and dried” with basically any Shader Model 4.0 or better hardware, including OpenGL ES 3.0 mobile GPUs and almost all desktop GPUs from nearly the last decade. Raw ALU performance (how fast GPUs calculate math operations) has far outstripped memory bandwidth. 10 years ago, transferring a float4 from the vertex to the fragment shader had a greater cost than calculating that same data in the fragment shader from other data if it used ~8* or fewer instructions.

  • I honestly can’t remember the actual number; it might have been as high as 12 instructions.

Ten years ago!

An Nvidia GTX 260 bought in 2009 delivered roughly 550 GFLOPS. A GTX 1060 is >4000 GFLOPS, an almost 8x increase in ALU. The GTX 260’s memory bandwidth was 111 GB/s; the GTX 1060’s is 192 GB/s, less than a 2x increase. The RTX 2060 and Vega 56 GPUs are only in the 400~500 GB/s range, a 4x increase in bandwidth versus a 12~20x increase in GFLOPS compared to the GTX 260.

Also, don’t try to hire me. There are plenty of talented individuals out there capable of writing shaders and I am already gainfully employed with little free time to devote to contract work … counter to what my post frequency might imply.
