Shader Performance Benchmark

Hi All,

So we’ve all heard the rumours: dependent texture lookups, lerp and pow are evil in shaders… but how evil are they?

Well I figured it was time to do some tests and here are the results:

Run on a Nexus 7 - 2013, Unity 5.1

Seconds for 10 times overdraw fullscreen…
0.0188 = Base
0.0188 = 5 extra lookups same uv
0.0972 = 5 uv same position and texture
0.0708 = 5 uv diff position and same texture
0.0955 = 5 dependent reads and same texture

6 different textures
0.0995 = 5 extra lookups same uv
0.0995 = 5 uv same position
0.0720 = 5 uv diff position
0.0966 = 5 dependent reads

0.0512 = Lerp Test (10 lerps)
0.0561 = Fast Lerp Test (10 fast lerps)
0.3029 = Pow Test (10 pows)

0.0209 = Normalize (10 times)
0.0190 = Add test (10 adds float4)
0.0193 = Multiply test (10 times float4)
0.0193 = Divide test (10 times float4)
0.0872 = Matrix Multiply test (10 times float4)

0.0294 = Cubemap test (Unity_GlossyEnvironment ran 5 times)

Base = Simple skybox, a UI with a selection of tickboxes and an FPS / seconds counter, and 1 simple texture lookup. Each of the tests above had that 1 texture lookup on top of whatever else it did. There was no depth writing and the screen was overdrawn full screen 10 times. As an example, the Normalize test did 10 normalises per pixel in the shader, then an additional 10 times per instance, so 100 times total per screen pixel!

So it was a pure GPU power test. As you can see, Pow is the real evil here, roughly 6 times more expensive than lerp (0.3029 s vs 0.0512 s)! So try to avoid it if you can. I’ll retest on some other devices and post more results soon, but I figured I’d start with a mid-range Android as it’s what most people are targeting for mobile builds. Let me know if there are any other commands that interest you!
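
As a rough cross-check of the "6 times" figure, the Nexus 7 timings can be compared both raw and with the Base time subtracted. This is only a sketch: it assumes the Base cost is purely additive, which the tests themselves don't establish.

```python
# Nexus 7 (2013) timings in seconds, copied from the results above.
BASE = 0.0188      # base scene, 1 texture lookup
LERP_10 = 0.0512   # 10 lerps
POW_10 = 0.3029    # 10 pows

# Raw ratio: total time of the pow test vs the lerp test.
raw_ratio = POW_10 / LERP_10

# Net ratio: subtract Base first (assumption: Base cost is simply additive).
net_ratio = (POW_10 - BASE) / (LERP_10 - BASE)

print(f"raw: {raw_ratio:.1f}x, net of base: {net_ratio:.1f}x")
# prints "raw: 5.9x, net of base: 8.8x"
```

So on these numbers "6 times" is, if anything, on the conservative side once the Base time is taken out.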

I should point out fast lerp was:

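    // Hand-written lerp: x + (y - x) * t
    inline fixed4 FastLerp(fixed4 x, fixed4 y, float t)
    {
        fixed4 z = y - x;   // delta between the two endpoints
        return z * t + x;   // scale the delta by t and offset from x
    }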

Someone posted this would be quicker… they were wrong, at least on the Nexus 7.
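
For what it's worth, the two ways of writing a lerp are algebraically the same thing. The Python sketch below is only an illustration on scalars: the names `lerp` and `fast_lerp` are made up here, and it says nothing about what a shader compiler actually emits on any device. It just checks that x*(1-t) + y*t and (y-x)*t + x agree.

```python
import math

def lerp(x: float, y: float, t: float) -> float:
    """Textbook linear interpolation: weighted sum of the endpoints."""
    return x * (1.0 - t) + y * t

def fast_lerp(x: float, y: float, t: float) -> float:
    """Same form as the FastLerp above: one subtract, then multiply and add."""
    z = y - x
    return z * t + x

# Compare the two forms over a small grid of inputs.
samples = [(0.0, 1.0), (0.25, 0.75), (1.0, -2.0)]
for x, y in samples:
    for i in range(11):
        t = i / 10.0
        assert math.isclose(lerp(x, y, t), fast_lerp(x, y, t), abs_tol=1e-12)
```

Since both forms compute the same value, any timing difference comes down to how each one gets compiled for a given GPU.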

Jon

Astro Cyclone (Mali-400 GPU), a very low-performance machine
Seconds for 10 times overdraw fullscreen…
0.0260 = Base
0.0709 = 5 extra lookups same uv → Increase is caused by the extra math, see comment below!
0.1476 = 5 uv same position and texture
0.1337 = 5 uv diff position and same texture
0.1706 = 5 dependent reads and same texture

6 different textures
0.1470 = 5 extra lookups same uv
0.1476 = 5 uv same position
0.1339 = 5 uv diff position
0.1916 = 5 dependent reads

0.1285 = Lerp Test (10 lerps)
0.2450 = Fast Lerp Test (10 fast lerps)
0.3333 = Pow Test (10 pows)

0.2565 = Normalize (10 times)
0.1285 = Add test (10 adds float4)
0.1290 = Multiply test (10 times float4)
0.1751 = Divide test (10 times float4)
0.3333 = Matrix Multiply test (10 times float4)

0.1230 = Cubemap test (Unity_GlossyEnvironment ran 5 times)

As this device struggled with the Pow and Matrix tests, I re-ran these with 1 of each:

0.1170 = Pow Test (1 pows)
0.2914 = Matrix Multiply test (1 times float4)

So Matrix and Pow in pixel shaders are complete killers if you are targeting low-end Androids!
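
One way to read the re-run: if cost scaled linearly with the number of pows, the 1-pow figure would predict a 10-pow time far above the 0.3333 that was recorded. The sketch below does that extrapolation. Both the linear scaling and the idea that 0.3333 might be some kind of ceiling are guesses, not something the tests confirm.

```python
# Astro Cyclone (Mali-400) timings in seconds, copied from the results above.
BASE = 0.0260
POW_1 = 0.1170
POW_10_MEASURED = 0.3333

# Net cost attributed to a single pow (assumption: Base cost is additive).
pow_net = POW_1 - BASE

# Naive linear extrapolation to 10 pows (assumption: cost scales linearly).
pow_10_predicted = BASE + 10 * pow_net

print(f"net cost of 1 pow: {pow_net:.4f} s")
print(f"predicted 10 pows: {pow_10_predicted:.4f} s")
print(f"measured 10 pows:  {POW_10_MEASURED:.4f} s")
```

The prediction comes out at about 0.94 s, nearly three times the measured figure, which is at least consistent with the 10-pow run having hit a limit.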

You should also try transparent and alpha-test shaders. I’m curious to see if they’re still as hard to render as they used to be!

I’ll do this! I have a feeling the transparent objects will perform slower, but a comparison would be interesting so we know how much we can get away with!

A lot of this won’t make intuitive sense. Your tests are flawed because:

  1. Unity’s compiler will change the output (check the compiled source for what it’s actually doing).
  2. Accessing a texture but not using it can get optimised out and appear as not being a performance hit.
  3. The order you fetch textures in frag matters. Stacking texture reads in one go is generally faster on mobile, as the latency hit of the first texture read can cover subsequent reads.
  4. Calculating a uv value within frag, i.e. parallax, prevents the driver from prefetching the texture for frag in the vertex stage.

A lot of these come under optimisations, but generally it’s easy to wildly throw off benchmarks with mobile GPUs. It’s not all cycles.

And worse, it will change per driver (os/platform).

For good practices, go to the PowerVR (Imagination) website and read their optimisation PDFs and guidelines. They have a whole bunch of them there, and these tips generally work well for other mobile chipsets.

Alpha testing is slow on tile-based deferred hardware because it can’t get optimised. Pow and other heavy maths functions are slow on all GPUs, which is why we use lookup tables (even on desktop sometimes, depending on whether overdraw is a factor).
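
To make the lookup-table idea concrete, here is a small CPU-side sketch in Python rather than shader code. It precomputes pow(x, 5) for x in [0, 1] into a 256-entry table and reads it back with a nearest-entry lookup. The exponent, the table size and the helper name `pow_lut` are all arbitrary choices for illustration. On a GPU the table would normally live in a texture, so the lookup becomes a texture read.

```python
EXPONENT = 5.0   # arbitrary example exponent (assumption)
SIZE = 256       # arbitrary table size (assumption), e.g. a 256x1 texture

# Precompute pow(x, EXPONENT) at SIZE evenly spaced points in [0, 1].
LUT = [(i / (SIZE - 1)) ** EXPONENT for i in range(SIZE)]

def pow_lut(x: float) -> float:
    """Approximate x ** EXPONENT for x in [0, 1] via nearest-entry lookup."""
    x = min(max(x, 0.0), 1.0)            # clamp, like a clamped texture
    index = int(x * (SIZE - 1) + 0.5)    # round to the nearest entry
    return LUT[index]

# Measure the worst-case error against the real pow over a fine grid.
max_error = max(abs(pow_lut(i / 10000) - (i / 10000) ** EXPONENT)
                for i in range(10001))
print(f"max abs error: {max_error:.4f}")
```

The trade-off is precision plus the cost of the read itself, so whether a table beats a native pow will depend on the device.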

Transparent isn’t slower than opaque, if there is no overdraw. If there is overdraw it cannot early discard any pixels like it can with opaque, which is why the general advice for transparency is to use a lot of big particles as opposed to a lot of overdraw.

This is actually quite an arrogant response. It assumes two things: firstly that you know way more than I do, which you may well do, but don’t presume to when you don’t know me; and secondly that you know how the tests were conducted, which you don’t.

I’ll give you the benefit of the doubt and assume you weren’t being arrogant but just brain-dumped.

→ (Re point 1) I did.

→ (Re point 2) They are used, which is why there is a base time showing the simpler version, so we can offset it.

→ (Re point 3) I see no evidence of this when the reads are in the same fragment. Across different pixels, yes, I could see this being the case due to the parallel processing, but not within the same fragment. On the other hand, because of the way they are prefetched we should see no performance drop, and so far my tests show this, but I’ve never seen anyone actually provide test results for it. Generally I’d group the reads for neatness and readability. I’d love to see where you got this information: having literally just tested this on 2 different GPUs, I see no difference between doing them one after another and then lerping all the results, and introducing a lerp between each one that uses the lookup value.

→ (Re point 4) Yes, this is widely known, which is why I included it in the tests. What I’ve never seen is anyone test this to see what the performance drop is.

→ (Re “it will change per driver”) Which is why I’m testing different devices and platforms, starting with Android as it is the most volatile and the one of most interest, being the biggest mobile market and all.

They do, for PowerVR. Most of this doesn’t always cross over to all platforms, and they provide no statistics as to what effect each situation can have compared to others… For example, if you have no choice but to do a dependent read or a pow, which is worse, and by how much? They say to avoid them both, so which do you choose if you have to pick one?

So here you mention using a LUT, but that’s a dependent texture read, so which operations are better with a LUT than the original? Well, that’s what I hope to find out and, more importantly, how consistent it is across devices and platforms.

This is incorrect. Transparent will be either blending or discarding; both have overhead and so will be slower. How much slower, though? Well, that’s what I intend to investigate.

Despite all my responses one thing does bother me in your responses:

I have a real-life shader which performs a little slower than my texture lookup tests. Both are pre-fetching, and I have a feeling there might be something in this which I may have missed in the texture read tests. I’ll recheck these and post updated results if need be; it may mean adding more instructions to the base in order to offset those of the extra lookups.

Jon

Okay so following on from the comments above by hippocoder:

I reran the tests with some basic math performed on each texture. The same math is performed on the “5 extra lookups same uv” run to make the comparisons fair, but not on the non-texture benchmarks (like pow etc. or Base). This does indeed make the texture reads take a bit longer. The basic math is to add all the lookup results and divide by 6 before outputting.
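
The "basic math" step, written out as a Python sketch rather than shader code (the function name and the sample values are made up for illustration): sum the six RGBA lookup results and divide by 6.

```python
from typing import List, Sequence

def average_lookups(samples: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise average of RGBA samples: add them all, divide by the count."""
    count = len(samples)
    return [sum(channel) / count for channel in zip(*samples)]

# Six hypothetical RGBA lookup results (1 base lookup + 5 extra).
lookups = [
    (1.0, 0.0, 0.0, 1.0),
    (0.0, 1.0, 0.0, 1.0),
    (0.0, 0.0, 1.0, 1.0),
    (1.0, 1.0, 0.0, 1.0),
    (0.0, 1.0, 1.0, 1.0),
    (1.0, 0.0, 1.0, 1.0),
]
colour = average_lookups(lookups)
print(colour)  # prints [0.5, 0.5, 0.5, 1.0]
```

Because every sample feeds into the output, none of the lookups is dead code.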

Jon

Was a quick dump… as bad as that sounds :smile:

PDF from Imagination’s website; they have quite a few. Tested on iPhone 3GS and Vita as working, but only when Unity’s output plays nice. 4.x I believe it was. Only classic vert/frag, not surface.

Ultimately the rules do change all the time due to driver (and hardware) changes on mobile as you’re no doubt aware.

Nice! :slight_smile:

Yes, I think what I’m trying to look at is how it affects the different platforms. There are great resources for Android and iOS, but nothing for BlackBerry and Windows Phone or Store (mainly RT I suppose, as the rest should be powerful enough). So what applies globally, and what do we only need to be cautious of on each? Mainly we need to consider low-performance devices, but devices have really changed since a lot of these documents were created, so what still applies? And none of them mention what level of performance drop some of these commands have, i.e. how expensive they are. It’s all well and good saying they are expensive, but how do they compare to each other? While there will be differences, I think there will also be clear areas that show general cases. For example, so far lerp is not as bad as people on forums generally say; I’ve seen it mentioned a few times to avoid it in shaders.

HTC Desire C
Seconds for 10 times overdraw fullscreen…
0.0405 = Base
0.0666 = 5 extra lookups same uv
0.1570 = 5 uv same position and texture
0.1549 = 5 uv diff position and same texture
0.1805 = 5 dependent reads and same texture

6 different textures
0.1569 = 5 extra lookups same uv
0.1569 = 5 uv same position
0.1549 = 5 uv diff position
0.1805 = 5 dependent reads

0.2617 = Lerp Test (10 lerps)
0.2617 = Fast Lerp Test (10 fast lerps)
0.3333 = Pow Test (10 pows)

0.3333 = Normalize (10 times)
0.0792 = Add test (10 adds float4)
0.0792 = Multiply test (10 times float4)
0.0925 = Divide test (10 times float4)
0.3333 = Matrix Multiply test (10 times float4)

0.1401 = Cubemap test (Unity_GlossyEnvironment ran 5 times)

0.1280 = Pow Test (1 pows)
0.0671 = Matrix Multiply test (1 times float4)
0.0542 = Normalize (1 times)

This one is even slower than the Cyclone. I was surprised to see how poorly Normalize performed on this device, but Pow is still the real killer! We’ll see soon how other platforms compare; Windows Phone is up next!
