Oculus Quest - Worse performance with 2k ASTC 8x8 than 1k ASTC 4x4?

I’ve been doing some testing with different texture compression/res using the Oculus Quest and the OVR Metrics Tool. I’m using texture arrays and I’ve noticed that 2k textures with ASTC 8x8 result in MUCH worse performance than 1k textures with ASTC 4x4, despite both having a memory footprint of 1.3 MB. Even though both use the same amount of memory, can this still be a bandwidth-bound issue? Is this somehow related to mip differences and tile caching?

As per the docs I read about ASTC, 8x8 and 4x4 should perform at the same speed at runtime. Only the compression times are worse. Decompression times should be the same (and the same as DXT1 too, even if ASTC has much better quality and weighs the same).
It then makes sense that 2k textures have worse performance than 1k on severely bandwidth limited hardware since they are actually 4 times larger at writing time, but it should be more related to overall bandwidth of the whole scene render than a specific texture rendering.
I mean, as long as you don’t run out of bandwidth, rendering a 1k or an 8k texture should make no difference in performance.
Of course, this is the theory and I haven’t tested it with the Oculus Quest (and things can change a lot from theory to practice) (also, I could be wrong :stuck_out_tongue: )

On the Tegra X1 (Nvidia Shield TV and Nintendo Switch) I have found that manually reordering the mesh rendering order so that closer objects get rendered first (avoiding overdraw aggressively) makes an ENORMOUS difference in performance (from 10fps to 60fps!).
I do it by throwing tons of raycasts from the camera each frame and setting the mesh render order manually on each renderer. Seems overkill but since the CPU is underused and the GPU bandwidth is so critical, sacrificing CPU for relieving GPU bandwidth is worth it.
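
A minimal sketch of the front-to-back idea, in Python rather than engine code. Note that it swaps the raycast approach described above for a plain distance sort, which is a simpler stand-in and not the poster's actual method. The `Renderer` class, its `render_order` field, and the scene contents are all hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Renderer:
    """Hypothetical stand-in for an engine renderer component."""
    name: str
    position: tuple        # world-space (x, y, z)
    render_order: int = 0  # lower values are assumed to draw first

def assign_front_to_back(renderers, camera_pos):
    """Sort by distance to the camera and number the renderers nearest-first."""
    ordered = sorted(renderers, key=lambda r: math.dist(r.position, camera_pos))
    for order, renderer in enumerate(ordered):
        renderer.render_order = order
    return ordered

scene = [
    Renderer("far_wall", (0, 0, 30)),
    Renderer("crate", (1, 0, 4)),
    Renderer("pillar", (-2, 0, 12)),
]
draw_list = assign_front_to_back(scene, camera_pos=(0, 0, 0))
names = [r.name for r in draw_list]
print(names)
```

Drawing the nearest geometry first lets the depth test reject hidden pixels before their textures are sampled, which is where the bandwidth saving would come from.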

I’m very interested in the Oculus Quest and plan to support it eventually but my game is very texture and bandwidth sensitive.
Have you discovered anything more of interest about performance on the Quest? Any info is welcome :slight_smile:

This sentence is nonsense.

“At writing time” is the compression, which happens “offline” in the editor, not on device. On the device the texture’s memory & bandwidth usage should be virtually if not completely identical.

When a texture is sampled by a shader, it has to be loaded from main GPU memory into the cache of the texture sampler unit. This is done by loading the least amount of data possible from the texture, which is some number of single “blocks”. A single block from an ASTC 8x8 texture uses 128 bits, 8 x 8 x 2 bits per pixel. A single block from an ASTC 4x4 texture also uses 128 bits, 4 x 4 x 8 bits per pixel. Where memory bandwidth comes into play is in the combination of how big the blocks are (in memory usage), and how often new blocks need to be loaded. If multiple pixels in a row use the same texture blocks, no new ones need to be loaded.
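
The block arithmetic above can be checked with a short Python sketch. It assumes square power-of-two textures with a full mip chain and partial blocks rounded up; the function names are made up for illustration.

```python
import math

ASTC_BLOCK_BYTES = 16  # every ASTC block is 128 bits, whatever its texel footprint

def bits_per_texel(block_w, block_h):
    """Compressed bits spent per texel for a given block footprint."""
    return ASTC_BLOCK_BYTES * 8 / (block_w * block_h)

def level_bytes(size, block):
    """Compressed bytes of one square mip level; partial blocks round up."""
    blocks_per_side = math.ceil(size / block)
    return blocks_per_side * blocks_per_side * ASTC_BLOCK_BYTES

def mip_chain_bytes(size, block):
    """Compressed bytes of the whole chain, from size x size down to 1 x 1."""
    total = 0
    while True:
        total += level_bytes(size, block)
        if size == 1:
            break
        size //= 2
    return total

two_k_8x8 = mip_chain_bytes(2048, 8)
one_k_4x4 = mip_chain_bytes(1024, 4)
print(bits_per_texel(8, 8), bits_per_texel(4, 4))
print(two_k_8x8, one_k_4x4, two_k_8x8 - one_k_4x4)
```

Under those assumptions both chains come to roughly 1.33 MiB, and the 2k texture is larger by a single 16-byte block for its one extra mip level.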

One would expect an 8x8 texture with higher compression to reduce the frequency of the loads, or at least be no different than a 4x4 in the worst case, as their initial memory bandwidth usage should be the same. So either my understanding of how this all works is missing something fundamental (entirely plausible btw) or larger ASTC block size formats take longer for some mobile GPUs to decode (also entirely plausible). Official Adreno documentation has this cryptic line:
https://developer.qualcomm.com/qfile/34706/80-nb295-7_a-adreno_vulkan_developer_guide.pdf

They don’t explain what that means, but I’ve seen other documentation talk about Mali GPUs having variable time to decode texture formats depending on the format, with things like ETC w/ Bilinear taking a single cycle of the texture unit, and floating point or trilinear filtering adding more. This might be the cause?

Either way that’s quite annoying since the big draw of ASTC is the variable block size and the expected performance benefits that brings. If anything larger than 4x4 is slower to decode than the bandwidth savings they give, that removes most of those benefits.

Slow down bgolus… by “writing time” I meant at “render time”, when the GPU has to read the texture and then write it on the screen.
Rendering a larger texture will eat more bandwidth than a smaller one. On severely bandwidth limited GPUs, this can cut performance drastically.
In those cases, adjusting the mipmap rendering to force the GPU to use lower size mipmaps more often fixes the performance drop.
The benefits of ASTC are smaller size with better quality than any other compression available on mobile chipsets, not faster rendering. And if you have experience with mobile compression, the quality boost alone is totally worth it, since otherwise you would be forced to use uncompressed textures for a similar quality.

Well, not just on Mali… my experience is floating point and trilinear are slower on every mobile chipset I have tested.

Sorry, that was a little flippant. :frowning:

Why? Like I detailed above, a 2048x2048 ASTC 8x8 is the same size in total memory storage* and in block size memory usage as a 1024x1024 ASTC 4x4. Each shader invocation is going to be decoding a unique set of texels from the blocks. Even if the UV space being decoded is the same and isn’t mip limited, the memory bandwidth should be identical. What am I missing?

* ignoring the handful of extra mips?

Shader requests bilinear sample.
UV is the center of a block, mip 0, and texture is an ASTC 4x4, so a 128 bit block is loaded.
TU looks up the palette index of the 4 texels, decodes the color from the palette, interpolates between them, outputs a single color.

If you’re using an ASTC 8x8, or ASTC 10x10, or any block size, the above doesn’t change, and none of the numbers that affect memory bandwidth change. The TU doesn’t care if it’s a 2x2 texture or a 16k x 16k texture, it’ll output a unique color value for every sample, so the “write out” there doesn’t change either. Functionally this is no different than using an uncompressed 4x4 RGB texture vs a compressed (ETC / DXTC / etc.) of the same memory footprint and thus higher resolution. They should perform roughly identically. So I can’t see how this is memory bandwidth related.
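
To make the "none of the numbers change" point concrete, here is a rough Python sketch that counts how many compressed blocks one bilinear tap touches at mip 0. It assumes texel centres at half-integer coordinates and clamp-to-edge addressing; real hardware may differ, and the function name is hypothetical.

```python
import math

def blocks_touched(u, v, tex_size, block):
    """Set of (bx, by) block coordinates read by one bilinear tap at mip 0."""
    def texel_pair(coord):
        x = coord * tex_size - 0.5       # texel centres sit at i + 0.5
        low = math.floor(x)
        high = low + 1
        # clamp-to-edge addressing keeps both texels inside the texture
        low = min(max(low, 0), tex_size - 1)
        high = min(max(high, 0), tex_size - 1)
        return low, high

    xs = texel_pair(u)
    ys = texel_pair(v)
    return {(x // block, y // block) for x in xs for y in ys}

# UV at the centre of the first block: one 128-bit block for either format.
centre_4x4 = blocks_touched(2 / 1024, 2 / 1024, 1024, 4)
centre_8x8 = blocks_touched(4 / 2048, 4 / 2048, 2048, 8)

# UV on a block corner: worst case, four blocks for either format.
corner_4x4 = blocks_touched(4 / 1024, 4 / 1024, 1024, 4)
corner_8x8 = blocks_touched(8 / 2048, 8 / 2048, 2048, 8)

print(len(centre_4x4), len(centre_8x8), len(corner_4x4), len(corner_8x8))
```

In this simplified model a single tap loads between 16 and 64 compressed bytes regardless of block footprint, which is the basis of the argument above.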

Note, I legitimately want to know if I’m missing something here. I am not trying to attack you.

And I’m not trying to cast doubt that ASTC is slower than other formats, I’m well aware that it’s a significantly more complex format than any other GPU compression format out there. My knowledge of the specifics for how ASTC compression works compared to S3TC/DXTC/BC or ETC or PVRTC (all of which I understand pretty well) is lacking to say the least. But my current level of understanding would point to this being a TU silicon area consideration rather than related to memory bandwidth.

Could it not be the increased cache pressure? The texture block is decoded somewhere–presumably when it moves from main memory to the highest level cache? At that point the 8x8 case expands to 4 times the memory footprint of the 4x4 case. That quadruples the cache pressure and could eventually bubble up to the highest level cache requiring blocks to be fetched from main memory more often. On paper the bandwidth is the same, but the 4x4 case gets to cheat more because the texture memory it needs is already in cache that much more often.
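
The expansion described above can be put into numbers with a small Python sketch. It assumes, purely for illustration, that blocks are expanded to plain RGBA8 texels before being cached; nothing in this thread confirms where Adreno actually decodes.

```python
ASTC_BLOCK_BYTES = 16  # compressed size of any ASTC block

def decoded_block_bytes(block_w, block_h, bytes_per_texel=4):
    """Size of one block once expanded to plain texels (RGBA8 assumed)."""
    return block_w * block_h * bytes_per_texel

def expansion_ratio(block_w, block_h, bytes_per_texel=4):
    """Decoded bytes produced per compressed byte read."""
    return decoded_block_bytes(block_w, block_h, bytes_per_texel) // ASTC_BLOCK_BYTES

for side in (4, 8):
    print(f"{side}x{side}: {decoded_block_bytes(side, side)} decoded bytes, "
          f"{expansion_ratio(side, side)}x expansion")
```

The same 16 compressed bytes become 64 decoded bytes for a 4x4 block and 256 for an 8x8 block, so any cache sitting after the decoder would hold a quarter as many 8x8 blocks.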

And by “silicon area” do you mean the extra silicon required for the increased computational complexity of ASTC versus simpler texture compression formats? Suggesting the bottleneck is the point where the texture is being decoded? The 8x8 case would require more time to decompress?

And I am also just legitimately curious, and not necessarily arguing for any particular position.

No worries :slight_smile:

What you say makes sense, but experience has taught me that, on mobile hardware, bigger textures will tank the framerate independently of the actual size in memory of the texture. (of course, I’m talking about compressed textures.)
I don’t know the intricacies of it, but I suppose it can be related to the texture cache somehow.
I’m no SoC engineer after all :stuck_out_tongue:
What I DO know is, nowhere have I read using 10x10 compression blocks would boost performance over 4x4. Even if technically it should, something must prevent it.
And about ASTC, I have read some technical papers about it and it’s actually quite close to DXT1 compression at runtime but more flexible. Where it differs drastically is in the “bake” process, trying lots of different compression combinations until it finds the best looking one for each compression block. (that’s why it takes AGES to compress)

That’s what I suspect. But until someone from Qualcomm or Nvidia shows up, we can only guess :stuck_out_tongue:

Definitely not the case, as per everything I have read about ASTC: it may be more expensive to manufacture but it’s not slower than any other format at runtime decompression.

-EDIT: well, now that I think about it, maybe 8x8 takes more time or resources to decompress than 4x4? I haven’t noticed it in my project though (and I’ve been full on ASTC for more than a year now)

Yeah, basically. Simply theoretical unless I can find public technical documentation on Adreno or someone else who actually knows responds. GPUs are under a lot of pressure to simplify as much as possible to reduce their overall footprint & power usage. Desktop GPUs for example have enough silicon in them to be able to do trilinear (or even some levels of anisotropic) filtering for “free” in that there’s literally no difference between point sampling and trilinear as the sampler takes the same amount of time to do either. Mobile GPUs tend to take two “cycles” to do trilinear, basically re-running the bilinear sampling fixed function hardware twice.

I’ve heard from some people on the hardware side of things that there were some complaints about ASTC being chosen as the “official” GLES 3.0 format due to its complexity, and thus the amount of dedicated silicon it would need to decode. Hence why I wonder if ASTC 4x4 RGB / RGBA has dedicated silicon to decode in a single cycle, and other block sizes and formats maybe don’t and require two or more cycles, just to reduce the silicon usage for support. Even though on paper they should all be able to be decoded in equal time.

See this recent comment on twitter about ASTC:
https://twitter.com/richgel999/status/1179012491621236738?s=20

I did find a 2017 Samsung patent on low-power texture architecture with ASTC, and it seems like there are several different cases where the cycles to decode can range from 1 to 4, and it decodes between the L1 and L0 cache. The Adreno doesn’t necessarily work the same way, but I’m assuming it’s in the same ballpark.

It does seem to be the case that ASTC 4x4 decodes faster than 8x8, but the actual per-texel cost for decoding should be roughly the same because the 4x4 texture has to decode 4 times as many blocks for a given texel density. Mip-mapping ensures that the texel density is going to be mostly the same for the majority of the screen.

However, a 4x4 texture hits mip level 0 further from the camera, and starts making more use of the cache sooner. By the time the 8x8 texture hits mip level 0 the 4x4 texture has a 4:1 advantage and it maintains this advantage as textures get closer to the camera. For a typical scene with a nearby wall or floors this could be a significant percentage of the screen.

My intuition is still that whatever the difference in decoding speed may be, reusing texels from the cache is always going to be significantly faster and the 4x4 texture potentially has the advantage of one-quarter the texel density for a large portion of the screen so I suspect that is the dominant factor at work here.
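
The 4:1 figure can be reproduced with a simplified mip-selection model in Python. This is a sketch only: it assumes isotropic sampling, picks the level from a single `uv_per_pixel` value (a hypothetical stand-in for the shader's UV derivatives), and ignores trilinear blending.

```python
import math

def sampled_texels_per_pixel(tex_size, uv_per_pixel):
    """Unique texels (by area) under one screen pixel at the selected mip.

    uv_per_pixel is the UV distance one pixel covers along one axis.
    """
    lod = max(0.0, math.log2(tex_size * uv_per_pixel))  # no level below mip 0
    texels_per_axis = tex_size * uv_per_pixel / (2.0 ** lod)
    return texels_per_axis ** 2

# A larger denominator means the surface is closer to the camera.
for denom in (512, 1024, 2048, 4096):
    step = 1.0 / denom
    two_k = sampled_texels_per_pixel(2048, step)
    one_k = sampled_texels_per_pixel(1024, step)
    print(f"uv step 1/{denom}: 2k -> {two_k}, 1k -> {one_k}")
```

While both textures are minified, each pixel lands on about one texel. Once the 1k texture is clamped to mip 0 and magnified, the 2k texture touches four times as many unique texels per pixel, and that ratio holds as the surface gets closer.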

That being said, if anyone wants to go to the trouble of building some tests and grabbing some benchmarks, I am more than happy to be proven wrong in exchange for seeing them.

I’m definitely going to end up doing some tests like this soon, so maybe you’ll get your wish.

I found this OpenGL extension that talks about ASTC and cache pressure:

This extension will be supported on Unity 2020.1 on Android (I suppose it will be automatically applied to LDR ASTC textures)
It’s not exactly what we were talking about here, but it shows ASTC is actually decompressed to the cache. (EDIT: so bigger texture resolution = more cache usage and less performance, independently of the compressed size on memory)
This extension should boost performance significantly on texture heavy apps. Weee! :smile:

I was instructed to use smaller imagery (less compression) over larger w/ more (when memory was comparable) due to cache misses. I thought it was because the sample had to run through all texels in Morton order (Z-order curve - Wikipedia)

perhaps as mentioned, filtering before mapping too
does this paper help?

Texture Caches - Michael Doggett
Department of Computer Science -Lund University

Mmm… I think I got it wrong.
I thought the sentence “reducing precision of the decoded result reduces the size of the decompressed data, potentially improving texture cache performance and saving power” implied the texture was decompressed to the cache, but that would be silly, right? :stuck_out_tongue:

Instead, I think what they actually meant is the texture cache will now have to write half as many bits as before when decompressing the texture from the cache, consuming half the bandwidth.
So the texture is not actually decompressed TO the texture cache (which would be a huge waste of memory) but it gets decompressed right out of the texture cache to undergo bilinear filtering and rasterization, and that’s where the bandwidth limitation can hit.

So, YES, bigger textures will eat more texture cache bandwidth, however the bottleneck is not inside the texture cache, but in the output bandwidth of the texture cache (which has to write more data than with smaller sized textures), where ASTC is further penalized by having to decompress to 16-bit values instead of 8-bit, thus doubling the bandwidth usage, adding insult to injury. (at least on Android)

Given the severe bandwidth limitations of mobile GPUs, using big ASTC textures can easily kill performance IF the bandwidth limit of the texture cache is reached. (if it’s not, you will not see a performance drop. Thus making this issue hard to notice early on.)
This should be greatly mitigated on Unity 2020.1 by the use of the new extension that lets ASTC decompress to 8 bits per channel instead of 16, but larger sized textures will always consume more bandwidth than smaller sized ones, independently of their memory footprint. So you have to be extra careful when using them so as not to go beyond the bandwidth limit or your framerate could be divided by two. :smile:
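
A back-of-the-envelope Python sketch of the 16-bit versus 8-bit decode cost per bilinear tap. It assumes four channels and four texels per tap, and says nothing about where in the pipeline a given GPU actually pays this cost.

```python
def bilinear_tap_bytes(bits_per_channel, channels=4, texels=4):
    """Decoded bytes produced for one bilinear tap (4 texels, RGBA assumed)."""
    return texels * channels * bits_per_channel // 8

default_decode = bilinear_tap_bytes(16)  # 16 bits per channel
reduced_decode = bilinear_tap_bytes(8)   # 8 bits per channel
print(default_decode, reduced_decode, default_decode / reduced_decode)
```

Under these assumptions the decoded data per tap halves, which matches the "half the bandwidth" reading above, but only for the decoded side; the compressed fetch from memory is unchanged.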

As I said before, you can optimize your way to big texture usage (to some extent) by very aggressively avoiding overdraw and adjusting mipmaps for smaller mips to kick in before they should.
Also, the fewer textures you use per pixel the better, as this frees up texture bandwidth for your large textures.
At least, that’s what I do. And it works wonders. :stuck_out_tongue:
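
The mip adjustment can be illustrated with a small Python sketch of how a positive LOD bias changes the level that gets sampled. The selection formula is a simplified model (one `uv_per_pixel` value, no trilinear blend), and the function name is hypothetical.

```python
import math

def mip_resolution(tex_size, uv_per_pixel, bias=0.0):
    """Side length of the mip level picked for a given UV step and LOD bias."""
    top_level = int(math.log2(tex_size))          # index of the 1x1 level
    lod = math.log2(tex_size * uv_per_pixel) + bias
    level = int(min(max(lod, 0.0), top_level))    # clamp to the chain, round down
    return tex_size >> level

step = 1.0 / 2048  # one texel of the 2k texture per screen pixel
for bias in (0.0, 1.0, 2.0):
    print(bias, mip_resolution(2048, step, bias))
```

Each +1 of bias halves the sampled resolution per axis, so a 2k texture with a bias of 1 is read like a 1k texture at the same distance, trading sharpness for fewer unique texels per pixel.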

I forgot to say: That paper is the most concise GPU architecture overview I have ever read!
Thanks @Torbach78 !

What were your conclusions if I might be so bold to ask?

https://developer.valvesoftware.com/wiki/Valve_Time

Did you ever get a chance to do some kind of testing on this?

I’m in the process of porting my game to Quest and noticed that some of my 2k 8x8 textures look sharper than my 1k 4x4 textures and I wanted to use more 2k textures... but I don’t want to fall into a trap of much worse performance.

Sadly no. Shortly after this it was on to prototyping the next projects, and I’ve not had time to do any performance testing on the Quest. I was hoping to get ahold of a Quest 2 to do testing on that at the same time, but for various reasons that has not happened yet.