ForwardRenderer unnecessary frame buffer load / stores

This is primarily an issue on TBDR architectures, where frame buffer loads and stores are extremely expensive. I will give numbers below, but first here is an outline of what is happening in the forward renderer:

Outline:

  • Shadow Map Passes
  • Depth Pre-Pass (Not executed with my config)
      • Store & Resolve (Haven’t tested but based on how other passes work, probably a safe assumption)
      • Probably causes unneeded load of depth buffer
  • Opaque Forward Pass
      • Potentially load pre-pass depth
      • Store MSAA
      • Resolve & Store [Unused in subsequent passes]
  • Skybox Forward Pass
      • Load Stored MSAA [Unneeded Load]
      • Store MSAA
      • Resolve & Store [Potentially unneeded]
  • Opaque Capture pass
      • Load resolved from skybox pass
      • Downsample & Store
  • Transparent Forward pass
      • Load Stored MSAA Framebuffers from Skybox Pass [Potentially unneeded load]
      • Store MSAA
      • Store Resolved [Unused]
  • Postprocessing
      • SMAA
          • Store MSAA [Unused]
          • Store Resolved

Timings: [iPhone 8s]
Used Xcode frame capture tools for analysis and timings.

  • Opaque Pass: ~7.6ms
      • Draw call total: ~5.3ms
      • Store overhead: ~2.3ms
  • Skybox pass: ~4.5ms
      • Draw call total: 30us
      • Load / Store overhead: ~4.4ms
  • Opaque capture pass: ~0.5ms
  • Transparent pass: ~3.5ms
      • Draw call total: ~300us
      • Load / Store overhead: ~3.2ms
  • Post process SMAA: ~2.5ms
      • Draw call: ~1.8ms
      • Store unused MSAA overhead?? < 0.7ms
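Adding up the figures above gives a rough sense of scale. This is only a quick Python sketch: it assumes the SMAA store costs its full 0.7ms upper bound and treats the opaque capture pass as having no separable overhead, since neither was measured. The shadow map passes are not included because no timings were taken for them.

```python
# Timings in milliseconds, copied from the list above.
# Each entry: (total pass time, load/store overhead).
pass_timings = {
    "Opaque": (7.6, 2.3),
    "Skybox": (4.5, 4.4),
    "Opaque capture": (0.5, 0.0),     # assumption: no breakdown was measured
    "Transparent": (3.5, 3.2),
    "Post process SMAA": (2.5, 0.7),  # assumption: upper bound of "< 0.7ms"
}

total_ms = sum(total for total, _ in pass_timings.values())
overhead_ms = sum(overhead for _, overhead in pass_timings.values())
overhead_share = overhead_ms / total_ms

print(f"Total for the listed passes: {total_ms:.1f} ms")
print(f"Load/store overhead:         {overhead_ms:.1f} ms")
print(f"Share spent on load/store:   {overhead_share:.0%}")
```

Under those assumptions, more than half of the time in the listed passes goes to loads and stores rather than draw calls.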

Fixes:

  • Combine Opaque & Skybox pass
      • Removes store overhead in opaque pass: ~2.3ms
      • Removes load overhead in Skybox pass: ~2.2 - 4.4ms (2.2ms if the opaque capture pass is enabled)
  • Potentially combine Opaque, Skybox and Transparent
      • Only possible if there is no opaque capture pass
      • Saves an additional ~3.2ms on top of the combined Opaque & Skybox pass
  • Post Process SMAA:
      • Don’t store MSAA
      • Savings: < 0.7ms?
  • It seems like the default is store & resolve after each RenderPass?
      • Should only store based on subsequent pass needs?
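To make the last point concrete, here is a minimal sketch of choosing store actions from what later passes actually read. It is not URP code: `RenderPass`, `plan_store_actions`, the attachment names, and the read/write sets are all hypothetical, picked to mirror the first part of the outline above. `live_out` stands in for attachments that later work (for example the transparent pass) still needs.

```python
from dataclasses import dataclass


@dataclass
class RenderPass:
    name: str
    writes: frozenset
    reads: frozenset = frozenset()


def plan_store_actions(passes, live_out):
    """Return {(pass name, attachment): "Store" or "DontCare"}."""
    plan = {}
    for i, current in enumerate(passes):
        for attachment in sorted(current.writes):
            # Default: keep it only if something after these passes needs it.
            action = "Store" if attachment in live_out else "DontCare"
            for later in passes[i + 1:]:
                if attachment in later.reads:
                    action = "Store"      # a later pass loads it
                    break
                if attachment in later.writes:
                    action = "DontCare"   # overwritten before anyone reads it
                    break
            plan[(current.name, attachment)] = action
    return plan


# Hypothetical dependency table, not taken from URP's source.
opaque = RenderPass("Opaque", frozenset({"msaa_color", "resolved_color"}))
skybox = RenderPass("Skybox", frozenset({"msaa_color", "resolved_color"}),
                    reads=frozenset({"msaa_color"}))
capture = RenderPass("OpaqueCapture", frozenset({"opaque_texture"}),
                     reads=frozenset({"resolved_color"}))

live_out = {"msaa_color", "opaque_texture"}
plan_with_capture = plan_store_actions([opaque, skybox, capture], live_out)
plan_without_capture = plan_store_actions([opaque, skybox], live_out)

for key, action in plan_with_capture.items():
    print(key, action)
```

In this toy model the opaque pass never needs to store its resolved target, and the skybox pass only needs to store its resolved target when the opaque capture pass is present, which matches the [Unused] and [Potentially unneeded] annotations in the outline.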


Hey!

There has been quite a lot of TBDR MSAA bandwidth optimization work for the 21.2 and 22.1 releases and most of the load/store issues have been addressed.

Some of the main changes:

  • MSAA depth resolve capability has been added to Vulkan and Metal

  • A new “Store Actions” optimization option has been added in this PR. Choosing “Discard” hints URP to prefer “DontCare” store actions when possible. The main reason we still offer the “Store” option is to make sure that projects relying on a specific RT being stored for reuse later in the frame will still work.

  • A new “Copy Depth Mode” option has been added in this PR (which also introduces MSAA depth resolve support). It lets the user specify whether the scene’s depth texture should be copied after the transparents pass. Choosing to copy the depth after the transparent pass instead of the opaque pass (the old behaviour) allows URP to merge the Opaque, Skybox and Transparent passes, requiring far fewer store operations and greatly reducing bandwidth usage.

  • The depth prepass has been disabled and the scene depth is reused or copied instead, when possible.
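To illustrate why the depth copy point matters for pass merging, here is a toy Python model. It is not URP code: `build_render_passes` and the stage names are hypothetical, and the model only assumes that copying depth forces the current render pass to end so its attachments can be stored.

```python
def build_render_passes(copy_depth_after):
    """Group draw stages into render passes; a depth copy forces a pass break."""
    stages = ["Opaque", "Skybox", "Transparent"]
    render_passes, current = [], []
    for stage in stages:
        current.append(stage)
        # Assumption: the copy reads the depth attachment, so the pass must
        # end (and store) right after the stage the copy follows.
        if stage == copy_depth_after:
            render_passes.append(current)
            current = []
    if current:
        render_passes.append(current)
    return render_passes


after_opaques = build_render_passes("Opaque")
after_transparents = build_render_passes("Transparent")

# Every boundary between passes costs a store followed by a load on TBDR.
print(after_opaques, "boundaries:", len(after_opaques) - 1)
print(after_transparents, "boundaries:", len(after_transparents) - 1)
```

In this simplified view, copying after the opaques leaves one store/load boundary in the middle of the frame, while copying after the transparents leaves none.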

You can check the results and GPU captures of the optimizations in the PR descriptions.

As we are continuing to work in this area, expect more info and documentation to be released soon.


Hey Manuele!

Thanks for the reply! The improvements you have been working on look like they should address my concerns and will be a big improvement for TBDR devices. Thanks for all of your hard work!