Graphics memory optimizations for DX12, Vulkan and Metal

Greetings from the Unity graphics team,

Unlike older APIs (DirectX11, OpenGL), modern graphics APIs (DirectX12, Vulkan, Metal) provide low-level control over memory allocations and transfers. This introduces new challenges and opportunities for reducing memory overhead.

We observed that high memory usage can be a common cause of lower GPU performance on modern APIs, and have been working on new optimizations which we would like to share, along with general tips.

Before looking at these optimizations, it is useful to quickly review the memory architecture and characteristics of different platforms, along with the impact of memory consumption and bandwidth usage on GPU performance.

GPU memory architecture at a high level

Discrete GPUs, like those found in your average gaming PC, include their own dedicated graphics memory (VRAM), which the GPU can access with very high bandwidth. This allows the GPU to read and write a huge amount of data all at once. Powerful gaming GPUs can have memory bandwidth of over 1000 GB/s.

Integrated GPUs, like those found in mobile devices and PCs without dedicated GPUs, utilize a shared / unified memory architecture. During rendering the GPU will read and write directly to system memory when needed. Mobile GPUs are often constrained by lower memory bandwidth which has to be shared between the CPU and GPU:

While the dedicated GPU enjoys very high memory bandwidth when accessing VRAM, transferring data to system memory will be limited by PCI express bandwidth. PCIe 4.0 with x16 slots has a limit of around 32 GB/s (unidirectional), much lower than the several hundreds of GB/s found in modern GPUs:

Exceeding the VRAM budget on dedicated GPUs will lead to memory bandwidth bottlenecks. The GPU will frequently run out of work to execute, and shader cores will sit idle while waiting for the needed data to be read from system memory. This leads to wasted GPU cycles, lower throughput and a large GPU performance hit.
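To put these numbers in perspective, here is a rough back-of-the-envelope sketch of how long the same amount of data takes to move at each bandwidth. The 32 GB/s and 1000 GB/s figures are the ones quoted above; the 256 MiB payload is an arbitrary example, and real transfers will not reach the theoretical peak.

```python
# Rough transfer-time estimate at theoretical peak bandwidth.
# Bandwidth figures are taken from the post; the payload size is a made-up example.
PCIE4_X16_GB_PER_S = 32.0    # PCIe 4.0 x16, one direction
VRAM_GB_PER_S = 1000.0       # high-end discrete GPU memory


def transfer_time_ms(size_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Time in milliseconds to move size_bytes at the given bandwidth (1 GB = 1e9 bytes)."""
    return size_bytes / (bandwidth_gb_per_s * 1e9) * 1000.0


payload = 256 * 1024 * 1024  # 256 MiB of resources that spilled out of VRAM
pcie_ms = transfer_time_ms(payload, PCIE4_X16_GB_PER_S)
vram_ms = transfer_time_ms(payload, VRAM_GB_PER_S)

print(f"Over PCIe: {pcie_ms:.2f} ms")
print(f"From VRAM: {vram_ms:.2f} ms")
print(f"Slowdown:  {pcie_ms / vram_ms:.2f}x")
```

Under these assumptions, reading 256 MiB over PCIe takes roughly half of a 16.7 ms frame budget at 60 FPS, which is why spilling out of VRAM is so costly.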

Common reasons for high memory usage and general tips

Graphics memory usage will grow based on factors such as:

  1. Render texture resolution, affected by the render scale
  2. Number of render textures, affected by the render pipeline complexity
  3. Size and number of 2D textures
  4. Density and number of 3D models
  5. Number and size of allocated graphics buffers

You can minimize graphics memory usage by optimizing your textures and models. Enabling texture compression will reduce texture memory and bandwidth usage. When using a large number of textures, we recommend using Mip Map Streaming, which allows you to configure the runtime memory budget for textures and mip levels: Unity - Manual: Configure mipmap streaming

You can also reduce the Render Scale to scale down your render textures. Upscaling filters such as Unity’s STP and AMD’s FSR can then be used to preserve image quality:

One thing to consider is the behavior of Dynamic Resolution, which will currently allocate render textures using the maximum output resolution. To minimize render texture resolution, use “render scale” instead. We are actively working on a Dynamic Res optimization to limit the maximum RT resolution to fit the maximum fixed upscale resolution.
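As a rough illustration of why the render scale matters so much, render texture memory grows with the square of the scale. The sketch below is a simplified estimate, not Unity's actual allocation logic: the 4K output, the 8 bytes per pixel (for example a 16-bit float RGBA target) and the rounding are assumptions, and it ignores MSAA, mip chains and driver padding.

```python
# Simplified estimate of a single render texture's size at a given render scale.
# Resolution, pixel format and rounding are illustrative assumptions.
def render_texture_bytes(width: int, height: int, bytes_per_pixel: int,
                         render_scale: float = 1.0) -> int:
    scaled_w = max(1, round(width * render_scale))
    scaled_h = max(1, round(height * render_scale))
    return scaled_w * scaled_h * bytes_per_pixel


full = render_texture_bytes(3840, 2160, 8)            # native 4K
scaled = render_texture_bytes(3840, 2160, 8, 0.75)    # render scale 0.75

print(f"Scale 1.00: {full / 2**20:.1f} MiB")
print(f"Scale 0.75: {scaled / 2**20:.1f} MiB")
print(f"Ratio:      {scaled / full:.4f}")  # 0.75 squared
```

The saving applies to every render texture allocated at the scaled resolution, so it adds up quickly in pipelines with many intermediate targets.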

Buffer allocations are trickier and can come from numerous sources:

  • Graphics buffers, created by rendering code and scripts
  • Graphics buffers, created by systems such as Shuriken and VFX Graph
  • Scratch buffer allocations, created by the engine when recording draws and when binding new shaders

In your own rendering scripts, it is generally advised to create a smaller number of larger Graphics and Compute Buffers. Also be mindful of the number of Shuriken and VFX Graph effects you create, as these can lead to a sharp increase in memory usage. When using VFX Graph, we recommend you utilize the new Instancing option to reduce memory usage: Instancing | Visual Effect Graph | 17.4.0

Scratch buffer allocations can grow significantly with poor draw call batching. We recommend you make sure the SRP batcher is being used effectively, to minimize the number of set pass and shader binding calls. Note that the older “Built in Render Pipeline” does not support the SRP batcher. You can use the Editor’s Statistics window to track the number of draw calls, batches and set pass calls. This view is now updated in Unity 6.4 to provide more accurate and useful metrics:

Tight alignment for DX12

The number of scratch buffer allocations can quickly grow based on draw calls and shader binding calls. We profiled numerous projects and saw up to 30,000 scratch buffer allocations in more complex PC games.

Historically, the DX12 graphics API enforced very strict minimum buffer alignment requirements. Unlike Vulkan, buffer alignments for DX12 were fixed constants, and not defined based on the GPU and its reported capabilities. DX12 forced a 64KiB alignment for buffers and textures. While D3D12_SMALL_RESOURCE_PLACEMENT_ALIGNMENT exists, it cannot be used for buffer allocations, only for small textures that fit certain requirements.

This can lead to significant memory waste which grows with the number of resources. The limitation is now addressed in the latest Agility SDK version which introduces support for Tight Buffer Alignment for DX12: Agility SDK 1.716.0-preview: Tight Alignment of Resources - DirectX Developer Blog (out of preview in 1.618 Agility SDK)
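To see how alignment padding scales with the number of resources, here is a minimal sketch that rounds each buffer up to its alignment and sums the wasted bytes. The 64 KiB alignment is the one described above; the 4 KiB buffer size, the 256-byte "tight" alignment and the one-resource-per-buffer layout are assumptions made purely for illustration (the actual tight alignment is reported by the driver).

```python
# Estimate memory lost to alignment padding when every buffer is its own resource.
# Buffer sizes and the tight alignment value are hypothetical.
DEFAULT_ALIGNMENT = 64 * 1024   # 64 KiB, the historical DX12 requirement
TIGHT_ALIGNMENT = 256           # assumed value for illustration only


def align_up(size: int, alignment: int) -> int:
    """Round size up to the next multiple of alignment."""
    return (size + alignment - 1) // alignment * alignment


def padding_waste(sizes, alignment: int) -> int:
    """Total bytes of padding added across all buffers."""
    return sum(align_up(s, alignment) - s for s in sizes)


# 30,000 allocations, the upper figure mentioned above, at a hypothetical 4 KiB each.
buffer_sizes = [4 * 1024] * 30_000

waste_default = padding_waste(buffer_sizes, DEFAULT_ALIGNMENT)
waste_tight = padding_waste(buffer_sizes, TIGHT_ALIGNMENT)

print(f"64 KiB alignment: {waste_default / 2**30:.2f} GiB of padding")
print(f"Tight alignment:  {waste_tight / 2**30:.2f} GiB of padding")
```

With small buffers the padding dwarfs the payload: in this made-up case 4 KiB of data occupies 64 KiB, so most of the allocated memory is never used.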

To avoid potential bandwidth bottlenecks, resources may live in graphics memory even if not used by the current frame. Profiling a complex PC game with a lot of buffer allocations showed around 2GB reduction in allocated and unbound graphics memory:

Tight buffer alignment is supported in the upcoming Unity 6000.4.0a3 and backported to 6000.3.0b7. Major GPU vendors have been rolling out support for this flag. Before testing this optimization, we recommend making sure your graphics drivers are up to date. You can fetch the latest Nvidia drivers here: NVIDIA GeForce Game Ready Drivers. AMD GPU support is also available in the latest drivers (Adrenalin Edition 25.9.1 and newer).

If your project exhibits significantly higher graphics memory usage when moving to DX12, we recommend using the new DX12 Device Filtering setting to increase the minimum driver requirement to more recent versions. This will ensure you are running DX12 on devices with Tight Alignment support. (To read more about this, check the recent Graphics device filtering for Vulkan and DX12.)

Important note: The RenderDoc debugger does not support Tight Alignment as of yet. Enabling a RenderDoc capture (through the Editor’s capture button, or by using RenderDoc directly) will disable the Tight Alignment flag. You will likely see higher memory usage as a result.

Optimized scratch buffer for DX12

We also optimized the DX12 scratch memory in Unity 6.2 by implementing smaller buffer allocations with less padding. Frame tracking was also introduced in order to delete unused buffers and free memory. This optimization reduced memory usage by around 7-25% in our tests.

Unity 6.4 will introduce additional optimizations to reduce the CPU overhead associated with smaller and more numerous buffer allocations. In our tests, we observe a CPU time reduction of up to 80% for buffer allocations. This upcoming change will also introduce new command line arguments which could be used to fine-tune the DX12 scratch buffer:

  • -d3d12-min-scratch-memory controls the minimum amount of scratch memory the allocator will retain regardless of usage. Defaults to 32 MiB.
  • -d3d12-min-client-scratch-memory controls the minimum amount of scratch memory for dynamic vertex buffers allocated from the main thread. Defaults to 3 MiB.
  • -d3d12-scratch-release-delay controls how many frames a buffer has to remain unused before it will be freed. Defaults to 200 frames.
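To illustrate how a minimum retained size and a release delay can interact, here is a toy model of a frame-tracked scratch pool. This is only a sketch of the general idea, not Unity's allocator: the ScratchPool class, its reuse policy and the 16 MiB buffer size are hypothetical, and only the 32 MiB and 200-frame defaults come from the list above.

```python
# Toy model of a frame-tracked scratch pool (hypothetical, not Unity's implementation).
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass
class ScratchBuffer:
    size: int
    last_used_frame: int


class ScratchPool:
    def __init__(self, min_retained_bytes: int, release_delay_frames: int):
        self.min_retained_bytes = min_retained_bytes
        self.release_delay_frames = release_delay_frames
        self.buffers = []

    def allocate(self, size: int, frame: int) -> ScratchBuffer:
        # Reuse a buffer that is big enough and not already used this frame.
        for buf in self.buffers:
            if buf.size >= size and buf.last_used_frame < frame:
                buf.last_used_frame = frame
                return buf
        buf = ScratchBuffer(size, frame)
        self.buffers.append(buf)
        return buf

    def total_bytes(self) -> int:
        return sum(buf.size for buf in self.buffers)

    def end_frame(self, frame: int) -> None:
        # Free buffers idle for long enough, but never drop below the minimum.
        total = self.total_bytes()
        kept = []
        for buf in self.buffers:
            idle_frames = frame - buf.last_used_frame
            can_shrink = total - buf.size >= self.min_retained_bytes
            if idle_frames >= self.release_delay_frames and can_shrink:
                total -= buf.size
            else:
                kept.append(buf)
        self.buffers = kept


pool = ScratchPool(min_retained_bytes=32 * MIB, release_delay_frames=200)

# Frame 0: a burst needs four 16 MiB buffers at once.
for _ in range(4):
    pool.allocate(16 * MIB, frame=0)
pool.end_frame(0)

# Frames 1..199: only one buffer is needed per frame, the rest sit idle.
for frame in range(1, 200):
    pool.allocate(16 * MIB, frame)
    pool.end_frame(frame)
bytes_before_release = pool.total_bytes()

# Frame 200: the idle buffers reach the release delay and are freed down to the minimum.
pool.allocate(16 * MIB, 200)
pool.end_frame(200)
bytes_after_release = pool.total_bytes()

print(f"Before release: {bytes_before_release // MIB} MiB")
print(f"After release:  {bytes_after_release // MIB} MiB")
```

In this model a longer release delay keeps burst memory around for reuse at the cost of a higher footprint, while a larger minimum avoids reallocation churn after quiet periods.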

Ray tracing memory usage

Ray Tracing can lead to a big increase in graphics memory usage, as we need to build and upload the Ray Tracing Acceleration Structure (RTAS) to GPU memory. Previous Unity 6 releases introduced important optimizations to the RTAS which lead to significant reductions in memory usage:

  • BLAS compaction reduces memory usage for static meshes
  • A new small-BLAS allocator reduces memory usage for small meshes and detail
  • The “Minimize Memory” flag can be set on a per-renderer basis to further reduce memory usage

We measured these optimizations with a large scene of around 6.7 million triangles:

To benefit from these optimizations, make sure you set the Mesh Renderer Ray Tracing Mode to “static” and also enable the new Minimize Memory flag:

Native Render Pass for Vulkan, Metal and DX12

Memory traffic can have a big impact on GPU performance, especially on mobile devices with very limited memory bandwidth. Transferring data between the GPU and system memory is also a big contributor to battery consumption and thermals. With a lack of active cooling, most mobile devices will quickly throttle the GPU when things get toasty.

Framebuffer load and store operations are a common contributor to per-frame memory transfers. This can be a critical bottleneck for mobile and XR devices which use a high display resolution. A full HD image with 3 load/store operations can result in ~48 MB of transfers per frame. This overhead will grow with pixel resolution, the number of render textures and render passes:

Modern graphics APIs were designed to address this problem with the introduction of Native Render Passes (NRP). Originally supported on Vulkan and Metal for mobile, this is now extended to DirectX12 for Windows on ARM. The NRP C# API allows the Universal Render Pipeline to merge compatible render passes together, significantly reducing memory transfers. With our previous example, we can eliminate 2 load/store operations for a ~66% reduction in framebuffer transfers:
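The figures in this example can be reproduced with some quick arithmetic. The sketch below assumes a 1920x1080 target at 4 bytes per pixel (for example RGBA8) and counts each load/store operation as one full-frame read plus one full-frame write; these assumptions are ours, chosen because they match the ~48 MB quoted above.

```python
# Back-of-the-envelope framebuffer traffic for the full HD example.
# Pixel format and the "one read + one write per operation" model are assumptions.
def framebuffer_traffic_bytes(width: int, height: int, bytes_per_pixel: int,
                              load_store_ops: int) -> int:
    frame_bytes = width * height * bytes_per_pixel
    return frame_bytes * 2 * load_store_ops  # load (read) + store (write)


unmerged = framebuffer_traffic_bytes(1920, 1080, 4, load_store_ops=3)
merged = framebuffer_traffic_bytes(1920, 1080, 4, load_store_ops=1)
reduction = 1 - merged / unmerged

print(f"3 load/store operations: {unmerged / 2**20:.1f} MiB per frame")
print(f"1 load/store operation:  {merged / 2**20:.1f} MiB per frame")
print(f"Reduction:               {reduction:.1%}")
```

At 60 FPS the unmerged case would move close to 3 GB of framebuffer data per second, which is a meaningful share of the bandwidth available on many mobile devices.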

Native Render Passes are integrated with URP via the Render Graph system in Unity 6. URP is now able to minimize the use of framebuffer load/store operations to improve efficiency and performance. A good example is Deferred rendering, where we can avoid a large number of load/store operations for the GBuffer textures, resulting in over 60% reduction in GPU transfers to/from memory.

It is recommended to check your render pass configuration with the new Render Graph Viewer tool (Window → Analysis → Render Graph Viewer), to track the number of render passes and render textures. Certain renderer settings can have a big impact on bandwidth usage. For example, setting the Depth Texture Copy mode to “After Transparents” can greatly reduce bandwidth usage when using the Depth Texture. To learn more, check the URP manual:

We are extending the Render Graph system to support additional rendering features, like the new “On-Tile Post Processing” renderer feature for Quest (Vulkan) in Unity 6.3.

On tile post processing for URP on Quest 3 (MSAA, x1.5 Render Scale, HDR)

Tracking and analyzing memory usage

If your project suffers from lower GPU performance when transitioning to modern APIs (DX12, Vulkan, Metal) we recommend you check your application’s graphics memory usage.

A good entry point is Unity’s own Memory Profiler (Window → Analysis → Memory Profiler), which provides a rough estimate of the total memory used, along with the portion used by the Graphics Device. This can be an effective way to validate the impact of changes in the Engine and your projects:

You may notice that a large sum of native graphics allocations can show as “Untracked”. We are currently unable to track these allocations in the Memory Profiler, but are looking into improving on this in future releases.

For more accurate and detailed data, we recommend using third-party profilers provided by GPU vendors. A good example is Nvidia’s Nsight Systems, which allows you to track VRAM consumption and memory traffic over time. AMD’s Radeon Memory Visualizer allows you to easily track graphics memory allocations and their associated resource type:

For mobile, we recommend using tools like Apple’s Xcode Metal Debugger (iOS/macOS) to track memory-related metrics:

Please try the latest optimizations and share your feedback!

34 Likes

Shouldn’t this discussion be marked as “Official”?

2 Likes

What a great breakdown and overview. If this all works as stated above, kudos to the team behind this. Great achievement! Personally love the new statistics.

4 Likes

Yes, this is the kind of improvement we need to see in every Unity version! Foundation-level improvements are essential and overhead is real. Thanks for that, and hope to see more in upcoming releases!

7 Likes

Absolutely! We fixed it.

2 Likes

It’s nice to see memory optimizations for DX12! I appreciate those a lot :grinning_face_with_smiling_eyes:

Will xbox benefit from tight alignment too?
It would be great if it was backported to 6.3.

Is Vulkan/Metal + Native Render Pass now the recommended way to deploy on mobile? Last time I tested it, some Android Devices exhibited a black screen with native render pass

1 Like

Will this come to HDRP, too? Isn’t the render graph system unified in Unity 6.3 already?

1 Like

It should be, yes, but Vulkan has issues on older Android devices due to poor drivers. You can now set up a filter for which devices do and do not use Vulkan.

It’s nice to see memory optimizations for DX12! I appreciate those a lot :grinning_face_with_smiling_eyes:

Will xbox benefit from tight alignment too?
It would be great if it was backported to 6.3.

The tight buffer alignment flag is available through Agility SDK and PC only. But this should be less of an issue on Xbox, which runs on a known/defined architecture.

Support for tight alignment is now back ported to 6000.3.0b7!

6 Likes

Is Vulkan/Metal + Native Render Pass now the recommended way to deploy on mobile? Last time I tested it, some Android Devices exhibited a black screen with native render pass

Native Render Pass is now used by default with the new Render Graph system for URP. Please share more information and submit a bug report if you experience issues!

Will this come to HDRP, too? Isn’t the render graph system unified in Unity 6.3 already

Native Render Pass merging is currently limited to URP, which is the recommended render pipeline for mobile devices that benefit from NRP and tile-based rendering techniques.

2 Likes

Is there any guideline as to which devices to filter? I would assume Unity will automatically decide between Vulkan and OpenGL3 based on the device support?

Not too sure tbh (I mainly dev for Quest)
But you can allow any device over a certain Android OS version or vulkan API/driver version, which likely will be quite broadly applicable

I would assume Unity will automatically decide between Vulkan and OpenGL3 based on the device support?

That is correct. The Vulkan graphics device applies device filtering code to pick the most optimal graphics API, falling back on OpenGLES when running on some older and more limited devices.

This mechanism (“device filtering”) is now exposed via the player settings in Unity 6.1, and can also set the optimal threading mode (see Unity - Manual: Introduction to Vulkan Device Filtering Asset). This can be used to extend Unity’s gfx device grading logic.

Unity 6.3 also extended this system to support DX12 (backported to the latest 6.2 versions!), see Unity - Manual: Introduction to D3D12 Device Filtering Asset. This lets us automatically fall back to DX11 on older PCs which may not benefit from DX12 (and/or may suffer from stability issues due to outdated drivers).

We will soon share a forum post with more details about graphics device filtering! Stay tuned.

6 Likes

Not sure if this is a bug or an issue with memory management, but since upgrading my project to the Unity 6.4 alpha release I get hit with “Vulkan - Suboptimal memory type used for buffer because of low memory” more than before on 6.3 beta 5. On DX12 there is no issue with buffer memory being low.



Edit: I ran my project in 6.3 beta with the global mipmap limit set to Eighth, and GPU memory usage never reached the full 7.4GB+. Since the upgrade to Unity 6.4 alpha, GPU memory usage doesn’t decrease no matter the global mipmap limit, even when switching from Eighth to Full and back. In 6.3 beta the issue didn’t exist, and switching between them would cause the GPU memory to drop, as seen in the screenshots.

global mipmap limit does something in 6.3 beta,
6.4 unity the option does nothing or very little.

Excited to see on tile post processing, though the (understandable) limitations around it make for some tough choices on whether to use it. Are you collecting metrics for this at more common render scales (not 1.5x)? And which level of MSAA was used for these tests? Thanks!

1 Like

Thanks for posting this.