Entities Graphics on Mobile: Tile Binning cost is higher compared to not using Entities Graphics

Is there anything that can be done (any tweaks to buffer sizes, etc.), or a reason why the Tile Binning costs would be higher when using Entities Graphics?

I'm testing the same scene with a fixed camera view: first with the regular URP approach, then with a version that has all the MeshRenderers in a Subscene so that it uses Entities Graphics. Both versions use Vulkan.

I can see that with Entities Graphics, the DrawSRPBatcher and overall RenderLoop costs are lower (~3.5 ms), yet in Perfetto the Tile Binning takes longer and the app no longer runs at framerate (18 ms total).

Without Entities Graphics, DrawSRPBatcher and the RenderLoop cost a bit more (~4 ms), but Tile Binning costs less and the app runs at framerate (12 ms total).

Which hardware are you running on? When running with Entities Graphics, what kind of instance counts do you see in the Frame Debugger? What kinds of shaders do your objects have? If you are using URP/Lit, I suggest trying a Shader Graph instead in case it helps.

@JussiKnuuttila thanks for the response, this is on Quest 2.
I'm using a basic Lit Shader Graph for all materials in the test scene.

There are 3 Hybrid Batch Groups; 2 have only 1 draw call each, but the 3rd, main group has 150 DrawInstanced calls & 150 instances, with 664,225 vertices & 1,477,359 indices (492,453 triangles).

Also from the Profiler:
SetPass Calls: 4
DrawCalls: 153
Batches: 150
Triangles: 500.2k
Vertices: 674.4k

I realize it's a lot of geometry, but the scene is performant without BRG / Entities Graphics (around 10-11,000 APP_T), so we were running a test to see how much more performance we could squeeze out of it using BRG.

It definitely reduced the CPU cost of issuing the draw calls in the Unity Profiler. The only thing on the Unity Profiler side that went up in cost is EarlyUpdate.XRUpdate (from ~9 ms without Entities Graphics to ~13 ms with). It runs at around 16,000 APP_T with BRG.

So then I looked in Perfetto and noticed the tile binning seemed to cost more overall. (I'm not sure if that is from BRG; does it access meshes in a different way on the GPU?)

BRG does not access meshes in a different way, but it does access transform matrices differently (through an SSBO on Vulkan, instead of loading from a regular UBO). I don't know exactly how the tile binning works, but it probably needs to access the transform matrix, so it could be affected by this difference.
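
To illustrate the difference, here is a rough, simplified sketch (not the exact Unity source; the byte layout is an assumption for illustration). The classic SRP Batcher path reads the matrix from a per-draw constant buffer, while the BRG/DOTS-instancing path on Vulkan computes a byte address per instance and loads from a raw buffer:

// Classic SRP Batcher path: matrix comes from a per-draw constant buffer (UBO).
cbuffer UnityPerDraw
{
    float4x4 unity_ObjectToWorld;
    // ... other per-draw values
};

// BRG / DOTS instancing on Vulkan: raw buffer (SSBO) indexed per instance.
// unity_DOTSInstanceData is a real Unity buffer name; the packing below is hypothetical.
ByteAddressBuffer unity_DOTSInstanceData;

float4x4 LoadObjectToWorldFromSSBO(uint instanceIndex)
{
    uint base = instanceIndex * 64u; // 16 floats * 4 bytes per matrix (assumed packing)
    return float4x4(asfloat(unity_DOTSInstanceData.Load4(base + 0)),
                    asfloat(unity_DOTSInstanceData.Load4(base + 16)),
                    asfloat(unity_DOTSInstanceData.Load4(base + 32)),
                    asfloat(unity_DOTSInstanceData.Load4(base + 48)));
}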

Trying out GLES might be a worthwhile experiment. BRG on GLES uses a UBO-based code path instead of SSBOs, so it could have different performance characteristics on some hardware, although it does instance-ID-based loads from the UBO, so the actual difference is usually small on the Android hardware that I have tested.
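
For comparison, the GLES-style fallback looks roughly like this (again a simplified sketch; the buffer name, array size, and packing are hypothetical). Note that the load is still dynamically indexed by instance ID rather than being a plain per-draw constant read:

// Hypothetical UBO fallback: instance data packed into a fixed-size
// constant-buffer array and dynamically indexed by instance ID.
cbuffer HypotheticalDOTSInstanceDataUBO
{
    float4 _InstanceData[1024];
};

float4x4 LoadObjectToWorldFromUBO(uint instanceIndex)
{
    uint base = instanceIndex * 4u; // 4 float4 rows per matrix (assumed packing)
    return float4x4(_InstanceData[base + 0],
                    _InstanceData[base + 1],
                    _InstanceData[base + 2],
                    _InstanceData[base + 3]);
}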

I see from the numbers that you have about 150 draw calls and about 150 instances, which sounds like your instances are being rendered one by one in single-instance draw calls. Unless this is expected (e.g. every instance has a unique mesh/material combination), it might be another worthwhile experiment to see whether those could be batched together into a smaller number of draw calls, which typically improves Android GPU performance, especially on Adreno-based hardware.

@JussiKnuuttila thanks again!
I tried with GLES31 and it is at framerate (about 11,000 APP_T). The thing I noticed is that on the CPU side rendering is slower, taking closer to 2 ms for the DrawOpaques section (whereas on Vulkan without BRG it was about 1 ms, and with BRG nearly half that).
Also, the profiler overall on GLES31 is more "unstable", fluctuating above and below 13 ms. On Vulkan it was much more stable and flat (both with and without BRG).

I also tried a test without the mesh-combined geometry, and it is still at framerate at 12-13,000 APP_T, but definitely stressing the GPU as it's at 99% usage.

When not using the mesh-combined geometry, there are 13 Hybrid Batch Groups. There are 2 big groups:
one with 257 DrawInstanced calls & 1,709 instances, and another with 235 DrawInstanced calls & 1,568 instances.
The rest are small, with ~10 instances or fewer.

AFAIK Tile Binning runs the vertex shader to check which tiles the geometry covers, but I also don't know much about it. I guess it seems reasonable that SSBOs might be slowing down that process on this hardware, but I also don't have insight into why. As far as I could see from the Perfetto traces, it was just the binning process that seemed huge.

Also, with BRG and GLES31 the binning was back at 12 ms, which is what I was seeing on Vulkan without BRG.

This sounds like SSBO vs UBO could be the key difference GPU-performance-wise here, which is definitely a surprising result. Would it be possible for you to share a repro project?

In scenes that have many copies of the same mesh (as you seem to have), combining meshes is likely not as profitable when using BRG, as it will increase mesh sizes and the amount of memory that the GPU has to load.

I would also suggest checking how many Hybrid Per Instanced properties your Shader Graph has. On PC-type hardware they don't have much extra cost, but on Android enabling this setting increases the GPU load of the shader. If there are unnecessary ones, disabling the setting might speed the shader up.
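
For context, a property flagged as Hybrid Per Instanced roughly compiles down to a per-instance load via Unity's DOTS instancing macros, something like the following (simplified; _BaseColor and _Smoothness are just example property names):

UNITY_DOTS_INSTANCING_START(MaterialPropertyMetadata)
    UNITY_DOTS_INSTANCED_PROP(float4, _BaseColor)
    UNITY_DOTS_INSTANCED_PROP(float,  _Smoothness)
UNITY_DOTS_INSTANCING_END(MaterialPropertyMetadata)

// Every access now becomes a per-instance buffer load instead of a uniform read,
// so each extra instanced property adds GPU buffer traffic:
#define _BaseColor  UNITY_ACCESS_DOTS_INSTANCED_PROP(float4, _BaseColor)
#define _Smoothness UNITY_ACCESS_DOTS_INSTANCED_PROP(float,  _Smoothness)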

@JussiKnuuttila thank you, I will get a support ticket made so that I can provide a repro project.

Quick update here. I investigated this ticket recently and nailed the issue down to the way we sample lightmaps with DOTS_INSTANCING_ON shader variants.

For some reason, it seems that dynamically indexing the lightmap Texture2DArray is causing massive GPU performance drops, specifically with Vulkan on Adreno devices. It doesn't repro with either GLES or Mali GPUs.
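
To illustrate what "dynamically indexing" means here, a simplified sketch (not the exact URP source; lightmapUV is assumed to be the lightmap UV coming from the vertex data):

// Slow path: the slice index comes from per-instance data, so the
// Texture2DArray sample uses a dynamic (non-constant) index:
float3 bakedGI = SAMPLE_TEXTURE2D_ARRAY(unity_Lightmaps, samplerunity_Lightmaps,
                                        lightmapUV, unity_LightmapIndex.x).rgb;

// With the workaround described below, the index is a compile-time constant instead:
float3 bakedGIConst = SAMPLE_TEXTURE2D_ARRAY(unity_Lightmaps, samplerunity_Lightmaps,
                                             lightmapUV, 0).rgb;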

Quest 2 seems to be the most impacted device out of those which were tested. We are talking about a 35% perf regression of the total GPU frame time on basic scenes here.

The problem has been forwarded to Qualcomm and they were able to reproduce the issue internally. We are waiting for an update on their side. It's very possible it's a driver bug.

In the meantime, I'm not sure there is an easy workaround. If you are always using only one lightmap, you could technically just turn
#define unity_LightmapIndex UNITY_ACCESS_DOTS_INSTANCED_PROP(float4, unity_LightmapIndex)
into
#define unity_LightmapIndex 0
in "Packages\com.unity.render-pipelines.universal\ShaderLibrary\UniversalDOTSInstancing.hlsl".

That's obviously far from ideal, but I thought it would be worth mentioning. Other than that it probably just needs to be fixed at the driver level.
