SRP Batcher vs GPU Instancing

I realise that this is a topic that may have been discussed here many times already, and for that I sincerely apologise for bringing it up again, but I’ve been doing some experiments recently on comparing the SRP Batcher against GPU Instancing and I have some thoughts on the matter that I would like to share.

What prompted my investigation in the first place was that I need to render many meshes that also require per-instance material data. One of the well-known ways to set per-instance data efficiently is to use a MaterialPropertyBlock. However, as mentioned in many threads, and in the Unity Manual itself, using a MaterialPropertyBlock breaks the SRP Batcher by making it revert to individual draw calls. To remedy this, I found that the only way to set per-instance material data whilst still using the SRP Batcher is to create an instance of the material and apply the update there.

I ended up performing two tests. For both tests I created a script that instantiated 10,000 cube meshes spread evenly throughout the world. Every four seconds, I updated a colour material property on each cube instance with a random colour. For the first test, I used the SRP Batcher with the material clone technique to update the individual material property. For the second test, I used a shader with an instanced property, which in turn disables SRP Batcher compatibility, and enabled GPU Instancing. The instanced property was updated via a MaterialPropertyBlock. The results of the tests (performed on a Samsung Galaxy S7 International Edition, which uses the Exynos CPU) were as follows:

SRP Batcher:
Render Opaque Main Thread Time: 18.37ms
Render Opaque Render Thread Time: 82.05ms

GPU Instancing:
Render Opaque Main Thread Time: 8.55ms
Render Opaque Render Thread Time: 7.88ms

As you can see, a very stark difference!
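
To put numbers on that difference, here is a small sketch that only does arithmetic on the four timings quoted above. The helper names (`speedup`, `per_object_us`) are made up for illustration, and the per-object figure is a plain average that ignores any fixed per-frame overhead.

```python
# Timings copied from the results above, in milliseconds, for 10,000 cubes.
OBJECT_COUNT = 10_000
SRP_BATCHER_MS = {"main": 18.37, "render": 82.05}
GPU_INSTANCING_MS = {"main": 8.55, "render": 7.88}


def speedup(slow_ms, fast_ms):
    """How many times smaller the second timing is than the first."""
    return slow_ms / fast_ms


def per_object_us(total_ms, count=OBJECT_COUNT):
    """Average cost per object in microseconds (total time / object count)."""
    return total_ms * 1000.0 / count


main_speedup = speedup(SRP_BATCHER_MS["main"], GPU_INSTANCING_MS["main"])
render_speedup = speedup(SRP_BATCHER_MS["render"], GPU_INSTANCING_MS["render"])

print(f"main thread speedup:   {main_speedup:.2f}x")
print(f"render thread speedup: {render_speedup:.2f}x")
print(f"SRP Batcher render cost per cube:    {per_object_us(SRP_BATCHER_MS['render']):.3f} us")
print(f"GPU Instancing render cost per cube: {per_object_us(GPU_INSTANCING_MS['render']):.3f} us")
```

On these figures GPU Instancing is roughly 2.1x faster on the main thread and roughly 10.4x faster on the render thread.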

After an extensive search of the forums, the general opinion I found is that the SRP Batcher is intended to be a replacement for all existing render paths, which includes GPU Instancing. This is somewhat evidenced by the fact that GPU Instancing and the SRP Batcher are not compatible. However, my results show that GPU Instancing has its uses from a performance perspective. The other issue is the memory cost of instancing materials to hold per-instance material data. In my test I generated 26 MB of material data, since 10,000 material instances were created.
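
As a rough sanity check on that memory figure, the sketch below divides the reported total by the number of material instances and compares it with the size of the colour data alone. Two things here are assumptions rather than measurements: that the reported megabytes are 1024 * 1024 bytes each, and that a colour is stored as four 32-bit floats.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: one reported MB is 1024 * 1024 bytes
BYTES_PER_COLOUR = 4 * 4    # assumption: one RGBA colour as four 32-bit floats

MATERIAL_COUNT = 10_000     # material instances, from the test above
TOTAL_MATERIAL_MB = 26      # reported material memory, from the test above


def bytes_per_material(total_mb, count):
    """Average memory per material instance, in bytes."""
    return total_mb * BYTES_PER_MB / count


def colour_only_bytes(count):
    """Memory needed if only one colour per instance were stored."""
    return count * BYTES_PER_COLOUR


avg = bytes_per_material(TOTAL_MATERIAL_MB, MATERIAL_COUNT)
colours = colour_only_bytes(MATERIAL_COUNT)

print(f"average per material instance: {avg:.0f} bytes ({avg / 1024:.2f} KiB)")
print(f"colour data alone: {colours} bytes ({colours / 1024:.2f} KiB)")
```

Under those assumptions each material instance averages a bit over 2.6 KiB, while the colour values that actually differ per instance would fit in about 156 KiB in total.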

The ideal scenario here is that a combination of the two technologies is used: if there is a single instance of a mesh then it should use the SRP Batcher, and as soon as there are multiple instances of a mesh, it switches to GPU Instancing. However, given the different shader requirements of the two paths, the only way this can be done, as far as I know, is by managing the switch yourself, most likely via a shader switch.

So the main question I have is: are my thoughts on SRP Batcher/GPU Instancing usage correct? Also, am I correct about the method I use to set per-instance data with the SRP Batcher, or is there a better method that I am not aware of? Lastly, are there any upcoming changes that will improve the performance of instanced meshes with the SRP Batcher? (The last question is mostly directed at anyone at Unity.)

Curious to hear what others think of this. Many thanks!

SRP Batcher is not a replacement for instancing. That’s a misconception caused by terrible branding.

All it does is reduce the number of render state setup commands Unity issues between draw calls (which is horribly redundant by default), but it does not (and cannot) “merge” draw calls. If you have 10k objects, you’ll still produce 10k draw calls. They might be slightly cheaper than “standard” Unity draw calls because there will be fewer commands between them, but it’s still 10k calls to the GPU driver to draw stuff, and there’s no making that any cheaper.

If you’re in the business of drawing hundreds of the same thing, instancing and/or mesh merging are necessary.
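
The draw call counting described here can be sketched as a toy model. It is not Unity code: `srp_batcher_draw_calls` and `instanced_draw_calls` are hypothetical helpers, and the batch size of 500 instances per call is an illustrative assumption, since the real limit depends on the platform and on how much per-instance data the shader uses.

```python
import math


def srp_batcher_draw_calls(object_count):
    """The SRP Batcher does not merge draws: one draw call per object."""
    return object_count


def instanced_draw_calls(object_count, instances_per_call):
    """Instancing merges identical objects: one draw call per full or partial batch."""
    if instances_per_call < 1:
        raise ValueError("instances_per_call must be at least 1")
    return math.ceil(object_count / instances_per_call)


OBJECTS = 10_000
ASSUMED_BATCH_SIZE = 500  # assumption for illustration, not a Unity constant

print("SRP Batcher draw calls:", srp_batcher_draw_calls(OBJECTS))
print("Instanced draw calls:  ", instanced_draw_calls(OBJECTS, ASSUMED_BATCH_SIZE))
```

With those numbers the same 10k objects take 10,000 draw calls through the SRP Batcher and 20 with instancing, which is the gap that cheaper state setup between calls cannot close.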

Good point.
People should read a bit more about the SRP Batcher; there’s too much confusion.

Versus an answer from a Unity engine programmer (not a marketing guy):

“because in real life scene, batcher is slightly faster”.

Yeah… right.

It’s 2020 and while UE4 finally got automatic mesh instancing a couple versions ago, Unity takes it away in their “modern” renderer.

Unity is trying to solve this problem with the new Hybrid Renderer V2, which can transfer instanced data directly from entities to the GPU. It’ll be the standard scene format in the future. Dunno when it’ll be really production-ready, though.

I wouldn’t say the topic of SRP batching vs GPU instancing has been extensively discussed on the Unity forums, or perhaps I am bad at finding information. I’ve been looking for a lead on how to disable SRP batching per shader, and this thread is the first one I found! Thanks!

Could you share this benchmark project?

This kind of test is pointless: obviously instancing is faster. You’re drawing 10,000 identical cube meshes, and that’s exactly what instancing is for.
But in what game do you have 10,000 cubes with the same material?

The SRP Batcher is a lot more versatile in real-world use cases. Simply use both, depending on what you need to do: if you need to render 10,000 cubes, use instancing…

Cube = asteroids, trees, mountain details, city windows…

Yes, but you need to draw -a ton- of them.
That’s why the test is biased: drawing 10,000 identical meshes is obviously the perfect case for instancing. It makes no sense to test it against the SRP Batcher.

Profiling an actual in-game scene is the only way to know what’s best for your project. And you can use them together.

I’ve been meaning to post an update on this thread for a while, and given the recent activity I guess now is as good a time as ever! Since several comments have been made since the original post, I’m going to address an older one first.

My point about the SRP Batcher being the replacement for GPU Instancing was based on the incompatibility of the two systems. The current implementation of GPU Instancing, particularly with regard to setting per-instance data, requires a shader setup that breaks SRP Batcher compatibility, mainly because it needs a specific constant buffer layout. Given that the future of Unity rendering lies with URP, HDRP, and the SRP system in general, one can surmise that the current GPU Instancing solution is deprecated. @Bordeaux_Fox also provided a quote from a Unity dev somewhat confirming that there is a belief that the SRP Batcher should be performant enough not to require GPU instancing in general. However, my test, extreme as it was (I will touch upon this later), does highlight that GPU instancing may still be needed.

I understand the point you are trying to get to here, but I disagree with the comment that the tests are pointless.

The purpose of using such a high number was to stress test the two systems, and to ensure that the numbers I got from the profiler showed the difference in an obvious manner. Given that systems like these typically scale linearly (in my experience, and I should qualify that with “not always”), the large number helps to highlight the difference more than smaller instance counts might. In no way was it meant to replicate a “real world” rendering scenario.

Either way, if you still feel the test is pointless then that is ok. I would still like to imagine that there are others who might find the information useful.

Getting back to the main discussion itself, I’m planning on doing some further tests to determine the tipping point at which GPU instancing starts to outperform the SRP Batcher. If the number ends up being in the 100s, then I would argue that the SRP Batcher is a perfectly acceptable replacement for GPU instancing for most use cases (situations that require many 100s of instances would still be better served by GPU instancing, as already pointed out). However, if it takes only 10 instances for GPU instancing to outperform the SRP Batcher, then that is something to be concerned about.
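
For what it’s worth, under the linear-scaling assumption the tipping point can be estimated with a very simple cost model: the SRP Batcher pays a fixed cost per draw, while instancing pays a one-off setup cost plus a smaller cost per instance. This is only a sketch. The function names are hypothetical and the three cost values in the usage lines are made-up placeholders, not measurements; real values would have to come from profiling.

```python
def srp_batcher_cost_us(count, per_draw_us):
    """Linear model: every object costs one draw call."""
    return count * per_draw_us


def instancing_cost_us(count, setup_us, per_instance_us):
    """Linear model: fixed setup cost plus a small cost per instance."""
    return setup_us + count * per_instance_us


def break_even_count(setup_us, per_draw_us, per_instance_us):
    """Instance count at which both models cost the same.

    Solves count * per_draw_us == setup_us + count * per_instance_us.
    """
    if per_draw_us <= per_instance_us:
        raise ValueError("instancing never catches up if a draw is not dearer than an instance")
    return setup_us / (per_draw_us - per_instance_us)


# Placeholder values for illustration only (microseconds), NOT measured data.
SETUP_US = 350.0
PER_DRAW_US = 8.0
PER_INSTANCE_US = 1.0

tipping_point = break_even_count(SETUP_US, PER_DRAW_US, PER_INSTANCE_US)
print(f"instancing wins above {tipping_point:.0f} instances with these placeholder costs")
```

Plugging in profiled values for the three costs would show whether the tipping point sits nearer 10 or in the 100s for a given device and shader.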

Moving on to the second point that I made in my original post, regarding setting per-instance data, I still have not been able to find a way to achieve this without creating an instance of the material for each model instance in order not to break the SRP Batcher. I am aware of the UnityPerDraw constant buffer block, but my shaders make use of the UnityInput include file, which in turn adds Unity’s UnityPerDraw definition, and so I cannot override that. The main downside of having a material instance per model instance is that it is rather wasteful in memory, especially since each material can take somewhere between 1 and 2 KB. If anyone has a better way of setting per-instance data without breaking the SRP Batcher, I am all ears!

Where is machine learning when you need it? I don’t want to have to decide what the best option is. I’d like the editor to figure that out for me. That’s what “performance by default” should mean.

We’ve done a test with our game scene. Instancing produced about half as many draw calls as the SRP Batcher did, but the framerate with the SRP Batcher was significantly higher than with instancing.

Like other people in this thread said, the perfect use case for instancing is when you have a lot of identical meshes and materials. For that, instancing wins without a doubt. But for a normal use case, like a complicated game scene, the SRP Batcher wins, even if there are lots of identical objects (with 20-40 instances each).

For some special cases, like grass, which needs thousands of identical meshes with identical materials, we would use CommandBuffer.DrawMeshInstanced, along with the SRP Batcher for other objects, to achieve the best performance.

What’s the strategy for hundreds of skinned meshes?

If you are going to compare instancing to the SRP Batcher you need to use instanced indirect with the latest APIs for uploading data, like begin/end write. That puts instancing more on par, because you are using APIs that persist data and upload it as efficiently as possible.

So once you narrow things down to what SRP batcher can do that you can’t do with instancing, it’s a more level playing field.

But the best approaches come with a cost too. Really the biggest gain for most games is going to be the persistent data on the GPU. You get that without having to implement instanced indirect and culling and all that comes with it. Having multiple materials with the same shader is nice, but that is also relatively easy to optimize out in most games.

The downside to the batcher is that if you want to use it from code, the APIs are still pretty buggy. And there is a significant hit from using GameObjects vs the APIs.

We use instanced indirect for most stuff because once you factor in GameObject overhead and the state of the batcher APIs, it still wins out over a pretty big surface area, even if you don’t have high instance counts.

We use SRP batcher for one thing, character clothing. It’s a specific edge case where it does really well.

LOD strategies, mostly. Reduce bone counts, reduce bones per vertex, use the “legacy” Animation component instead of Animator Controllers. If you use cascaded shadows, disable shadow casting for distant units.

There are techniques for using GPU instancing with skinned meshes, but those are pretty hardcore and have you basically rolling your own animation system in shaders, storing animation data in textures and animation parameters in buffers which you index using SV_InstanceID. Those are necessary if you’re going beyond 200 characters.

There are also imposter techniques, for truly large numbers of units (several thousand). In those you only render a select few dozen units into render targets, then render distant (or all) units as 2D sprites using those render targets as textures. The biggest limitation is that you’ll have a lot of units sharing the exact same animation frame. This only works well in games where the units are seen from very far away, and you need some clever coding and design to make it less obvious.

Bushes, rocks, trees, etc. Your regular fantasy / story type 3rd-person game will have anywhere from hundreds to hundreds of thousands of each, and that’s when the level designers are commanded to behave themselves.

And you don’t need thousands for instancing to be faster. Depending on shaders, meshes, etc., the break-even point between them is somewhere between 50 and 150 draws.

This is in a project where absolutely every shader is optimized for the SRP Batcher. We are in the process of adopting a best-of-both-worlds approach, where stuff that exists in small amounts and would require customization for proper batching goes through the SRP Batcher (which is what it’s actually great at), and instanced indirect is used for stuff that exists in the hundreds and doesn’t move (no reason it couldn’t move, it’s just a particularity of the project where we can gain extra performance, since not too many things actually move).

While doing indirect for absolutely everything that isn’t completely insignificant would likely gain you performance and memory, you also have to balance that against time that could be spent developing something else with greater gains, obviously depending on project particulars.

Where is Nanite when you need it!

It’s in a different engine.