BatchRendererGroup on mobile

Does anyone know if there are particular settings/functions/etc. needed to make BatchRendererGroup work on Android?
I pretty much copy-pasted “BRGSetup” from Unity’s own repository, and it works perfectly fine in the Editor (and in PC builds). However, on actual mobile devices nothing is displayed, apart from the regular MeshRenderer’ed ground.

I tried with an OPPO A16s, a Samsung Galaxy S6, and even with BlueStacks and Nox just in case.
Scenes with regular MeshRenderer and DrawMeshInstanced both work fine, but not the BatchRendererGroup one.

It’s Unity 2022.2.12, so it “should” be compatible - but it doesn’t even raise an error.


Make sure to set the “BatchRendererGroup Variants” setting in Project Settings > Graphics to “Keep All” when using BatchRendererGroup, otherwise player builds will be missing the required shader variants.

If this doesn’t help, please try to run the application using both Vulkan and OpenGL to see if one of them works.

Thanks, but “Keep All” was already on, since it wouldn’t work when built for PC otherwise.

The fix was to use the second overload of BatchRendererGroup.AddBatch, the one that takes an additional offset and length.
Both must be set to 0 on regular platforms (when BatchRendererGroup.BufferTarget equals RawBuffer), and setting them to 0 and 16384 respectively made my scene work on all the aforementioned phones except the OPPO.
16384 was picked as the lowest SystemInfo.maxConstantBufferSize among all the tested phones (the Samsung’s).
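A minimal sketch of the call described above (the buffer, metadata, and BRG field names are placeholders for whatever your setup already uses; 16384 is the value from this post, not a universal constant):

```csharp
// Sketch: pick the extra AddBatch arguments per platform.
// On desktop (RawBuffer target) both extra arguments are 0; on the
// GLES3.1 phones a non-zero window size was needed.
uint bufferOffset = 0;
uint windowSize = (BatchRendererGroup.BufferTarget == BatchBufferTarget.RawBuffer)
    ? 0u
    : 16384u; // lowest SystemInfo.maxConstantBufferSize among the tested phones

// m_Brg is the BatchRendererGroup, m_InstanceBuffer the GraphicsBuffer,
// metadata the NativeArray<MetadataValue> built during setup.
BatchID batchID = m_Brg.AddBatch(metadata, m_InstanceBuffer.bufferHandle,
                                 bufferOffset, windowSize);
```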

I still have no idea what those values actually do (beyond the obvious “being an offset and size for some buffer somewhere”). I guess it’s neither the GraphicsBuffer nor the metadata array, since those are already arguments, but Googling only returns info about shader constant buffers that I’m not really sure are the same thing.

At least it’s working now, and much faster than MeshRenderers and DrawMeshInstanced, so the arcana of why it works can wait.

BRG is compatible with GLES3.1, where an SSBO can’t be used as the main GPU buffer. When running on GLES3.1, you should use a UBO to store the BRG main GPU data. There is no size limit on the UBO itself, only a “visible window size” limit; that’s why we added the offset and window-size parameters. For instance, you can create a 64 MiB UBO and sub-allocate from it. In that case you specify the start offset for each batch (this offset should always be an integer multiple of BatchRendererGroup.GetConstantBufferOffsetAlignment).

Also, instead of the hardcoded 16384 value, you can pass BatchRendererGroup.GetConstantBufferMaxWindowSize as the window-size argument and it should work on all your GLES3.1 devices.
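Putting those two points together, a sub-allocation sketch might look like this (m_Brg, m_Buffer, metadata, and batchByteOffset are placeholders; the AlignUp helper is hypothetical):

```csharp
// Sketch: sub-allocate a batch window out of one large buffer.
uint offset = 0;
uint windowSize = 0; // 0 = whole buffer, valid for the RawBuffer (SSBO) target

if (BatchRendererGroup.BufferTarget == BatchBufferTarget.ConstantBuffer)
{
    // GLES3.1 path: the data lives in a UBO and only a window of it is
    // visible per batch, so query the limits instead of hardcoding sizes.
    windowSize = (uint)BatchRendererGroup.GetConstantBufferMaxWindowSize();
    uint alignment = (uint)BatchRendererGroup.GetConstantBufferOffsetAlignment();

    // Any sub-allocation offset must be an integer multiple of the alignment.
    uint AlignUp(uint value, uint align) => (value + align - 1) / align * align;
    offset = AlignUp(batchByteOffset, alignment); // batchByteOffset: your allocator's cursor
}

BatchID batch = m_Brg.AddBatch(metadata, m_Buffer.bufferHandle, offset, windowSize);
```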

In Unity 2022.3, BRG is unable to render on some devices. The problem occurs both with the official demo (brg-shooter) and when I package my own code. The affected device is a Samsung Galaxy S5, which is old hardware but should support GLES3. I tested 2022.3.14f1 and 2022.3.18f1, and both exhibit the problem. When BatchRendererGroup.AddBatch is called, an error is thrown: “BatchRendererGroup.AddBatch windowSize parameter (16384) is over the limit allowed by the Graphics API (maxConstantBufferSize: 0)”.

The Galaxy S5 uses an Adreno 330, and despite supporting GLES3 (which alone is not always a reliable metric; shader model should count too, but that’s a bit off topic), Unity disabled constant buffer support, which BRG requires, for devices with Adreno 3xx GPUs.

Thanks for your helpful answer; my problem is solved. I can detect the device at runtime to bypass this problem.
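The error message above suggests one possible runtime guard; this is a hypothetical helper, assuming (as described in the reply) that affected devices report SystemInfo.maxConstantBufferSize as 0:

```csharp
// Sketch: skip the BRG path on devices where Unity disabled constant
// buffers (e.g. Adreno 3xx reports maxConstantBufferSize == 0).
static bool BrgIsUsable()
{
    if (BatchRendererGroup.BufferTarget == BatchBufferTarget.ConstantBuffer)
        return SystemInfo.maxConstantBufferSize > 0;
    return true; // RawBuffer (SSBO) target has no such limit
}
```

A fallback path (e.g. regular MeshRenderers or instanced drawing) can then be selected when this returns false.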

@arnaud-carre
Why is it not possible to use a UBO on PC, or on platforms where SSBOs are supported, and why is it not possible to specify the offset+size parameters on PC?
This is pretty limiting, to be honest. DOTS performance is not that great and requires a lot of setup and indirection, so when I want to avoid that by using custom shaders, there’s just no option to do it correctly.

Same for structured buffers: there’s no reason to limit buffer support to just raw buffers and not allow structured buffers.

If I just want to implement ordinary instancing, it’s not possible to do it efficiently with the current approach, and DOTS is not the best solution for instancing either.

Also, non-instanced draw calls seem to use visibleInstances buffers passed via UBO, so why not just allow binding custom data per draw call?
Since it’s not possible to bind a UBO with granularity finer than 16 bytes anywhere (and on NVIDIA the minimum is even larger, 256), why not just allow users to write any custom data they want within an arbitrarily aligned span of bytes?

BatchRendererGroup seems like a very well-designed architectural piece of engineering, but the implementation limitations make it much worse.


Hi!

Why is it not possible to use a UBO on PC, or on platforms where SSBOs are supported, and why is it not possible to specify the offset+size parameters on PC?

The main reason is to avoid adding a specific shader keyword (a UBO/SSBO mode for BRG) that would double shader variants and compilation time.
What’s your limitation with SSBO on PC?

Same for structured buffers, there’s no reason to limit buffer support to just Raw Buffers and don’t allow Structured Buffer.

I agree that structured buffers have nicer syntax, but from a hardware point of view a structured buffer is similar to a raw SSBO. That’s why we chose to support only raw buffers.

If I just want to implement usual instancing it’s not possible to do efficiently with the current approach, and DOTS is not the best solution for instancing either.

I’m surely missing some details about what you’re trying to achieve with BRG. Do you have more information about that? What would prevent BRG from doing instancing efficiently in your specific use case?

Also, non-instanced draw calls seem to use visibleInstances buffers passed via UBO, so why not just allow binding custom data per draw call?

I don’t know if it helps in your specific case, but you can also create several different SSBOs and pass each one as an argument to BatchRendererGroup.AddBatch.
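As a rough sketch of that suggestion (buffer sizes, element strides, and metadata arrays are placeholders), each batch can be backed by its own raw buffer:

```csharp
// Sketch: two independent SSBOs, one per batch.
var bufferA = new GraphicsBuffer(GraphicsBuffer.Target.Raw, countA, sizeof(float));
var bufferB = new GraphicsBuffer(GraphicsBuffer.Target.Raw, countB, sizeof(float));

// m_Brg is the BatchRendererGroup; each batch references its own buffer.
BatchID batchA = m_Brg.AddBatch(metadataA, bufferA.bufferHandle);
BatchID batchB = m_Brg.AddBatch(metadataB, bufferB.bufferHandle);
```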

Hope it helps


Hello, thank you for the response.

While a lot of devices support SSBOs, that doesn’t mean SSBOs are the best option on every platform, and even less that they are the best solution for a specific use case (again, changing how this works on the engine side would still allow implementing DOTS, but would also allow a broader range of solutions).

While it’s true that raw and structured buffers are similar, raw buffers don’t let the driver optimize load alignment, which it can do with structured buffers.

Lifting the limitation on binding buffer+offset (which is supported on ALL platforms) per fat draw call (visibility data/batch data) would allow implementing faster GPU culling, without duplicating materials (overhead) and without creating a lot of buffers (even more overhead).
Given that Unity still doesn’t support bindless for some reason, a lot of dispatches are needed to cull multiple buffers (and each dispatch also has immense CPU overhead in Unity), while with a buffer+offset scheme it’s entirely possible to cull everything in just a couple of dispatches.
The same goes for per-instance (or per-batch) data, especially when adding draw instructions for lights, where multiple fat draw calls are needed because it’s otherwise impossible to specify separate per-split instances.

Also, MDI (multi-draw indirect) is still not supported, even though it’s just a matter of hours to implement with a native plugin; of course, the native plugin API is limited and can’t access Unity’s compiled PSOs, which requires tons of hacks.
I know there’s a base-instance-ID problem on a lot of DX12 devices, but nobody is asking to support something the current platform doesn’t support, come on. Having a limited option is much better than having none.

I also don’t understand the limitation on per-fat-draw-call (the CPU side) data. I can only provide 4 bytes per instance (the visibility ID). Why isn’t it possible to provide custom per-instance data, given that visibility instance data is already written from CPU to GPU (16 bytes per instance)? It could just be a nice API accepting arbitrary data.

Per-fat-draw-call SRV bindings would be extremely powerful for the use cases where custom light/reflection-probe/light-probe/other volumes make sense (I know people suffering from this, and it takes tons of effort to do with hacks). The same goes for per-fat-draw-call constant buffer bindings.

The solution I’ve implemented does instancing not on a per-material basis but, unlike GRD, closer to per-instance properties, and I use AoS instead of SoA (unlike DOTS) for better cache coherency and cheaper loads (fewer indirections), since the texture load cache is not infinite.

Also, Unity for some reason doesn’t support copying part of a buffer to another part of a buffer, which is supported on every current platform that supports buffers.

P.S. I’m sorry if my critique sounds harsh; it’s mostly frustration, plus the knowledge that there are far better options in the form of low-hanging fruit that wouldn’t take long to implement.
I love the engine and want it to prosper and be used more frequently by a higher grade of developers.


Sorry, what would you like in BRG as an API for binding buffer+offset? Do you mean you would need an AddBatch(metadata, buffer, bufferOffset, windowSize) overload for SSBO mode too? Or are you thinking of another use case (at DrawCommand generation time)?

Regarding the 4 bytes per instance in the visibility buffer: you’re right, we could add more user data in the 16 bytes per instance. We may plan to extend this API. Could you elaborate on your use case? How much more data would you need? (We can’t promise anything about extending this, but it helps if we can gather user requests and use cases.)

Also, Unity for some reason doesn’t support copying part of a buffer to another part of a buffer, which is supported on every current platform that supports buffers.

You’re right, we’re thinking about adding such a feature too. Can’t promise anything but it’s a legit request.

I’m sorry if my critique sounds harsh; it’s mostly frustration, plus the knowledge that there are far better options in the form of low-hanging fruit that wouldn’t take long to implement.

No problem, it’s nice to get practical feedback on an API. By the way, things often take longer to implement than expected because of the generic game-engine nature of Unity, and also because of the wide platform range and backward compatibility.


Yeah, just AddBatch with offset+size, and making BatchDrawCommandIndirect.visibleInstancesBufferWindowOffset (plus a size) work. It would also be extremely nice to have CommandBuffer.SetGlobalBuffer/Material.SetBuffer.

My experience mostly comes from an “ideal” perspective on how I would build a custom SRP against the native graphics APIs for performance/flexibility; seeing missing features that I know are fully supported across graphics APIs is frustrating :frowning:

I have a lot of feedback regarding SRP API and/or Unity gfx API limitations, which probably wouldn’t be complicated to fix or extend, so I can share it if needed.
Some of which are:

  1. Per-render-frame callbacks. Right now SRP calls the C# side per render target, which complicates sharing resources like lights. Otherwise it would be possible to push all light culling for all cameras at once, render shadowmaps once, etc., without needless workarounds like hashmaps or other hacks. It’s also extremely complicated to determine frame end in Edit Mode, because there’s no link between the script update loop and redraw. This would be useful for custom texture streaming; I had to implement it with hacks for @guycalledfrank 's game.
  2. Replacing native pointers for Texture2D/Texture3D/Cubemap etc. Right now external textures are only exposed in C# for Texture2D, while it’s entirely possible for 3D and cubemap textures with unofficial calls, but that requires finding engine function pointers through PDB/.so symbols and feels really hacky.
  3. Setting native gfx pointers and/or a GraphicsBuffer for at least Mesh wrappers, so it’s possible to draw a truly custom mesh. For instance, getting the skinned mesh buffer and binding it to a Mesh object to render it again within the same frame. I had to implement that with hacks for @guycalledfrank 's game.
  4. Structured buffer reflection in addition to constant buffer reflection for shader infos. Ideally, make shader reflection a runtime thing instead of editor-only, though that’s easy to fix from the C# side.
  5. “State” properties, i.e. reflection of the “special” material properties that affect render state (like the PSO) and can’t be batched together. Right now I need to add those properties manually to detect them. Actually, a system for detecting whether materials are “batchable” with each other via their binding hashes would be nice; ComputeCRC() hashes all of the properties, not only the state-breaking ones.
  6. Bindless, of course. I’ve implemented MDI and bindless for DX12/Vulkan by hooking DX12 functions in a native plugin (used in TrueTrace now), so it’s definitely a hacked approach that won’t work on consoles. It doesn’t need to be a perfect solution with state tracking, residency management, etc., because that can be delegated to the user (programmer). Just having SetBindlessResources(Texture2D[] resources, int startIndex) plus in-shader support would already be a HUGE boost, even if I have to track resources myself.
  7. An API to retrieve native shader/PSO handles for use in native plugins; otherwise interop with Unity requires a really complicated setup, sometimes including reimplementing the ShaderLab parser, gathering variants, compiling them again, etc. This is especially complicated on consoles.

Thank you again for the response, really appreciate that!


Hello @arnaud-carre, are there any estimates on offset+size BRG bindings? I want to implement GPU-side culling, so it would be great to create a few big indirect-args buffers and use them with offsets; right now it requires creating a new buffer for each fat draw call, which seems really suboptimal.

Also, it’s a huge downside that all the buffers must be allocated with the TempJob allocator and nothing else. That makes it impossible to cache reused data, e.g. when a light hasn’t moved.
For big scenes this can be up to 4 MB per light, which keeps the CPU busy just copying bytes for no reason.