Hi, I’m updating from an older version of RenderMeshSystem to the latest and find that RenderMeshSystemV2 spends ~80ms to cull all the objects because I have a lot of them. I understand that in MegaCity culling is a must because the meshes are complicated. But in my situation meshes are quite simple so hw culling is enough and not worth it to cull on CPU. Are there ways to turn off the software culling or tags/comps to directly draw entities?
I, too, am curious about this. I’ve been trying to use Disabled tags to get RenderMeshSystemV2 to ignore some of my entities, but this doesn’t seem to have much of an impact on performance.
Anyone figured this out? I know there’s the FrozenRenderSceneTag component, but that only works for entities that don’t move. I’d like to never cull even moving entities, because just like OP, my meshes are simple.
My understanding was that any entity with a WorldRenderBounds component would be checked for culling, so removing this would solve it.
Removing either WorldRenderBounds or ChunkWorldRenderBounds has no effect; they are automatically added back in by some backend system. Those two components only go away if I remove LocalToWorld, but then the entity is no longer rendered.
I took a look at the CacheMeshBatchRendererGroup() method inside the RenderMeshSystemV2 class. That’s the method that’s causing this huge culling lag. I tried modifying it every which way for about an hour before I gave up with no good results.
All I want is to have no culling… It’s the sole reason why RenderMesh is performing so badly at a medium scale. Surely rendering 20k entities with a simple quad is faster than doing this time consuming culling operation on the main thread.
Why are we trying to hack this? Can’t you just disable the system if you don’t want it to run?
this.World.GetOrCreateSystem<RenderBoundsUpdateSystem>().Enabled = false;
That system is not the problem, but I tried your suggestion nonetheless. Nothing changed.
Like I said, the method that is killing the CPU is CacheMeshBatchRendererGroup() inside the RenderMeshSystemV2 class. More specifically, this code inside that method:
Profiler.BeginSample("Add New Batches");
{
    var sortedChunkIndex = 0;
    for (int i = 0; i < sharedRenderCount; i++)
    {
        var startSortedChunkIndex = sortedChunkIndex;
        var endSortedChunkIndex = startSortedChunkIndex + sharedRendererCounts[i];
        while (sortedChunkIndex < endSortedChunkIndex)
        {
            var chunkIndex = sortedChunkIndices[sortedChunkIndex];
            var chunk = chunks[chunkIndex];
            var rendererSharedComponentIndex = chunk.GetSharedComponentIndex(RenderMeshType);
            var editorRenderDataIndex = chunk.GetSharedComponentIndex(editorRenderDataType);
            var editorRenderData = m_DefaultEditorRenderData;
            if (editorRenderDataIndex != -1)
                editorRenderData = EntityManager.GetSharedComponentData<EditorRenderData>(editorRenderDataIndex);
            var remainingEntitySlots = 1023;
            var flippedWinding = chunk.Has(meshInstanceFlippedTagType);
            int instanceCount = chunk.Count;
            int startSortedIndex = sortedChunkIndex;
            int batchChunkCount = 1;
            remainingEntitySlots -= chunk.Count;
            sortedChunkIndex++;
            while (remainingEntitySlots > 0)
            {
                if (sortedChunkIndex >= endSortedChunkIndex)
                    break;
                var nextChunkIndex = sortedChunkIndices[sortedChunkIndex];
                var nextChunk = chunks[nextChunkIndex];
                if (nextChunk.Count > remainingEntitySlots)
                    break;
                var nextFlippedWinding = nextChunk.Has(meshInstanceFlippedTagType);
                if (nextFlippedWinding != flippedWinding)
                    break;
#if UNITY_EDITOR
                if (editorRenderDataIndex != nextChunk.GetSharedComponentIndex(editorRenderDataType))
                    break;
#endif
                remainingEntitySlots -= nextChunk.Count;
                instanceCount += nextChunk.Count;
                batchChunkCount++;
                sortedChunkIndex++;
            }
            m_InstancedRenderMeshBatchGroup.AddBatch(tag, rendererSharedComponentIndex, instanceCount, chunks, sortedChunkIndices, startSortedIndex, batchChunkCount, flippedWinding, editorRenderData);
        }
    }
}
Profiler.EndSample();
Edit: I’m not personally using RenderMesh since I have my own rendering system now, but man, it would be cool if this wasn’t a bottleneck, because my solution only works for a very specific use case: a 2D game where similar entities cannot change material properties.
That backend system is RenderBoundsUpdateSystem so I was just responding to that. Without RenderBoundsUpdateSystem WorldRenderBounds will not be added to entities.
Anyway have you benchmarked this at runtime out of interest?
Ah okay gotcha.
Anyway yes, but not too in depth. I noticed a few things:
- Batching only occurs every frame if you have a job that accesses a Translation component without ReadOnly and that also runs every frame. Even if you don’t actually change the Translation component on any entity, simply not marking it ReadOnly will make the batching run. I’m not 100% certain about this, but I strongly think that is the case. I think that’s why no matter how many building entities I create (10k, 20k, 50k), the render system doesn’t care about them: they can only be moved on user input or destroyed via an event, so the jobs that can change their translation don’t run every frame. The RenderMesh system takes almost 0 ms on them, so I’m fine with using it for those entities. However, that can’t apply to unit entities, because their Translation component can change every frame.
- I noticed that turning off RenderBoundsUpdateSystem before creating entities stops them from being rendered and also stops the batching, like you said it would, but turning it off after they were created does nothing: batching still occurs. Toggling it off/on appears to have no effect. There is one more thing I have yet to try: updating RenderBoundsUpdateSystem manually instead of every frame, combined with adding/removing those troublesome components when needed. I doubt very much this would work, so I won’t attempt it, as I’m satisfied with my own solution. My own solution takes about a quarter of the time with batching.
- Nothing changes between Editor and build. Batching takes almost as long in a build as it does in the Editor.
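The first observation fits how change tracking is generally understood to work in Entities: asking for write access to a component marks the chunk as changed whether or not anything is actually written. Below is a minimal sketch of a job that only reads Translation. It assumes the IJobForEach / JobComponentSystem API from the Entities previews of that period (later versions replaced it), and DistanceFromOrigin is a hypothetical component invented purely for this illustration.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Entities;
using Unity.Jobs;
using Unity.Mathematics;
using Unity.Transforms;

// Hypothetical component, not part of any Unity package.
public struct DistanceFromOrigin : IComponentData
{
    public float Value;
}

public class DistanceFromOriginSystem : JobComponentSystem
{
    [BurstCompile]
    struct MeasureJob : IJobForEach<Translation, DistanceFromOrigin>
    {
        // [ReadOnly] requests read access only, so this job should not
        // flag Translation as changed. DistanceFromOrigin is written.
        public void Execute([ReadOnly] ref Translation translation,
                            ref DistanceFromOrigin distance)
        {
            distance.Value = math.length(translation.Value);
        }
    }

    protected override JobHandle OnUpdate(JobHandle inputDeps)
    {
        return new MeasureJob().Schedule(this, inputDeps);
    }
}
```

Whether this alone keeps the batching from re-running depends on every other system that touches Translation, so it is worth confirming in the profiler.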
Earlier when I was attempting to hack my way through the system, I was trying to just say ‘ok don’t batch, just please render all chunks no matter where they are’ because that’s what my own implementation does anyway.
Coming from HybridRenderer tips and tricks?
I decided to take a deep dive into this code to figure out how it could be optimized. This isn’t an issue that affects me personally since my moving entities use GPU animation instancing, but since I figured out a means to optimize it, I figured I may as well describe it (don’t care enough personally to test it). As it turns out, most of this culprit code is actually totally Burst-able, which I suspect would drastically reduce this performance issue.
Inside the Add New Batches sample, there are three loops, forA, whileB, and whileC, where whileC is nested inside whileB. Inside whileB, there are two managed calls that we want to get out of these loops.
The first is EntityManager.GetSharedComponentData<EditorRenderData>(editorRenderDataIndex). The result is only used in the second managed call, so we can just store that index somewhere else and make that call later.
The second is m_InstancedRenderMeshBatchGroup.AddBatch. This function requires two arrays that are defined outside our loops and not modified inside them. It requires the shared component, which we know how to get by index (an index of -1 means the default). tag is constant in our loops as well. That just leaves us with 4 ints and a bool for arguments. Those are all blittable, which means we can pack them into a struct and store them in a NativeList.
Now all our loop code can be dropped into an IJob with a writable NativeList and a few ReadOnly NativeArrays. The EntityManager call can be removed. And the AddBatch call can be replaced with adding to the NativeList. Tack on [BurstCompile] to this job, create an instance where these loops used to be, and call Run() on it.
After the job, you can loop through the list, call GetSharedComponentData for each entry (an index of -1 should get the default instead), and then call AddBatch. Lastly, don’t forget to dispose the NativeList.
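The pattern described above can be sketched roughly as follows. This is a simplified illustration, not the actual RenderMeshSystemV2 code: PendingBatch, CollectBatchesJob and CollectBatchesExample are hypothetical names, the job takes a plain array of per-chunk entity counts instead of ArchetypeChunk data, and it leaves out the per-RenderMesh grouping and the flipped-winding and editor-data checks. A byte is used for the winding flag only to sidestep any doubt about bool in native containers.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

// Hypothetical record: what AddBatch needs besides the two arrays and the tag.
public struct PendingBatch
{
    public int RendererSharedComponentIndex;
    public int EditorRenderDataIndex; // -1 means "use the default EditorRenderData"
    public int InstanceCount;
    public int StartSortedIndex;
    public int BatchChunkCount;
    public byte FlippedWinding;       // 1 = flipped (unused in this sketch)
}

[BurstCompile]
public struct CollectBatchesJob : IJob
{
    // Assumption for the sketch: entity count of each chunk, in sorted order.
    [ReadOnly] public NativeArray<int> SortedChunkEntityCounts;
    public NativeList<PendingBatch> Batches;

    public void Execute()
    {
        var sortedChunkIndex = 0;
        while (sortedChunkIndex < SortedChunkEntityCounts.Length)
        {
            var batch = new PendingBatch
            {
                EditorRenderDataIndex = -1,
                StartSortedIndex = sortedChunkIndex,
                InstanceCount = SortedChunkEntityCounts[sortedChunkIndex],
                BatchChunkCount = 1
            };
            var remainingEntitySlots = 1023 - batch.InstanceCount;
            sortedChunkIndex++;

            // Greedily append following chunks while they still fit.
            while (remainingEntitySlots > 0 &&
                   sortedChunkIndex < SortedChunkEntityCounts.Length)
            {
                var nextCount = SortedChunkEntityCounts[sortedChunkIndex];
                if (nextCount > remainingEntitySlots) break;
                remainingEntitySlots -= nextCount;
                batch.InstanceCount += nextCount;
                batch.BatchChunkCount++;
                sortedChunkIndex++;
            }

            // Replaces the managed AddBatch call inside the loop.
            Batches.Add(batch);
        }
    }
}

public static class CollectBatchesExample
{
    // Runs the job on the main thread and returns the total instance count.
    public static int Collect(NativeArray<int> sortedChunkEntityCounts)
    {
        var batches = new NativeList<PendingBatch>(Allocator.TempJob);
        new CollectBatchesJob
        {
            SortedChunkEntityCounts = sortedChunkEntityCounts,
            Batches = batches
        }.Run();

        var totalInstances = 0;
        for (int i = 0; i < batches.Length; i++)
        {
            // The managed work belongs here: resolve EditorRenderDataIndex
            // (default for -1), then call AddBatch with the stored values.
            totalInstances += batches[i].InstanceCount;
        }

        batches.Dispose();
        return totalInstances;
    }
}
```

Because every batch holds at least one chunk, the list can never grow beyond the number of chunks, which also gives a safe upper bound if a fixed-size NativeArray is preferred over a NativeList.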
If anyone with this performance issue is brave enough to try this out, I would love to see what kind of performance gains you get, if anything.
I would try it out if you could tell me how I can change such a system (or even just tell me which system it is :P)
I forget the details about embedding a package as it has been a while since I have had to do it, but it essentially involves copying the package out of the Library/PackageCache and pasting it into your packages folder in your project and then updating your manifest.json. I remember someone wrote a script to automatically embed a package for you in the PackageManager forums.
As for editing the code, the system is RenderMeshSystemV2.cs. Just search for Profiler.BeginSample("Add New Batches"); and that will take you directly to the spot.
OK, I just got it “installed”, and added a simple Debug.Log statement to check that it works:
- remove the render package from manifest.json
- copy the package from the lib cache folder somewhere else
- add debug statement
- add package by hand via Package Manager → + → local package
- done
Now I am trying to get it into a job. I will post some updates here to see if it works and if you all can help me with that.
Ok here it is.
So either I am doing something wrong and the code is shit (which it is) or the whole thing is not doing anything at all. I would say it is ~10 fps slower than the old solution. I expected it to be more already.
One big thing I have problems with is setting the array size. I just “randomly” went with 500, but I don’t know how I could calculate this better.
So can you help me here and improve the code a bit? Just point me in the right direction and I will improve on it.
Here is the job:
using System;
using Unity.Burst;
using Unity.Collections;
using Unity.Entities;
using Unity.Jobs;
using Unity.Mathematics;
using Unity.Rendering;
using UnityEngine;

namespace TestFolder.com.unity.rendering._3._4.Unity.Rendering.Hybrid
{
    [BurstCompile]
    public struct BatchingJob : IJob, IDisposable
    {
        // Outputs: one entry per batch, consumed on the main thread after Run().
        [WriteOnly] public NativeArray<bool> NativeFlipped;
        [WriteOnly] public NativeArray<int> NativeEditorRenderDataIndex;
        // x = rendererSharedComponentIndex, y = startSortedIndex,
        // z = instanceCount, w = batchChunkCount
        [WriteOnly] public NativeArray<int4> NativeDataArray1;
        // Single-element array that returns the number of batches written.
        public NativeArray<int> NativeCount;

        // Inputs: same data the original loops read.
        [ReadOnly] public NativeArray<int> SortedChunkIndices;
        [ReadOnly] public int SharedRenderCount;
        [ReadOnly] public ArchetypeChunkComponentType<RenderMeshFlippedWindingTag> MeshInstanceFlippedTagType;
        [ReadOnly] public ArchetypeChunkSharedComponentType<EditorRenderData> EditorRenderDataType;
        [ReadOnly] public NativeArray<int> SharedRendererCounts;
        [ReadOnly] public NativeArray<ArchetypeChunk> Chunks;
        [ReadOnly] public ArchetypeChunkSharedComponentType<RenderMesh> RenderMeshType;

        public void Execute()
        {
            var sortedChunkIndex = 0;
            var index = 0;
            for (int i = 0; i < SharedRenderCount; i++)
            {
                var startSortedChunkIndex = sortedChunkIndex;
                var endSortedChunkIndex = startSortedChunkIndex + SharedRendererCounts[i];
                while (sortedChunkIndex < endSortedChunkIndex)
                {
                    var chunkIndex = SortedChunkIndices[sortedChunkIndex];
                    var chunk = Chunks[chunkIndex];
                    var rendererSharedComponentIndex = chunk.GetSharedComponentIndex(RenderMeshType);
                    var editorRenderDataIndex = chunk.GetSharedComponentIndex(EditorRenderDataType);
                    var remainingEntitySlots = 1023;
                    var flippedWinding = chunk.Has(MeshInstanceFlippedTagType);
                    int instanceCount = chunk.Count;
                    int startSortedIndex = sortedChunkIndex;
                    int batchChunkCount = 1;
                    remainingEntitySlots -= chunk.Count;
                    sortedChunkIndex++;
                    // Append following chunks while they fit and are compatible.
                    while (remainingEntitySlots > 0)
                    {
                        if (sortedChunkIndex >= endSortedChunkIndex) break;
                        var nextChunkIndex = SortedChunkIndices[sortedChunkIndex];
                        var nextChunk = Chunks[nextChunkIndex];
                        if (nextChunk.Count > remainingEntitySlots) break;
                        var nextFlippedWinding = nextChunk.Has(MeshInstanceFlippedTagType);
                        if (nextFlippedWinding != flippedWinding) break;
#if UNITY_EDITOR
                        if (editorRenderDataIndex != nextChunk.GetSharedComponentIndex(EditorRenderDataType))
                            break;
#endif
                        remainingEntitySlots -= nextChunk.Count;
                        instanceCount += nextChunk.Count;
                        batchChunkCount++;
                        sortedChunkIndex++;
                    }
                    // Record the batch instead of calling AddBatch here.
                    NativeFlipped[index] = flippedWinding;
                    NativeDataArray1[index] = new int4(rendererSharedComponentIndex,
                        startSortedIndex, instanceCount, batchChunkCount);
                    NativeEditorRenderDataIndex[index] = editorRenderDataIndex;
                    index++;
                }
            }
            NativeCount[0] = index;
        }

        public void Dispose()
        {
            NativeFlipped.Dispose();
            NativeEditorRenderDataIndex.Dispose();
            NativeDataArray1.Dispose();
            NativeCount.Dispose();
        }
    }
}
And here is how the code is called:
Profiler.BeginSample("Add New Batches");
{
    // Arbitrary fixed capacity: the job overruns these arrays if it produces more than 500 batches.
    var length = 500;
    var job = new BatchingJob()
    {
        NativeCount = new NativeArray<int>(1, Allocator.TempJob),
        NativeEditorRenderDataIndex = new NativeArray<int>(length, Allocator.TempJob),
        NativeDataArray1 = new NativeArray<int4>(length, Allocator.TempJob),
        NativeFlipped = new NativeArray<bool>(length, Allocator.TempJob),
        SharedRenderCount = sharedRenderCount,
        SharedRendererCounts = sharedRendererCounts,
        SortedChunkIndices = sortedChunkIndices,
        Chunks = chunks,
        MeshInstanceFlippedTagType = meshInstanceFlippedTagType,
        EditorRenderDataType = editorRenderDataType,
        RenderMeshType = RenderMeshType
    };
    // Run() executes the job immediately on the main thread.
    job.Run();
    // Managed calls happen here, once per batch recorded by the job.
    for (int i = 0; i < job.NativeCount[0]; i++)
    {
        var editorDataIndex = job.NativeEditorRenderDataIndex[i];
        EditorRenderData editorRenderData = m_DefaultEditorRenderData;
        if (editorDataIndex != -1)
        {
            editorRenderData = EntityManager.GetSharedComponentData<EditorRenderData>(editorDataIndex);
        }
        var data1 = job.NativeDataArray1[i];
        m_InstancedRenderMeshBatchGroup.AddBatch(tag, data1.x,
            data1.z, chunks, sortedChunkIndices, data1.y,
            data1.w, job.NativeFlipped[i],
            editorRenderData);
    }
    job.Dispose();
}
A few things:
- Please share timeline profile screenshots with and without the modifications. Without this, it is impossible for me to know what is going on.
- Please share your jobs settings. Specifically whether or not leak detection and safety checks are enabled or disabled.
- Is there a reason you are using NativeArray instead of NativeList? I don’t think this will make a huge difference other than removing that 500 hardcoded limit.
Lastly, if you are really struggling, if you have a test project you would be willing to share with me, I would be happy to try my hand at it this weekend.
Here is a test project: GitHub - AskMeAgain/JobifiedHybridRenderer
EDIT: I didn’t know that NativeList exists, to be honest…
EDIT2: there is no difference if you use the NativeList… both variants are ~134 fps
I will try to take a look sometime this week!
In the meantime, it would be really helpful if you could share screenshots of your timeline recordings using each method, like what you posted in the third image here: https://imgur.com/a/Dvq8dxc That way I can make sure I am seeing the same issues you are seeing.
In that timeline, you’ll notice that in the latter half of “Add New Batches” you got this sort of comb-looking thing in the profiler. It’s the gaps between the teeth of this comb that we are trying to get rid of.
After some tests it is actually a little bit faster, but definitely not significantly. Maybe 10 fps, and it feels much more “stable”.
So I did some digging, and wow did I dig up some stuff I was not expecting.
So the first thing to note is that those “AddBatch” samples that were acting like the teeth of the comb? Yeah…
Those were only capturing the very first part of the AddBatch method. It turns out it was capturing the native API AddBatch call only. Why? I have no idea. It’s relatively cheap.
Do you see that little gray text block before the first AddBatch? That’s the job we optimized, except in this profile capture I had Burst disabled on that job. Burst does help it out a little, but obviously this part is not the problem.
So what is going on in these AddBatch routines? Well first is the native AddBatch call and some setup, which represents the gaps in my profile capture.
Second is a loop through all the properties of the shader to find which ones are properties declared for instancing in ECS components. In our case, we only have one, which I suspect is the LocalToWorld matrix (or maybe color?). But I didn’t bother to investigate that far. Once it finds the matching property, it gets the property array pointer and initializes that property’s array to the default property value. Anyways, this array initialization work is probably the bulk of this loop’s processing, right?
Nope!
1 millisecond of our precious frametime is spent by Unity trying to find the damn properties!
For anyone who is brave enough to try and fix this, the optimization seems approachable. You’ll need a Dictionary<Shader, int2> and a List<int>. The list contains the ECS TypeIndices of the properties the shaders reference, and the int2 is a start offset and length into that list for the given shader. If the shader is not in the Dictionary, you have to do the extra work of finding its TypeIndices and adding them. But if it is in the Dictionary, what used to be a bunch of string comparisons becomes a few integer lookups that go directly to what matters.
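Here is a rough sketch of that cache, written as plain C# so it compiles without Unity. Everything in it is hypothetical: ShaderPropertyTypeCache is an invented name, the key type is generic where the renderer would use Shader, a small Range struct stands in for int2, and the slowLookup delegate represents the existing string-comparison search, which is not reproduced here.

```csharp
using System;
using System.Collections.Generic;

// Generic over the key so the sketch compiles without Unity;
// in the renderer TShader would be UnityEngine.Shader.
public sealed class ShaderPropertyTypeCache<TShader>
{
    // Stand-in for int2: start offset and length into m_TypeIndices.
    private struct Range
    {
        public int Start;
        public int Length;
    }

    private readonly Dictionary<TShader, Range> m_Ranges = new Dictionary<TShader, Range>();
    private readonly List<int> m_TypeIndices = new List<int>();
    private readonly Func<TShader, IEnumerable<int>> m_SlowLookup;

    // slowLookup is the expensive search that runs once per shader.
    public ShaderPropertyTypeCache(Func<TShader, IEnumerable<int>> slowLookup)
    {
        m_SlowLookup = slowLookup;
    }

    // Returns how many type indices the shader uses; 'start' is where they begin.
    public int GetTypeIndices(TShader shader, out int start)
    {
        if (!m_Ranges.TryGetValue(shader, out var range))
        {
            // First time this shader is seen: pay the slow path and remember it.
            range.Start = m_TypeIndices.Count;
            m_TypeIndices.AddRange(m_SlowLookup(shader));
            range.Length = m_TypeIndices.Count - range.Start;
            m_Ranges.Add(shader, range);
        }

        start = range.Start;
        return range.Length;
    }

    // Reads one cached type index by its position in the shared list.
    public int TypeIndexAt(int index)
    {
        return m_TypeIndices[index];
    }
}
```

A caller would ask for GetTypeIndices(shader, out start) once per batch and then read TypeIndexAt(start) through TypeIndexAt(start + length - 1). The cache would need to be cleared if shaders are reloaded or their properties change, which this sketch does not handle.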
Alright, so that’s half the mystery. Now to the other half, the chunkCount loop.
Here is where it does some gymnastics with the chunks and eventually copies the instanced properties into the shaderProperty arrays. And those copies are what are taking so much time, right?
Again, no.
The matrix copy is quite tiny, and the little guy to the right is the shader property (which I now suspect is color) copy. Which means it must be the code before that matrix copy that is slow.
But why?
I haven’t figured that out yet. There’s no interaction with Shaders or the graphics API at all in this part of the code. It’s just accessing some array pointers from the chunks and then there’s an Add to a NativeMultiHashMap. If anyone is willing to dig deeper, I would love to hear what you find!
But all in all, it seems like there’s room for an order of magnitude of improvement in this code path. I need to get some sleep now. Thanks for reading!
Thanks for your digging!
Just a first idea after reading your text:
Maybe the MultiHashMap is getting rebalanced multiple times and this is making it slower.