DrawMeshInstancedIndirect Example Comments and Questions

Hi,

Just tested out the example provided for the new DrawMeshInstancedIndirect method from here and it works really well. However, it took a little while to sort out, as it's just the scripts, and I made a few changes regarding the Surface Shader that might be useful to incorporate back into the example.

Firstly some quick notes
Setting up a new project/scene in Unity will likely default to soft shadows, and depending on the quality setting this might include 2 to 4 cascades. This can heavily impact performance, since each cascade requires rendering all the instances again. Unfortunately I was testing in an existing project set to 4 cascades, so the performance was roughly a quarter of what it is with no shadows.

I recommend you initially set the quality level to no cascades, and maybe even test with shadows disabled on the directional light.

It's also best to run in deferred rendering mode, as forward mode appears to take an extra little hit from additional passes, such as the depth pass that is needed when doing shadows or when you use more than one light source. Mind you, I've not tested with multiple lights; I did try with DrawMeshInstanced and found they weren't rendered at all in forward.

I also noticed that, unlike DrawMeshInstanced, Unity is unable to report the total tris/verts in the stats overlay. Unsure if that is a bug or simply not possible?

Example Code
Made a couple of changes here.

Firstly, I moved the SetBuffer() call from Update() to the end of UpdateBuffers(), since the data doesn't change unless you change the instance count. Having it in Update() didn't seem to affect performance, but it seems odd having it there.

Secondly, I added a conditional check in UpdateBuffers() for the instance count being 0, as that will cause errors.
e.g.

if ( instanceCount < 1 ) instanceCount = 1;
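Putting both changes together, the end of UpdateBuffers() might look something like this (just a sketch based on the docs example; the field names are assumed from that script):

```csharp
void UpdateBuffers()
{
    // Guard against a zero (or negative) instance count, which causes errors.
    if (instanceCount < 1)
        instanceCount = 1;

    // ... (re)create positionBuffer and fill it with per-instance data here ...

    // Update the indirect args with the new instance count.
    args[0] = instanceMesh.GetIndexCount(0); // index count per instance
    args[1] = (uint)instanceCount;           // instance count
    argsBuffer.SetData(args);

    // Moved here from Update(): bind the buffer once, since it only
    // changes when the instance count changes.
    instanceMaterial.SetBuffer("positionBuffer", positionBuffer);
}
```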

Surface Shader
In order to correctly render shadows you need to add an additional pragma directive, specifically 'addshadow'.
I imagine for forward shadows you would need to add 'fullforwardshadows' too, but I didn't test that.
e.g.

#pragma surface surf Standard addshadow

Finally, in setup() I changed _Time.x to _Time.y to speed up the rotation of the instanced mesh, as otherwise it can be quite hard to see the movement.

On my GTX970 it was able to render 2 million individually rotating cubes at approx 35 fps; that dropped to 18 fps with (no-cascade) shadows enabled in deferred rendering, which is to be expected.

So now the Questions
I assume that for surface shaders the setup() function is required to derive the ObjectToWorld and WorldToObject matrices as those are no longer being generated or passed in by Unity?

I was able to add a custom vertex method to the surface shader that used 'unity_InstanceID' to grab specific instance data in order to modify the vertex positions, but I have been unable to do the same inside the surf function.

For example, this code:

void surf (Input IN, inout SurfaceOutputStandard o)
{
#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    float4 col = colorBuffer[unity_InstanceID];
#else
    float4 col = float4(0, 0, 0, 1);
#endif

    // Albedo comes from a texture tinted by color
    fixed4 c = tex2D(_MainTex, IN.uv_MainTex) * col;
    o.Albedo = c.rgb;
    // Metallic and smoothness come from slider variables
    o.Metallic = _Metallic;
    o.Smoothness = _Glossiness;
    o.Alpha = c.a;
}

results in the instances being rendered as 'black' (not completely, due to ambient lighting etc). This would suggest that UNITY_PROCEDURAL_INSTANCING_ENABLED is no longer defined for the surf method. If I remove the conditional check then I get the errors:
undefined variable "colorBuffer", undefined variable "unity_InstanceID"

Not sure what I need to do to resolve this?

Finally, at some point I want to investigate per-instance culling by using a compute shader to calculate which instances are within the frustum, and thus fill the per-instance TRS matrices into a ComputeBuffer. I think from memory that is possible using one of the ComputeBuffer types and reading back the count value or something. Would this approach work?


Hey,

Great to see you getting such good results with this new API! And glad our example code got you up and running quickly!

Correct. It's possible in theory, but our current tech relies on knowing the instance count on the CPU, which isn't necessarily true here. While the demo script does know it, this API lets you populate the count on the GPU. Reading this number back to the CPU would be very slow. We may look at solutions for this in the future, though.

Thanks for the fix. I'll update our docs!

Good idea! Although I think I'll just change the GUI slider range in the docs to fix this in the example :)

Thanks again - added to the docs example!

Good idea - updated the docs

Correct. You can set them up however you like. Maybe your instance data buffer will contain full matrices. Maybe you have no need for rotation. Maybe your instance data only contains a theta, to spin the meshes around one axis. Note that this is where we configure the custom rotation in our example. If you didn't need rotation, you could set the matrices up far more efficiently, and probably get even more cubes rendering at a good fps :)

I was able to add this code to the example, in the surf function, and it works fine for me:

#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    c.gb = (float)(unity_InstanceID % 256) / 255.0f;
#else
    c.gb = 0.0f;
#endif

This makes the cubes appear more/less red based on their instance id. Does that code work for you?

Yes, this is possible, and is a great use for this tech! If you have your raw instance data in a ComputeBuffer, you can dispatch a ComputeShader to inspect every item and add each visible instance to an Append Buffer. This Append Buffer would then become the input to DrawMeshInstancedIndirect. The ComputeShader would simply need to load the instance position and test it against the camera frustum (factoring in the size of the instance too, i.e. its Bounding Sphere or AABB). Then you can use ComputeBuffer.CopyCount to copy the culled instance count into the Indirect Args Buffer used for DrawMeshInstancedIndirect.

You could even implement hierarchical culling etc, if you wanted. Itā€™s totally up to you how you filter your instances in your ā€œCullingComputeShaderā€.
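The C# side of that culling flow might look roughly like this (just a sketch; the buffer names and the kernel of the hypothetical "CullingComputeShader" are made up for illustration):

```csharp
// Append buffer holding the visible instances (stride = size of one instance struct).
visibleBuffer = new ComputeBuffer(maxInstances, instanceStride, ComputeBufferType.Append);

// Each frame: reset the append counter, run the culling kernel, then
// copy the appended count into the indirect args buffer.
visibleBuffer.SetCounterValue(0);
cullingShader.SetBuffer(kernel, "allInstances", allInstancesBuffer);
cullingShader.SetBuffer(kernel, "visibleInstances", visibleBuffer);
cullingShader.Dispatch(kernel, (maxInstances + 63) / 64, 1, 1);

// Args layout: index count, instance count, start index, base vertex, start instance.
// The destination offset is in bytes: the instance count lives at 1 * sizeof(uint).
ComputeBuffer.CopyCount(visibleBuffer, argsBuffer, 1 * sizeof(uint));

Graphics.DrawMeshInstancedIndirect(mesh, 0, material, bounds, argsBuffer);
```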

Hope this all helps, good luck, and thanks again for the great feedback!


I apologize in advance if I hijack your thread like this, but I've had some time with DrawMeshInstancedIndirect myself yesterday and found various issues, and I felt it would be best to merge all related issues into one thread rather than opening a new one.

First off, I made very similar changes to the script and moved SetBuffer() down to the UpdateBuffers() method, among other miscellaneous changes. That said, my primary concern was using the new DrawMeshInstancedIndirect API to render rich patches of vegetation, both ground cover and trees. Vegetation assets tend to have multiple submeshes, so I made various changes to support rendering a dynamic number of submeshes, with an option to turn off specific submeshes for debugging purposes. My steps were as follows:

  • Created new ComputeBuffers for each submesh, analogous to the already existing argsBuffer, and fed each with the correct index count of the respective submesh. I figured that alternatively I could reuse the same args buffer for all the submeshes and use the argsOffset parameter to point into it accordingly, but I wanted to make sure there weren't any issues tied to that approach.

  • Created and assigned new materials for each submesh and fed each the positionBuffer respectively.

  • Rendered each submesh using DrawMeshInstancedIndirect with the correct material and args buffer, e.g.:

  • Graphics.DrawMeshInstancedIndirect(instanceMesh, 0, instanceMaterial1, bounds, argsBuffer1, mpb, ShadowCastingMode.On, true);

  • Graphics.DrawMeshInstancedIndirect(instanceMesh, 1, instanceMaterial2, bounds, argsBuffer2, mpb, ShadowCastingMode.On, true);
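For reference, filling the args buffer for a given submesh might look roughly like this (just a sketch; `argsBuffer1` here corresponds to submesh 0):

```csharp
// Indirect args: index count, instance count, start index, base vertex, start instance.
uint[] args = new uint[5] { 0, 0, 0, 0, 0 };
args[0] = instanceMesh.GetIndexCount(0); // index count of submesh 0
args[1] = (uint)instanceCount;
argsBuffer1 = new ComputeBuffer(1, args.Length * sizeof(uint), ComputeBufferType.IndirectArguments);
argsBuffer1.SetData(args);
```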

In my tests I noticed that performance differs greatly depending on the mesh used. Single meshes with no submeshes seemed to generally perform better, and some meshes with submeshes even managed to crash Unity occasionally. I couldn't find a reliable way to reproduce the crash, but I'll keep experimenting.

I also couldn't find an explanation as to why submeshes seem to perform worse, even horribly at times. There seems to be a huge overhead associated with rendering meshes with submeshes, with the frametime occasionally spiking to a multiple of its normal rendering time. Here's a screencap of the profiler showcasing one of the mentioned occasions:


I couldn't manage to break up Gfx.WaitForPresent into more atomic ops, even with the Deep Profiler turned on. And I made sure VSync was turned off as well. I can definitely tie this overhead to the DrawMeshInstancedIndirect method, as turning it off during runtime got rid of any time spent in WaitForPresent.

Another issue I noticed was that DrawMeshInstancedIndirect seems to have issues rendering submeshes properly. In my tests, I rendered the same mesh + 2 submeshes via the default MeshRenderer and via DrawMeshInstancedIndirect, using the same material. Here is a screencap of the model rendered by the default MeshRenderer:


And here are the same meshes rendered via DrawMeshInstancedIndirect, correctly iterating through all their submeshes:

The tree's first two submeshes rendered just fine, while the last submesh seemed to be a subset of the first submesh. It's rather odd. I managed to reproduce the issue with every mesh I had that featured submeshes, and it was always the last submesh that did not render properly.

I will spend more time experimenting with the new instancing API and report back whenever I find new oddities. The issues reported in this post will be filed in a bug report later today with the project I prepared.

I'm wondering if anyone else can confirm my sightings.


Hey, thanks for the help.

So this is weird. I tried your example and it does work, so I went back over my version and was still unable to get it to work until I explicitly set the else condition to be non-black!

e.g.
This results in all black cubes

float4 col = 1.0f;

#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        col = colorBuffer[unity_InstanceID];
#else
        col = float4(0, 0, 0, 1);
#endif

fixed4 c = tex2D(_MainTex, IN.uv_MainTex) * col;

Yet this, where I simply make the else color blue, results in all differently colored cubes:

float4 col = 1.0f;

#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        col = colorBuffer[unity_InstanceID];
#else
        col = float4(0, 0, 1, 1);
#endif

fixed4 c = tex2D(_MainTex, IN.uv_MainTex) * col;

This is based on passing a StructuredBuffer to the shader, using simple C# code to make a random color:

colors[i]        = new Vector4( Random.value, Random.value, Random.value, 1f );

<snip>

colorBuffer.SetData(colors);
instanceMaterial.SetBuffer("colorBuffer", colorBuffer);

I can only assume some weird edge case in the shader compiler is happening here?

Don't forget to update the pad input code too then, since it's clamped to 0 as well.

Can't help with the submesh issue as I've avoided using them; in fact I had assumed up till recently they wouldn't be supported with instancing, though I'm not sure why I thought so.

As for Gfx.WaitForPresent, I wouldn't worry about it. Unless I'm wrong (please someone correct me if I am), that normally suggests you are GPU bound, which would be expected when using instancing. As long as your overall framerate is above your desired value (e.g. 60 fps), it should just mean you have CPU time to burn for whatever purposes you'd like. If you are below your target framerate then you are simply pushing the GPU too hard and will have to cut back.

For example, most of my R&D with this is towards extreme numbers of fully skinned meshes, each with a unique animation (yeah, been playing too much Total War: Warhammer). Whilst developing with DrawMeshInstanced I noticed my GPU fans start up immediately, and checking the GPU I could see it jump instantly to 100% usage, something I have rarely if ever seen in any of my other Unity projects, or even in many games. Of course this is mostly happening due to disabling v-sync, so the GPU simply pushes as hard as it can, and as instancing is so performant, pushing millions of verts/tris per frame (I think I was pushing 35 million tris / 35 million verts, all matrix-palette skinned, per frame at 80 fps on my GTX970), the GPU is for once fully utilised, and thus takes longer than normal (or more specifically, longer than the CPU) to render a frame.

Yeah that is weird!

It seems that we have a bug in the DrawMeshInstancedIndirect code that results in an incorrect submesh being rendered. Will fix asap.


I don't mean to push, but is there any ETA on this? I couldn't find anything in the changelogs so far that would address this as fixed.

I came across a few more bugs:

862862 ComputeBuffer CopyCount Broken
As far as I can tell, CopyCount into an argument buffer for an AppendBuffer appears broken, or I'm using it very wrongly. This is awkward for DrawMeshInstancedIndirect, since the whole point is to avoid any GPU readbacks.

The current workaround is to use a readback, i.e. CopyCount into a new ComputeBuffer, then GetData() on it, and finally put the value (instances) into the real argument array and copy that to the argument buffer.
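That readback workaround looks roughly like this (just a sketch; `countBuffer` is a hypothetical one-element buffer, and `args`/`argsBuffer` are the usual indirect-args array and buffer):

```csharp
// Copy the append buffer's counter into a one-element raw buffer...
ComputeBuffer countBuffer = new ComputeBuffer(1, sizeof(uint), ComputeBufferType.Raw);
ComputeBuffer.CopyCount(appendBuffer, countBuffer, 0);

// ...read it back to the CPU (this is the readback we'd like to avoid)...
uint[] count = new uint[1];
countBuffer.GetData(count);

// ...and write it into the real indirect args on the CPU side.
args[1] = count[0]; // instance count slot
argsBuffer.SetData(args);
```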

864476 DrawMeshInstanceIndirect: Shadow Passes Broken
This is a bit of a weird edge case, where it would appear from investigation that there are situations where Unity's generated shadow pass drawmesh calls are passing the wrong ComputeBuffer data.

Specifically, I have one compute shader that determines the indices of the instances within the frustum and places each index into one of four AppendBuffers. This way I can make four calls to DrawMeshInstancedIndirect(), each providing a specific mesh and a specific 'valid instance indices look-up array', to draw all four mesh LODs. This works fine for rendering the instances, but enable shadows and you get the results in the attachment below.

What appears to be happening is that each Unity shadow pass drawMesh call is using the same indices AppendBuffer, instead of the correct one. It's almost like the last value assigned in the normal drawMesh calls is used instead.

Again, I was able to work around this by adding a fifth AppendBuffer that stores all indices within the frustum and then drawing them all with a single LOD mesh.

863817 ComputeShaders: No SetVectors() or SetBools()
I noticed that for ComputeShaders there are no SetVectors() or SetBools() methods. Also there isn't a uint type, but I'm unsure if that is required, or if simply passing an int will automatically get cast?

Finally, I ran into a number of issues when switching from making one DrawMeshInstancedIndirect call to four, one for each LOD. Initially I just updated the code to run in a loop, but it didn't work. I suspect Unity didn't preserve the changes I made to the material's buffers between each draw call being cached. This sounds logical, so I figured MaterialPropertyBlocks would be the way to go. However, using one block had the same problems. I guess maybe you have to use multiple MPBs? In the end I went simple and just created duplicates of the material and assigned the buffers to each copy. I need to look at this again, as I can't help feeling I'm missing something.
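The material-duplication workaround described above might look something like this (just a sketch; `lodMeshes`, `indexBuffers` and `argsBuffers` are hypothetical per-LOD arrays):

```csharp
// One material instance per LOD, each bound to its own index buffer,
// so a later SetBuffer() call doesn't clobber the earlier draws.
for (int lod = 0; lod < 4; lod++)
{
    lodMaterials[lod] = new Material(instanceMaterial);
    lodMaterials[lod].SetBuffer("visibleIndexBuffer", indexBuffers[lod]);
    Graphics.DrawMeshInstancedIndirect(lodMeshes[lod], 0, lodMaterials[lod], bounds, argsBuffers[lod]);
}
```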


Did you use the example code under Start() for creating your argument buffer? The example just looks odd to me, in particular the second ComputeBuffer argument, for stride.
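If I'm reading the example right, the unusual-looking part is the constructor (shown here from memory, so check against the current docs):

```csharp
// The five uints form a single "element", so the buffer has a count of 1
// and a stride of the whole 20-byte argument struct - which is why the
// second (stride) argument can look odd at first glance.
argsBuffer = new ComputeBuffer(1, 5 * sizeof(uint), ComputeBufferType.IndirectArguments);
```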

Finally got around to installing this beta and implementing DrawMeshInstancedIndirect in my own scene to test the performance gains. It lowered my CPU load by 1.5ms or so over DrawMeshInstanced. I'm pleased with that!

I noticed everyone in this thread relocated the SetBuffer() call from Update() to the end of UpdateBuffers(), and even the documentation was updated to reflect that. However, without the SetBuffer() call in Update() my objects would get cleared from the screen randomly, or any time I took the window focus away from Unity and back again.
I'm sure someone from Unity is thinking: "Now I remember, that's why it was in Update()"...
I'd love to know the real technical reason why this happens, however, or if there's another workaround.

I'm filling the buffer with a positional Matrix4x4 for each instance (w/ non-uniform scaling) and was wondering if there is a "proper" way of getting the inverse of this for unity_WorldToObject within setup()? I did find a function that was graciously posted on the forums that works fine; however, I'm curious whether anyone knows the correct or most optimal way of obtaining this.

I think what we also need is a function to create a 4x4 TRS matrix inside a shader, allowing us to populate the buffers with only raw transform data to reduce their byte size. I've written my own shader function to achieve this for now, but it only allows for uniform scaling. Also, I honestly think there's no need to re-invent the wheel here; exposing such a function in UnityCG.cginc would benefit everyone.
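For what it's worth, until such a helper exists, both matrices can be built per instance on the CPU before upload (just a sketch; `matrices`/`inverses` are hypothetical arrays that feed the instance buffer):

```csharp
// Build object-to-world and its inverse once per instance on the CPU,
// rather than per vertex in the shader.
Matrix4x4 o2w = Matrix4x4.TRS(position, rotation, scale); // handles non-uniform scale
matrices[i] = o2w;
inverses[i] = o2w.inverse; // upload both; setup() then just assigns them
```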

Quick update on the assumed CopyCount bug. It wasn't a bug, but silly user error; I had got confused between the argument array and the argument buffer, so I had used the dstOffset as an element index instead of an offset in bytes. I.e. I had used a value of 1 instead of 1 * sizeof(uint)!

A damn silly mistake that I just never caught, despite looking at the line over and over. On the plus side, due to the bug report Unity will amend the method declaration to state dstOffsetBytes, improve the documentation, and possibly add a check to ensure the offset is a multiple of 4.
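For anyone else hitting this, the difference is a single argument:

```csharp
// Wrong: 1 is interpreted as a byte offset, not an element index,
// so the count lands misaligned inside the first arg (the index count).
// ComputeBuffer.CopyCount(appendBuffer, argsBuffer, 1);

// Right: the instance count is the second uint, i.e. 4 bytes in.
ComputeBuffer.CopyCount(appendBuffer, argsBuffer, 1 * sizeof(uint));
```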

I've applied the fix and it did provide a performance increase, though not quite as much as I had hoped for. I need to profile further, as I was sure that using the readback method to set the count was stalling the GPU significantly.

Oh, and for reference to anyone else wondering about this, here is the reply to my question about the lack of SetVectors and SetBools. It seems to make sense.


Not sure about the randomly-getting-cleared-from-screen bit, but on my machine I've found that when using buffers, if I alt-tab out of the Unity editor, or if running windowed and I switch to another application, I often lose the effect of the shader, as if the buffer has been lost. I wonder if that is the same situation here. The weird thing is that on a colleague's machine with a different GPU of the same brand (nvidia), this doesn't happen.

I'm really unclear as to the responsibility of the developer in this case. Should we be constantly resetting the buffer every frame or not? It seems silly to think that we should, but I believe there are cases where we currently might have to. Perhaps this is an area Unity can look into more.

This might be a good idea; I'm pretty sure I did some work on this too, as well as playing around with whether or not the inverse was needed for specific shaders.

However, I don't think it's a good idea in general to generate it (a proper inverse) in the shader, since it (I assume) must create quite a large overhead, as it will need to be re-generated for every vertex, and generating a true inverse is a costly function to begin with.

Obviously for some specific cases there is no choice and it has to be generated in the shader, but I think wherever possible it would be more advantageous to generate it on the C# side. Perhaps for best overall efficiency do it in a ComputeShader, so we gain the speed of GPU calculations, as well as avoiding having to pass the data from CPU to GPU, and of course it is only done per matrix instead of per vertex as it would be in a shader.

It's a trade-off, so it's probably not possible to say what's best across different use-cases, e.g.:

  • a low-poly mesh vs a high poly mesh
  • the cost of cpu matrix inversions vs. gpu (including bandwidth to upload the constant data)

I'd be inclined to try generating the inverses in a compute shader, so you can keep the logic/data on the GPU while still doing the inversion once per instance instead of per vertex. But this still increases memory bandwidth usage compared to the pure ALU solution of doing it in the vertex shader... so many variables to consider :)

If it helps, here is a vertex-shader solution I've been using during prototyping for something else... it may contain bugs, and may not be as fast as it could be. It does show that there is quite a lot of ALU required to do this task though :(

            // transform matrix
            unity_ObjectToWorld._11_21_31_41 = float4(data.transform._11_21_31, 0.0f);
            unity_ObjectToWorld._12_22_32_42 = float4(data.transform._12_22_32, 0.0f);
            unity_ObjectToWorld._13_23_33_43 = float4(data.transform._13_23_33, 0.0f);
            unity_ObjectToWorld._14_24_34_44 = float4(data.transform._14_24_34, 1.0f);

            // inverse transform matrix
            float3x3 w2oRotation;
            w2oRotation[0] = unity_ObjectToWorld[1].yzx * unity_ObjectToWorld[2].zxy - unity_ObjectToWorld[1].zxy * unity_ObjectToWorld[2].yzx;
            w2oRotation[1] = unity_ObjectToWorld[0].zxy * unity_ObjectToWorld[2].yzx - unity_ObjectToWorld[0].yzx * unity_ObjectToWorld[2].zxy;
            w2oRotation[2] = unity_ObjectToWorld[0].yzx * unity_ObjectToWorld[1].zxy - unity_ObjectToWorld[0].zxy * unity_ObjectToWorld[1].yzx;

            float det = dot(unity_ObjectToWorld[0], w2oRotation[0]);
   
            w2oRotation = transpose(w2oRotation);

            w2oRotation *= rcp(det);

            float3 w2oPosition = mul(w2oRotation, -unity_ObjectToWorld._14_24_34);

            unity_WorldToObject._11_21_31_41 = float4(w2oRotation._11_21_31, 0.0f);
            unity_WorldToObject._12_22_32_42 = float4(w2oRotation._12_22_32, 0.0f);
            unity_WorldToObject._13_23_33_43 = float4(w2oRotation._13_23_33, 0.0f);
            unity_WorldToObject._14_24_34_44 = float4(w2oPosition, 1.0f);

The world matrix is uploaded as a float3x4 transform;


I believe this is the same issue I'm encountering, and I also suspected it may be dependent upon the GPU used. In my case it happens consistently on an AMD R9 series GPU.


Do you happen to have a simple example project you could log as a bug with Unity? The only project where I noticed this is rather large and complex, so not a good case study. Even if it's not an explicit bug, reporting it might get some official stance on dealing with it.


I unfortunately don't have a simple example project with this bug either. Like you, the main project I'm seeing this occur in is way too large and complex to submit. Which is likely a determining factor…
What's interesting is that within this main project, a new simple scene with just a camera and Unity's example script/shader in it will give the same disappearing act. However, in a completely new project I'm unable to duplicate the issue, even with the same scene and project settings configured.
I'm afraid I may have to resort to removing assets/plugins one by one in my main project in hopes of tracking this down, or just leave my workaround in place for now and hope it magically fixes itself later.


I just hit on an interesting issue using 5.6.0b4 with the example project: GitHub - noisecrime/Unity-InstancedIndirectExamples: Exploring Unity 5.6 InstanceIndirect Method to render large numbers of meshes

In the editor with 1.1m instances, vsync off, and the profiler recording (deep mode disabled), I get 2.7ms CPU times for about 340fps. As soon as I disable recording in the profiler, the CPU time doubles to 5ms and the framerate drops to ~180fps. Maybe that is reproducible/related to some of the other issues we are seeing?