Graphics.DrawMeshInstancedIndirect's Setup Function Caching

Hi there,

I’m using Graphics.DrawMeshInstancedIndirect to draw a ton of cubes. The cubes don’t move.

I am calculating the unity_ObjectToWorld and unity_WorldToObject matrices on the GPU in a compute shader and caching them. Then, in the InstancedIndirect shader’s setup function, I read the cached matrices rather than recomputing them.

I assumed this would be faster than simply recomputing the matrices in the setup function, but it is not.

This leads me to believe that the setup function is performing some kind of caching, or at least has some intelligence behind how often it is called. But I can’t figure out how the InstancedIndirect shader would know when not to call setup… So I’m totally perplexed at how recomputing the ObjectToWorld and WorldToObject matrices in setup is as fast as simply reading pre-computed values.

It comes down to 3 options:

  • Computing the matrices is a super fast operation, as fast as reading the cached matrices from memory
  • The setup function isn’t called every time I call Graphics.DrawMeshInstancedIndirect (this can’t be true, right?)
  • The bottleneck is elsewhere (probably shadow casting)

For some numbers: rendering 1,000,000 cubes in the editor AND in the game view, with shadow casting and receiving on, takes about 250 ms on my 5-year-old Lenovo laptop with an Nvidia 750M.

Rendering 1,000,000 cubes in the editor to just the game view, with no shadows, takes about 33 ms.

I was under the impression that caching the objectToWorld, worldToObject matrices would speed this up, but it does NOT. Why is that?

Add a Profiler.BeginSample around the compute shader calculations and look in the GPU profiler to see how long the million matrix calculations actually take.


33ms… maybe just a coincidence but have you got vsync on?

Just a wild guess: maybe your original matrix computation is rather simple, with just one or two vectors fetched from the compute buffer? That is probably faster than fetching 2 matrices, because modern GPUs usually do arithmetic operations very fast compared to accessing buffers. That is true even if you access more buffers later in the vertex program, because the GPU can then do the arithmetic operations in parallel while waiting for the results of the resource accesses.


33ms… maybe just a coincidence but have you got vsync on?

No, it’s actually a coincidence.

Also a coincidence: I’m using your inverse matrix calculation from this thread.

Here’s the setup function:

        #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
            StructuredBuffer<int4> gridKeys;
        #endif

        void setup()
        {
        #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
            int4 voxel             = gridKeys[unity_InstanceID];
            int3 gridKey         = voxel.xyz;
            float3 halfs         = {0.5f, 0.5f, 0.5f};
            float3 position     = _CellSize * (gridKey + halfs);

            // object-to-world: uniform scale by _CellSize, translation to the cell centre
            float4x4 objectToWorld = {
                _CellSize,    0,            0,            position.x,
                0,            _CellSize,    0,            position.y,
                0,            0,            _CellSize,    position.z,
                0,            0,            0,            1
            };
            unity_ObjectToWorld = objectToWorld;
        
            // inverse transform matrix
            // taken from richardkettlewell's post on
            // https://forum.unity3d.com/threads/drawmeshinstancedindirect-example-comments-and-questions.446080/

            float3x3 w2oRotation;
            w2oRotation[0] = unity_ObjectToWorld[1].yzx * unity_ObjectToWorld[2].zxy - unity_ObjectToWorld[1].zxy * unity_ObjectToWorld[2].yzx;
            w2oRotation[1] = unity_ObjectToWorld[0].zxy * unity_ObjectToWorld[2].yzx - unity_ObjectToWorld[0].yzx * unity_ObjectToWorld[2].zxy;
            w2oRotation[2] = unity_ObjectToWorld[0].yzx * unity_ObjectToWorld[1].zxy - unity_ObjectToWorld[0].zxy * unity_ObjectToWorld[1].yzx;

            float det = dot(unity_ObjectToWorld[0].xyz, w2oRotation[0]);

            w2oRotation = transpose(w2oRotation);

            w2oRotation *= rcp(det);

            float3 w2oPosition = mul(w2oRotation, -unity_ObjectToWorld._14_24_34);

            unity_WorldToObject._11_21_31_41 = float4(w2oRotation._11_21_31, 0.0f);
            unity_WorldToObject._12_22_32_42 = float4(w2oRotation._12_22_32, 0.0f);
            unity_WorldToObject._13_23_33_43 = float4(w2oRotation._13_23_33, 0.0f);
            unity_WorldToObject._14_24_34_44 = float4(w2oPosition, 1.0f);
        #endif
        }
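For this particular matrix, the general inverse above is more work than strictly needed: the transform is only a uniform scale by _CellSize followed by a translation, so its inverse is a scale by 1/_CellSize and a translation by -position/_CellSize. A minimal sketch of setup() under that assumption follows. It reuses gridKeys and _CellSize from the post, assumes _CellSize is a non-zero float declared elsewhere in the shader, and would no longer hold if rotation or non-uniform scale were added.

```hlsl
#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    StructuredBuffer<int4> gridKeys;
#endif

void setup()
{
#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
    // One 16-byte fetch per call: the integer grid coordinate of this cube.
    int3 gridKey = gridKeys[unity_InstanceID].xyz;

    // Centre of the grid cell in world space.
    float3 position = _CellSize * (gridKey + 0.5f);

    // Object-to-world: uniform scale by _CellSize, then translate to the cell centre.
    unity_ObjectToWorld = float4x4(
        _CellSize, 0,         0,         position.x,
        0,         _CellSize, 0,         position.y,
        0,         0,         _CellSize, position.z,
        0,         0,         0,         1);

    // Closed-form inverse (assumption: uniform scale + translation only, _CellSize != 0):
    // scale by 1/_CellSize and translate by -position/_CellSize.
    float invSize = rcp(_CellSize);
    float3 invPosition = -position * invSize;
    unity_WorldToObject = float4x4(
        invSize, 0,       0,       invPosition.x,
        0,       invSize, 0,       invPosition.y,
        0,       0,       invSize, invPosition.z,
        0,       0,       0,       1);
#endif
}
```

That leaves one int4 fetch and a handful of multiplies per call, which would be consistent with the computed path not losing to the cached one.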

And the setup function using the cached scheme:

        #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
            StructuredBuffer<float4x4> matrices;
            StructuredBuffer<float4x4> inverseMatrices;
        #endif

        void setup()
        {
        #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
            unity_ObjectToWorld = matrices[unity_InstanceID];
            unity_WorldToObject = inverseMatrices[unity_InstanceID];
        #endif
        }
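The caching pass itself is not shown in the thread. The following is a rough sketch of what such a compute kernel could look like, purely as an assumption: the kernel name BuildMatrices, the _InstanceCount uniform and the thread-group size of 64 are hypothetical, while gridKeys, matrices, inverseMatrices and _CellSize follow the names used in the posts. The inverse is written in closed form, which is only valid because the transform is a uniform scale plus a translation.

```hlsl
#pragma kernel BuildMatrices

StructuredBuffer<int4>        gridKeys;         // one voxel key per instance
RWStructuredBuffer<float4x4>  matrices;         // cached object-to-world matrices
RWStructuredBuffer<float4x4>  inverseMatrices;  // cached world-to-object matrices

float _CellSize;        // assumed non-zero
uint  _InstanceCount;   // hypothetical: number of valid entries in gridKeys

[numthreads(64, 1, 1)]
void BuildMatrices(uint3 id : SV_DispatchThreadID)
{
    // Guard against the last, partially filled thread group.
    if (id.x >= _InstanceCount)
        return;

    // Centre of the grid cell in world space.
    float3 position = _CellSize * (gridKeys[id.x].xyz + 0.5f);

    // Object-to-world: uniform scale by _CellSize, translation to the cell centre.
    matrices[id.x] = float4x4(
        _CellSize, 0,         0,         position.x,
        0,         _CellSize, 0,         position.y,
        0,         0,         _CellSize, position.z,
        0,         0,         0,         1);

    // World-to-object: scale by 1/_CellSize, translation by -position/_CellSize.
    float invSize = rcp(_CellSize);
    float3 invPosition = -position * invSize;
    inverseMatrices[id.x] = float4x4(
        invSize, 0,       0,       invPosition.x,
        0,       invSize, 0,       invPosition.y,
        0,       0,       invSize, invPosition.z,
        0,       0,       0,       1);
}
```

With this layout each setup() call reads two float4x4 values (2 × 64 = 128 bytes) instead of one int4 (16 bytes), which is the extra buffer traffic the reply above is pointing at.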