Performance issue with DrawMeshInstancedProcedural()

Hi everyone,

I am very new to Unity and gamedev (I’m coming from the webdev world), so I apologize in advance if my question has already been addressed or if this thread is not in the right place.

Following a tutorial, I’ve been trying to render a simple animated sine wave in 3D using a compute shader. Basically, I have a C# script that uses the compute shader to calculate the position of each point of the wave (they share a compute buffer), and then draws each point using Graphics.DrawMeshInstancedProcedural().

When the wave is made of 100 points, the script runs at roughly 30 fps.
When the wave is made of 2500 points, the frame rate drops to 5 fps…
So there is obviously a performance issue here. I tried to understand what was going on using the profiler, but I am a bit stuck.
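The profiler reports timings in milliseconds, so it helps to convert the frame rates above into frame times first. A small Python helper for illustration (not part of the original project):

```python
def frame_time_ms(fps: float) -> float:
    """Convert a frame rate (frames per second) into milliseconds per frame."""
    return 1000.0 / fps


# The two measurements reported above: (number of points, frames per second).
for points, fps in [(100, 30.0), (2500, 5.0)]:
    print(f"{points:>5} points: {fps:.0f} fps = {frame_time_ms(fps):.1f} ms per frame")
```

Going from 100 to 2500 points therefore adds roughly 167 ms of work per frame (200 ms vs. about 33 ms), which is the size of the block to look for in the profiler timeline.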

Here is what it looks like in the profiler: [profiler screenshot]

My understanding is that the “Editor loop” is at fault here. However, I built and ran the app, and the issue persists outside the editor…
Also, I do not have the “Scene” and “Game” tabs visible at the same time.

To be precise, I’m using Unity 2021.1.21f1 on a MacBook Pro (macOS 11.5.2).

I hope I have been clear and complete enough.
Any idea what I’ve been doing wrong?

Thanks.

About “Semaphore.WaitForSignal” and “Gfx.WaitForPresentOnGfxThread”, see: Extremely slow editor in 2019.2.0a7 (page 3, #post-5392530)

and “Gfx.WaitForPresent”

This means you are overloading your GPU. Since 2500 elements is nothing even for integrated GPUs, there’s probably something wrong with the way you coded your algorithm. Please show your compute shader and the code you’re using to dispatch it.


To add to the above: what mesh are you using for each point? If each “point” is a very high-resolution mesh, that may explain your performance.
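For a sense of scale, the geometry workload is simply instances × vertices per mesh. A quick Python check, assuming the instanced mesh is Unity’s built-in cube (24 vertices, because each face has its own 4 vertices for flat normals, and 36 indices):

```python
CUBE_VERTICES = 24  # assumption: Unity's built-in cube, 4 vertices per face
CUBE_INDICES = 36   # 6 faces x 2 triangles x 3 indices


def vertices_per_pass(instances: int, vertices_per_mesh: int = CUBE_VERTICES) -> int:
    """Vertices processed when every instance is drawn once (a single render pass)."""
    return instances * vertices_per_mesh


print(vertices_per_pass(2500))     # the 2500-point wave
print(vertices_per_pass(160_000))  # the 160,000-point case mentioned later in the thread
```

That is 60,000 vertices for 2500 cubes, a tiny workload for a GPU. The count is per pass: with shadows enabled, the shadow caster pass draws the same geometry again.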

Thank you all for your help.

@Neto_Kokku Here is the compute shader in question:

#pragma kernel WaveKernel
#define PI 3.14159265358979323846

RWStructuredBuffer<float3> _Positions;
uint _Resolution;
float _Step;
float _Time;

float2 GetUV (uint3 id) {
    return (id.xy + 0.5) * _Step - 1.0;
}

void SetPosition(uint3 id, float3 position) {
    if (id.x < _Resolution && id.y < _Resolution) {
        _Positions[id.x + id.y * _Resolution] = position;
    }
}

float3 Wave(float u, float v, float t)
{
    float3 p;
    p.x = u;
    p.y = sin(PI * (u + v + t));
    p.z = v;
    return p;
}

[numthreads(8, 8, 1)]
void WaveKernel (uint3 id: SV_DispatchThreadID) {
    float2 uv = GetUV(id);
    SetPosition(id, Wave(uv.x, uv.y, _Time));
}
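To sanity-check the math, here is a CPU-side reference of the same kernel in Python (a sketch, not how Unity executes it). It assumes the C# side dispatches ceil(_Resolution / 8) thread groups per axis, since that code isn’t shown, and it loops over every launched thread, including the ones the bounds check in SetPosition skips:

```python
import math

THREADS_PER_GROUP = 8  # matches [numthreads(8, 8, 1)]


def get_uv(x, y, step):
    """GetUV: map thread (x, y) to the center of its grid cell, inside (-1, 1)."""
    return (x + 0.5) * step - 1.0, (y + 0.5) * step - 1.0


def wave(u, v, t):
    """Wave: the point (u, sin(PI * (u + v + t)), v)."""
    return (u, math.sin(math.pi * (u + v + t)), v)


def run_wave_kernel(resolution, t):
    """Emulate Dispatch(WaveKernel, groups, groups, 1) on the CPU (assumed dispatch size)."""
    step = 2.0 / resolution
    groups = math.ceil(resolution / THREADS_PER_GROUP)
    positions = [None] * (resolution * resolution)
    for y in range(groups * THREADS_PER_GROUP):      # id.y over all launched threads
        for x in range(groups * THREADS_PER_GROUP):  # id.x over all launched threads
            if x < resolution and y < resolution:    # SetPosition's bounds check
                u, v = get_uv(x, y, step)
                positions[x + y * resolution] = wave(u, v, t)  # row-major index
    return positions


positions = run_wave_kernel(50, t=0.0)
print(len(positions), positions[0])
```

Every buffer slot gets written exactly once, and x (the first component) varies fastest along the buffer, matching the `id.x + id.y * _Resolution` indexing.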

As you can see, I’m rendering the function f(x, z) = sin(PI * (x + z + t)), where x and z belong to the interval [-1, 1].
_Resolution is the number of points rendered along each axis.
_Step is the size of a single point: each point is a 1x1x1 cube scaled by _Step, and since the interval [-1, 1] has length 2, _Step = 2f / _Resolution.
_Time is self-explanatory.
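The relationships between _Resolution, _Step, the instance count and the dispatch size can be written out explicitly. A Python sketch; the helper name wave_params is hypothetical, and it assumes the C# side dispatches ceil(_Resolution / 8) groups per axis, which isn’t shown in the thread:

```python
import math


def wave_params(resolution: int, threads_per_group: int = 8) -> dict:
    """Hypothetical helper: values the C# side would derive from _Resolution."""
    groups = math.ceil(resolution / threads_per_group)  # thread groups per axis
    threads = (groups * threads_per_group) ** 2         # threads actually launched
    return {
        "_Step": 2.0 / resolution,                   # interval [-1, 1] has length 2
        "instance_count": resolution ** 2,           # one cube per grid cell
        "thread_groups": groups,                     # Dispatch(kernel, groups, groups, 1)
        "idle_threads": threads - resolution ** 2,   # skipped by SetPosition's bounds check
    }


print(wave_params(10))  # the 100-point wave
print(wave_params(50))  # the 2500-point wave
```

For the 2500-point case (_Resolution = 50) this gives 7 × 7 groups and 3,136 threads, of which 636 do nothing. That small amount of waste is expected with 8×8 groups and does not explain a 5 fps frame rate.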

@richardkettlewell I’m using a simple cube mesh, but I think I found the root of the issue.

The tutorial has us create a Standard Surface Shader so that the color of each cube depends on its position in space.
Here is the full shader:

Shader "Graph/Point Surface GPU" {
   Properties {
      _Smoothness("Smoothness", Range(0,1)) = 0.5
   }

   SubShader {
      CGPROGRAM
      #pragma surface ConfigureSurface Standard fullforwardshadows addshadow
      #pragma instancing_options assumeuniformscaling procedural:ConfigureProcedural
      #pragma target 4.5

      struct Input {
         float3 worldPos;
      };

      float _Smoothness;
      float _Step; // used by ConfigureProcedural below, so it must be declared

      #if defined(UNITY_PROCEDURAL_INSTANCING_ENABLED)
      StructuredBuffer<float3> _Positions;
      #endif

      void ConfigureProcedural () {
         #if defined(UNITY_PROCEDURAL_INSTANCING_ENABLED)
         float3 position = _Positions[unity_InstanceID];

         unity_ObjectToWorld = 0.0;
         unity_ObjectToWorld._m03_m13_m23_m33 = float4(position, 1.0);
         unity_ObjectToWorld._m00_m11_m22 = _Step;
         #endif
      }

      void ConfigureSurface (Input input, inout SurfaceOutputStandard surface) {
         surface.Smoothness = _Smoothness;
         surface.Albedo = saturate((input.worldPos * 0.5) + 0.5);
      }
      ENDCG
   }

   FallBack "Diffuse"
}
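ConfigureProcedural builds a plain scale-then-translate matrix: _Step on the diagonal and the instance position in the last column (HLSL’s _mRC swizzles use zero-based row and column indices). A Python sketch of the same construction, plus the ConfigureSurface color mapping, applied to one corner of a unit cube; the position used is an arbitrary example:

```python
def object_to_world(position, step):
    """Mirror of ConfigureProcedural: uniform scale by step, then translate to position."""
    m = [[0.0] * 4 for _ in range(4)]  # unity_ObjectToWorld = 0.0
    for i in range(3):
        m[i][i] = step                 # _m00_m11_m22 = _Step
        m[i][3] = position[i]          # _m03_m13_m23 = position
    m[3][3] = 1.0                      # _m33 = 1.0
    return m


def transform_point(m, p):
    """Multiply the 4x4 matrix by the column vector (p, 1) and drop w."""
    v = (p[0], p[1], p[2], 1.0)
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(3))


def albedo(world_pos):
    """Mirror of ConfigureSurface: saturate(worldPos * 0.5 + 0.5), per component."""
    return tuple(min(max(c * 0.5 + 0.5, 0.0), 1.0) for c in world_pos)


m = object_to_world((0.5, -0.25, -0.98), step=0.04)
corner = transform_point(m, (0.5, 0.5, 0.5))  # a corner of the built-in cube mesh
print(corner, albedo(corner))
```

Because each cube is scaled by _Step = 2 / _Resolution, neighboring cubes just touch, and the color maps the [-1, 1] world range onto [0, 1].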

It turns out I have its equivalent for the URP (a Lit Shader Graph), which runs quite well: 30 fps for around 160,000 points!
The Built-in Render Pipeline (BRP), on the other hand, is awfully slow…


That looks pretty straightforward. What does your rendering C# code look like?

I just realized you’re using DrawMeshInstancedProcedural, not the indirect variant (which is the one I’m used to); maybe there’s something funky going on with it on the Unity side. I suggest using RenderDoc to capture a frame and see what is really happening on the GPU side. It’s a very useful tool when working with compute shaders and more custom rendering.

It should be pretty much the same as the indirect one. It’s actually faster if your script knows the draw counts, because you don’t have to waste time assigning them to the argsBuffer, as you do for the indirect version.
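For comparison, Graphics.DrawMeshInstancedIndirect reads five uints from its args buffer: index count per instance, instance count, start index location, base vertex location and start instance location. DrawMeshInstancedProcedural instead takes the instance count directly as a method argument. A Python sketch of that layout; the helper name indirect_args is hypothetical, and the values assume the built-in cube (36 indices) and the 2500-point wave:

```python
def indirect_args(index_count: int, instance_count: int,
                  start_index: int = 0, base_vertex: int = 0,
                  start_instance: int = 0) -> list:
    """Hypothetical helper: the five uints an indirect instanced draw reads."""
    return [index_count, instance_count, start_index, base_vertex, start_instance]


args = indirect_args(36, 50 * 50)  # built-in cube, 50 x 50 grid
print(args)
```

In C#, the per-submesh values usually come from mesh.GetIndexCount, mesh.GetIndexStart and mesh.GetBaseVertex, and the array is uploaded to a ComputeBuffer before the draw.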
