Is there a way to control back-face culling per-instance during GPU instancing?

I am drawing a lot of meshes using DrawMeshInstancedIndirect. However, some have a negative scale (flipped models), and these get rendered inside-out due to back-face culling.

Is there a way to control per-instance back-face culling (say in vertex shader)? If not, what is the best way to solve this?

Do I need to draw flipped meshes with a second call to DrawMeshInstancedIndirect with a different shader that has culling inverted? I am trying to avoid this since it adds a lot of complexity in bookkeeping buffers.

I found this issue, sadly marked as “Won’t fix”, and I believe that would be useful to avoid the need for copy-pasting a shader just to invert its culling: Unity Issue Tracker - Graphics.DrawMesh does not handle mirror (negative scale) transforms well, should flip backface culling properly

I was also considering disabling back-face culling and using the facing : VFACE parameter of the fragment shader, but that would still add extra rasterization cost compared to the true back-face culling done by hardware. Would this even work in surface shaders?

Any suggestions/tricks are welcome, thanks!

I’m using it on surface shaders just fine.
Just put

half vface : VFACE;

inside your surface/fragment input struct.

As for the culling direction, you can use a material property to drive it, along with other things:

Shader "shadername"
{
    Properties
    {
        [Enum(UnityEngine.Rendering.CullMode)] _Cull ("Cull", Float) = 2
        [Enum(UnityEngine.Rendering.BlendMode)] _SrcBlend ("Source Blend", Float) = 1
        [Enum(UnityEngine.Rendering.BlendMode)] _DstBlend ("Dest Blend", Float) = 0
        [Enum(UnityEngine.Rendering.CompareFunction)] _ZTest ("Z Test", Float) = 2
        [Enum(Off, 0, On, 1)] _ZWrite ("Z Write", Float) = 1
    }

    SubShader
    {
        Cull [_Cull]
        ZTest [_ZTest]
        ZWrite [_ZWrite]
        Blend [_SrcBlend] [_DstBlend]

        CGPROGRAM

        ......

You can then use Material Property Blocks to modify that without breaking GPU instancing.

But maybe this breaks instancing, since it’s a render state change?
I don’t know for sure, I have never actually verified it.

Using Material Property Blocks does break batching though. (different thing)

Correction after several edits: Material Property Blocks DO break batching, but they DO NOT break instancing as long as your material’s shader properly supports GPU instancing and you have checked the “enable GPU instancing” option in the material inspector.
(this topic is quite confusing… :stuck_out_tongue: )


You could do the backface culling in a geometry shader.

However, be aware that geometry shaders are frowned upon because they can be slow. Probably still better than doing it in a pixel shader because the clip command disables early-z optimization.

Another idea would be to double the vertex and index buffer so that you have both front- and back-facing polygons in it. You’d also have to add a vertex attribute to indicate whether a vertex belongs to the clockwise or counter-clockwise winding. This would allow you to reject vertices in the vertex shader by setting SV_CullDistance to -1. This is not entirely free either, though, because the GPU would have to call the vertex shader twice as often.

I’m wondering if you could use a compute shader to fill a compute buffer with the arguments for the indirect draw call. You’d still need two submeshes with different winding because you can’t change the culling mode within a single draw call.

I would stay away from geometry shaders if you want to publish for multi-platform, as they are poorly supported and are slow in general. It’s a pity but vendors don’t seem to like them much, hence the poor support.
Apple’s Metal is infamous for not supporting them.

Thanks for the reply! This is useful for making one shader that can support both culling modes, great tip!

Now this is an interesting idea! However, AFAIK material property blocks are unable to change anything that changes the render state. You can change the value, but that does not affect the render state.

And even if it did work, I don’t think there is a way of applying material properties per-instance, is there? One call of DrawMeshInstancedIndirect will get material + property block and that will be applied to all rendered instances.

Property blocks are useful when you need to do multiple calls with the same material but some minor changes. Even if this worked with the culling, I’d have to separate the normal and flipped instances to separate DrawMeshInstancedIndirect calls.

When using DrawMeshInstancedIndirect, I don’t think that batching is supported/considered. I also believe that “enable GPU instancing” on material has nothing to do with instancing using DrawMeshInstancedIndirect. All my materials have this off and it works fine.

This is also interesting, thanks for the ideas! It just seems even more complex than doing the two draw calls from an infrastructure point of view. The question is performance: two draw calls with fewer instances per call, or one draw call with more instances that does extra work? Any intuition on this? This is really hard to compare without a benchmark, and I bet that performance will vary across GPU vendors and generations.

However, you mentioned that you can discard vertices in the vertex shader by using SV_CullDistance. Can you elaborate on how this works? I have many shaders that reject geometry in the vertex phase, and what I do is set the vertices to zero so that the resulting triangles have zero area and are discarded. Which one is better?

Yeah, I meant that you have to program the shader to support it using CBUFFERs (UNITY_INSTANCING_BUFFER) for the properties. The “enable GPU instancing” toggle in the material just enables the instancing keyword to activate that shader variant. I guess it’s implicit with DrawMeshInstancedIndirect.
And yeah, batching is only a thing when using Unity’s Mesh Renderers.
I just wanted to clarify things because I had edited my answer like 3 times :stuck_out_tongue:

Oh, I probably should have posted the shader for reference, here is a simplified but functional one:

Shader "InstancedShader" {
    Properties {
        _MainTex("Albedo (RGB)", 2D) = "black" {}
    }
    SubShader {
        Tags {"Queue" = "Geometry" "RenderType" = "Opaque" }

        CGPROGRAM
        #pragma surface surf Standard fullforwardshadows addshadow
        #pragma multi_compile_instancing
        #pragma instancing_options procedural:setupInstancing assumeuniformscaling

        sampler2D _MainTex;

        struct Input {
            float2 uv_MainTex;
        };

        struct InstanceData {
            float3 position;
            float scaleX; // either +1 or -1
        };

#ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        StructuredBuffer<InstanceData> _InstanceData;

        // Builds this instance's object-to-world matrix from the buffer entry.
        void setupInstancing() {
            InstanceData data = _InstanceData[unity_InstanceID];
            // Matrix columns: X axis (mirrored when scaleX is -1), Y axis, Z axis, translation.
            unity_ObjectToWorld._11_21_31_41 = float4(data.scaleX, 0.0, 0.0, 0.0);
            unity_ObjectToWorld._12_22_32_42 = float4(0.0, 1.0, 0.0, 0.0);
            unity_ObjectToWorld._13_23_33_43 = float4(0.0, 0.0, 1.0, 0.0);
            unity_ObjectToWorld._14_24_34_44 = float4(data.position, 1.0);
        }
#endif

        void surf (Input IN, inout SurfaceOutputStandard o) {
            o.Albedo = tex2D(_MainTex, IN.uv_MainTex).rgb;
        }

        ENDCG
    }
}

If you have, let’s say, 1000 draw calls but only one particular mesh has both positive and negative scale, you’d end up with only 1001 draw calls. However, if you have 1000 instanced draw calls and all of them have both positive and negative scale, you’d end up with 2000 draw calls. Usually, the first case is much more likely, so avoiding the second draw call is probably not worth the extra complexity.

SV_CullDistance and SV_ClipDistance are special shader semantics for custom clip/cull planes to be used in addition to the frustum clip planes. It’s the distance of each vertex to the clip plane. If all 3 vertices of a triangle have negative cull distance, the triangle will be culled. If some but not all vertices have negative clip and cull distances, the triangle will be clipped but not culled.

Planes are usually given in the form dot(n, v) + d = 0, where n is the normal of the plane, v is a vertex position to test and d is the negative distance of the plane from the origin along the normal. dot(n, v) is the distance of v along the normal vector n, so dot(n, v) + d will be the desired distance of v from the plane.
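To make the plane formula concrete, here is a minimal sketch of a vertex function that writes dot(n, v) + d into SV_ClipDistance. The names _ViewProj and _ClipPlane are assumptions for illustration only (in Unity the matrix would presumably be UNITY_MATRIX_VP), and it is written as plain HLSL rather than as a surface shader. For the plane y = 2 you would have n = (0, 1, 0) and d = -2, so a vertex at y = 5 gets a distance of 3 (kept) and a vertex at y = 0 gets -2 (clipped).

```hlsl
// Hypothetical constants, not taken from this thread.
cbuffer ClipPlaneExample
{
    float4x4 _ViewProj;  // assumed world-to-clip matrix
    float4   _ClipPlane; // xyz = unit normal n, w = d in dot(n, v) + d = 0
};

struct VSOutput
{
    float4 positionCS   : SV_Position;
    float  clipDistance : SV_ClipDistance0;
};

VSOutput vert(float3 positionWS : POSITION)
{
    VSOutput o;
    o.positionCS = mul(_ViewProj, float4(positionWS, 1.0));
    // Signed distance of the vertex from the plane: positive on the side the
    // normal points to, negative on the other side (which gets clipped).
    o.clipDistance = dot(_ClipPlane.xyz, positionWS) + _ClipPlane.w;
    return o;
}
```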

But in your case, you wouldn’t have actual clip planes; you’d simply set the value to a positive or negative number. It is probably the same as making zero-area triangles, performance-wise. Alternatively, you can also move the triangle to the negative side of the near plane in clip space, where it is culled as well (moving it to the view-space origin should do the trick for a perspective projection). Personally, I’d prefer a culled triangle over a degenerate one, but performance-wise it’s probably the same. There are many ways to skin a cat :wink:

https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/user-clip-planes-on-10level9
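As a rough sketch of the doubled-buffer idea, the vertex function below rejects whichever copy of a triangle has the wrong winding for the current instance. It reuses the InstanceData layout from the shader posted earlier. The per-vertex winding attribute (here in TEXCOORD1) and the _ViewProj constant are assumptions for illustration, and I have not verified how well SV_CullDistance is supported on every platform Unity targets, so treat it as untested.

```hlsl
struct InstanceData
{
    float3 position;
    float  scaleX; // either +1 or -1
};

StructuredBuffer<InstanceData> _InstanceData;

cbuffer PerFrameExample
{
    float4x4 _ViewProj; // hypothetical world-to-clip matrix
};

struct VSInput
{
    float3 positionOS : POSITION;
    // Hypothetical attribute: +1 on the copy with the original index order,
    // -1 on the duplicated copy whose index order is reversed.
    float  winding    : TEXCOORD1;
    uint   instanceID : SV_InstanceID;
};

struct VSOutput
{
    float4 positionCS   : SV_Position;
    float  cullDistance : SV_CullDistance0;
};

VSOutput vert(VSInput v)
{
    InstanceData data = _InstanceData[v.instanceID];

    // Same transform as the posted setupInstancing(): mirror along X, then translate.
    float3 positionWS = float3(v.positionOS.x * data.scaleX, v.positionOS.yz) + data.position;

    VSOutput o;
    o.positionCS = mul(_ViewProj, float4(positionWS, 1.0));
    // Positive when the copy's winding matches the instance's mirroring,
    // negative otherwise. All three vertices of a triangle share the same
    // sign, so a mismatched copy is culled as a whole.
    o.cullDistance = v.winding * data.scaleX;
    return o;
}
```

With this, an unflipped instance keeps the original-winding copy and a flipped instance keeps the reversed copy, so regular back-face culling can stay enabled for the one draw call.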
