Compute Shader variant compilation time excessive

I’ve been working on a compute shader performing a GPU simulation for ~6 months, but recently its compile time (“Compiling Compute Variants”) has been steadily increasing to unmanageable levels: from 20-30 seconds to 6-7 minutes in a matter of days.

I have tried clearing out my project’s Library folder (there have been suggestions it can get corrupted), with no luck.

I’ve narrowed the suspected issue down to a specific function I wrote to create logs in a compute buffer (which can then be read on the CPU side). The function itself is no different from many others I have, but I call it much more often, from all over the simulation.

I assume Unity is trying to do something clever and creating lots of different variants, but I really don’t understand why it’s suddenly deciding to do this, why it is necessary (since the added code is fundamentally no different from the existing code), or how to avoid it. Any help/suggestions would be appreciated!

uint AddNotice(uint id, uint type, uint notice, float4 noticeData) {
    // Don't bother trying to add if the buffer is already known to be full
    if (GPUnoticeBufferSize > CounterBufferIndex[0].noticeBufferCount) {
        bool valid_slot = false;
        uint count = 0;
        uint currentIndex;
        uint noticeIndex;
        
        while (valid_slot == false && count < GPUnoticeBufferSize) {
            
            // Atomically claim the next index (circular); the pre-increment value is used
            InterlockedAdd(CounterBufferIndex[0].noticeIndex, 1, currentIndex);
            currentIndex = currentIndex % GPUnoticeBufferSize;                      
   
            //Keep within bounds of buffer size;
            if (CounterBufferIndex[0].noticeIndex >= GPUnoticeBufferSize) {
                uint expectedValue = CounterBufferIndex[0].noticeIndex;
                uint wrappedIndex = expectedValue % GPUnoticeBufferSize;
                            
                while (expectedValue != wrappedIndex) {
                    InterlockedCompareExchange(CounterBufferIndex[0].noticeIndex, expectedValue, wrappedIndex, expectedValue);
                    expectedValue = CounterBufferIndex[0].noticeIndex;
                }
            }
            
            noticeIndex = currentIndex;
                 
            //If notice slot is empty - then take this slot
            if (GPUExtractNoticesBuffer[noticeIndex].noticeRead == 0) {
              
                GPUExtractNoticesBuffer[noticeIndex].notice = notice;
                GPUExtractNoticesBuffer[noticeIndex].obj = id; 
                
                // Calculate the combined type
                uint objectBaseType = clamp(type, 1, 2); // Ensure objectBaseType is valid (1 = unit, 2 = projectile)
                uint actualType = 999999999;
                if (objectBaseType == 1) {
                    actualType = CombinedBuffer[id].CBunitType;
                } else if (objectBaseType == 2) {
                    actualType = CombinedProjectileBuffer[id].PRJtype;
                }
                uint combinedType = (actualType * 10) + objectBaseType; // Combine the two types
                
                GPUExtractNoticesBuffer[noticeIndex].objType = combinedType;
                GPUExtractNoticesBuffer[noticeIndex].noticeData = noticeData;
                GPUExtractNoticesBuffer[noticeIndex].noticeTime = totalSimTime;               
                GPUExtractNoticesBuffer[noticeIndex].noticeRead = 1;  //0 = No Notice, 1=Awaiting Reading, 2 = Read
                                               
                // Increment the used notice-buffer counter
                InterlockedAdd(CounterBufferIndex[0].noticeBufferCount, 1);

                valid_slot = true;
            }
            count = count + 1;
        }
        
        if (valid_slot == false) {
            CounterBufferIndex[0].majorErrorFlagger = 3; //Set major error - 3 = notice buffer full;
            return GPUnoticeBufferSize;
        } else {
            return noticeIndex;
        }
    } else {
        CounterBufferIndex[0].majorErrorFlagger = 3; //Set major error - 3 = notice buffer full;
        return GPUnoticeBufferSize;
    }
}

Hi! The likely problem here is that the compiler tries to unroll the while loop.
Try adding [loop] before the while keyword.
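For reference, here is a minimal sketch of where the attribute goes (the identifiers are illustrative, not from your shader) - it sits directly before the loop statement it applies to:

```hlsl
// [loop] asks the compiler to emit a real loop instead of unrolling it.
[loop]
while (!found && count < bufferSize) {
    // ... search body ...
    count = count + 1;
}

// The same attribute works on for loops:
[loop]
for (uint i = 0; i < bufferSize; i++) {
    // ... body ...
}
```

Note it is a per-loop attribute, so each while/for you want kept as a loop needs its own `[loop]`.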

Hi, thanks for the suggestion. I’ve tried [loop] and [fastopt] on all my while/for loops, but unfortunately it hasn’t helped compile times. (I’ve tried adding [loop] on both the same line as the while and the preceding line.)

I am tempted to suggest that the [loop] option is not working as expected, given some basic testing:
• Original code (1 while loop)
---- 131 seconds, 1.11 MB compiled shader size

• [loop] + original code (1 while loop)
---- 130 seconds, 1.11 MB compiled shader size

• Original code with the while loop commented out (literally the 2 lines, at start and finish)
---- 60 seconds, 1.07 MB compiled shader size

So while you are right that the while loops seem to be causing much of the compile-time issue, I don’t seem to be able to stop it - could the fact I’m on BiRP (the Built-in Render Pipeline) be a problem?

Aside…

After more research, I think I am also suffering from the inline nature of functions in HLSL, and really need a [loop] equivalent for functions too (at least during development, where trading a reasonable runtime performance hit for avoiding potentially huge compile times is fine). From a bit of research:

  • SPIR-V appears to support a “DontInline” function control, which sounds like exactly what I want (I am using Vulkan for all platforms) - it’s been in the spec since at least 2019.

  • Some versions of DirectXShaderCompiler on GitHub handle a kind of “noinline / isNoInline” attribute, and have some means of honouring it when compiling to SPIR-V.

  • But the HLSL documentation suggests noinline is not allowed - perhaps this is a versioning issue and the documentation is behind the GitHub DirectXShaderCompiler, or the compiler-internal “noinline / isNoInline” was never meant to be used from HLSL?

  • And I’m unclear whether Unity uses HLSLcc or DirectXShaderCompiler to cross-compile to Vulkan in BiRP in Unity 6.0 (does URP/HDRP make a difference?) - but the main question is whether the way Unity uses either of these offers any opportunity to use a noinline / DontInline statement on functions in compute shaders?

If not, do any of the roadmaps for Unity 6.1/7 suggest fundamental changes/upgrades to the shader compilation approach which might allow use of SPIR-V capabilities like these?

(tagging @bgolus as I’ve found lots of your shader insights very helpful in the past and seen you seem to understand a bit of how SPIR-V compilation works under the hood in unity)

It’s only a hint to the compiler. It may choose to do its own thing despite you telling it to not unroll.

No, it doesn’t matter which render pipeline you’re using.

DXC can be used: the shader needs to add #pragma use_dxc [API list] (the API list being optional) to enable it. DXC support is considered experimental, though.
Without this pragma, shaders are compiled using FXC for most graphics APIs.
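As a minimal sketch of opting a compute shader into DXC (kernel name illustrative):

```hlsl
#pragma kernel CSMain
// Opt this shader into compilation with DXC instead of FXC.
// An optional API list can follow the pragma to restrict which
// graphics APIs use DXC.
#pragma use_dxc

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID) {
    // ... kernel body ...
}
```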
I don’t think there’s a way to not inline functions - the docs say functions are always inlined.


Thanks @alekandrk - using #pragma use_dxc seems to solve my compile-time issue (now ~5 seconds with everything in my code enabled, down from ~400 seconds).

Out of interest, if I’m not making use of any SM6+ specific features/functions (and given the caveats on what works in the Google doc) [DXC shader compiler status & doc - Google Docs] - will the compiled shader still be essentially an SM5 shader that works on any SM5-capable Vulkan hardware? Not that it’s a big issue, as I can use the old compiler for production builds anyway if not.


On ‘noinline’: I agree that appears to be the official docs’ position, but various dev comments also seem to indicate it has been present in some form for a while (unless it was removed) - perhaps other upstream/downstream elements were just never completed for it to work:
[SPIR-V] noinline attribute does not prevent function from being inlined during SPIR-V generation · Issue #3158 · microsoft/DirectXShaderCompiler
[SPIR-V] Add noinline support for SPIR-V generation by LLJJDD · Pull Request #3163 · microsoft/DirectXShaderCompiler

After some quick testing: interestingly, when using #pragma use_dxc, applying [noinline] to my single most repeated function seems to drop the compiled size from 480 KB to 416 KB, so perhaps noinline is working and just isn’t covered in the official documentation.
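For anyone trying the same experiment, this is the shape of what I tested - a sketch only, since [noinline] is undocumented and may be silently ignored depending on compiler version (the function body here is elided):

```hlsl
// Undocumented attribute: under DXC this appears to keep the function
// as a real call instead of inlining every call site.
[noinline]
uint AddNotice(uint id, uint type, uint notice, float4 noticeData) {
    // ... function body as above ...
    return 0;
}
```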

SM (or Shader Model) is purely a DirectX term. If your only renderer is Vulkan, it should be fine.
Please note that DX11 doesn’t support shaders compiled with DXC.
