Performance issue in compute shader with InterlockedMax

Hi, I am calculating the bounds of a mesh on the GPU, and profiling shows that this is the bottleneck of my GPU work, probably because of the atomic writes. Is there any way to speed this up? Here’s my very simple compute shader code:

void CreateBinMinMax(uint3 id : SV_DispatchThreadID) {
    if (id.x >= numParticles) return;
    float3 position = positions[id.x].xyz;

    InterlockedMin(minMaxCoords[0].minX, asuint(position.x));
    InterlockedMax(minMaxCoords[0].maxX, asuint(position.x));
    InterlockedMin(minMaxCoords[0].minY, asuint(position.y));
    InterlockedMax(minMaxCoords[0].maxY, asuint(position.y));
    InterlockedMin(minMaxCoords[0].minZ, asuint(position.z));
    InterlockedMax(minMaxCoords[0].maxZ, asuint(position.z));
}

I have already tried several things, such as varying the thread group size, to no avail. Any help would be much appreciated!

Hi!
If it’s really the bottleneck, you could try doing one InterlockedMin and one InterlockedMax, each on a uint3.
Something like

uint3 uPosition = uint3(asuint(position.x), asuint(position.y), asuint(position.z));
InterlockedMin(minMaxCoords[0].minCoord, uPosition);
InterlockedMax(minMaxCoords[0].maxCoord, uPosition);

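Note that this requires changing minMaxCoords as well. Also check that it compiles on your target: the HLSL documentation describes the Interlocked functions as taking scalar int or uint operands, so a uint3 overload may not be available.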
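One thing to watch when feeding asuint(position) into InterlockedMin/InterlockedMax: the raw bit patterns only sort correctly as unsigned integers for non-negative floats. Negative floats have the sign bit set, so they compare greater than every positive value, and their order among themselves is reversed. A common workaround is to remap the bits so that unsigned order matches float order. The sketch below is one way to do that; the helper names are hypothetical and not part of the original shader.

```hlsl
// Hypothetical helper: maps a float to a uint whose unsigned ordering
// matches the ordering of the original floats (NaNs are not handled).
uint FloatToOrderedUint(float f)
{
    uint u = asuint(f);
    // Negative floats: flip all bits. Non-negative floats: flip only the sign bit.
    uint mask = ((u & 0x80000000u) != 0) ? 0xFFFFFFFFu : 0x80000000u;
    return u ^ mask;
}

// Hypothetical inverse, for decoding the result after the dispatch.
float OrderedUintToFloat(uint u)
{
    // Keys with the top bit set came from non-negative floats.
    uint mask = ((u & 0x80000000u) != 0) ? 0x80000000u : 0xFFFFFFFFu;
    return asfloat(u ^ mask);
}
```

With this, the calls would become e.g. InterlockedMin(minMaxCoords[0].minX, FloatToOrderedUint(position.x)), with the min fields reset to 0xFFFFFFFF and the max fields reset to 0 before each dispatch.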

Thanks, I found another solution. It is a bit more complicated, but it mitigates the issue by using groupshared variables, which seem to perform far better than an RWStructuredBuffer for this task:

void CreateBinMinMax(uint3 id : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID) {
    // The first thread of each group resets the groupshared accumulators.
    if (GTid.x == 0) {
        minX_local = 1000000000;
        minY_local = 1000000000;
        minZ_local = 1000000000;

        maxX_local = -1000000000;
        maxY_local = -1000000000;
        maxZ_local = -1000000000;
    }
    AllMemoryBarrierWithGroupSync();
    // No early return here: every thread in the group must reach both barriers.
    if (id.x < numParticles) {
        float3 position = positions[id.x].xyz;
        // Positions are scaled by factor and converted to integers,
        // because the Interlocked functions operate on int/uint only.
        InterlockedMin(minX_local, factor * position.x);
        InterlockedMax(maxX_local, factor * position.x);
        InterlockedMin(minY_local, factor * position.y);
        InterlockedMax(maxY_local, factor * position.y);
        InterlockedMin(minZ_local, factor * position.z);
        InterlockedMax(maxZ_local, factor * position.z);
    }
    // GroupMemoryBarrierWithGroupSync() should be sufficient here,
    // since only groupshared memory is being synchronized.
    AllMemoryBarrierWithGroupSync();
    // One thread per group merges the group result into the global buffers.
    if (GTid.x == 0) {
        InterlockedMin(minX[0], minX_local);
        InterlockedMax(maxX[0], maxX_local);
        InterlockedMin(minY[0], minY_local);
        InterlockedMax(maxY[0], maxY_local);
        InterlockedMin(minZ[0], minZ_local);
        InterlockedMax(maxZ[0], maxZ_local);
    }
}

All variables with the “_local” suffix are groupshared, so the atomics on the RWStructuredBuffer happen only once per group, which seems to greatly reduce the overhead.
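For anyone trying to reproduce this, the kernel relies on declarations that are not shown. The following is only a sketch of what they might look like: the identifiers match the kernel, but the types, the group size, and the value of factor are assumptions, and the ToFixedMin/ToFixedMax/FromFixed helpers are hypothetical additions.

```hlsl
// Assumed declarations for the kernel above; types and sizes are guesses.
#define GROUP_SIZE 256                 // hypothetical thread group size

StructuredBuffer<float4> positions;    // particle positions (xyz is used)
uint  numParticles;
float factor;                          // fixed-point scale, e.g. 1000.0 (assumed)

// Global results, one element each. They need to be reset before every
// dispatch (mins to a large value, maxes to a small one).
RWStructuredBuffer<int> minX;
RWStructuredBuffer<int> maxX;
RWStructuredBuffer<int> minY;
RWStructuredBuffer<int> maxY;
RWStructuredBuffer<int> minZ;
RWStructuredBuffer<int> maxZ;

// Per-group accumulators, shared by all threads of one thread group.
groupshared int minX_local;
groupshared int maxX_local;
groupshared int minY_local;
groupshared int maxY_local;
groupshared int minZ_local;
groupshared int maxZ_local;

// A float-to-int conversion truncates toward zero. Using floor for the
// minimum and ceil for the maximum keeps the bounds conservative.
int ToFixedMin(float v) { return (int)floor(v * factor); }
int ToFixedMax(float v) { return (int)ceil(v * factor); }

// Converts a fixed-point result back to the original units.
float FromFixed(int v) { return (float)v / factor; }

// The kernel itself would then be declared as:
// [numthreads(GROUP_SIZE, 1, 1)]
// void CreateBinMinMax(uint3 id : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID)
```

Note that factor * position has to stay within the int range, and within the ±1000000000 sentinels used for initialization, for the result to be meaningful.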
