Help with bug in compute shader

First off, I am relatively new to compute shaders in Unity, but I am an experienced developer overall.

I am creating a particle simulator. The main logic that computes each particle’s velocity runs in a compute shader. The issue is that some particles randomly fail to update for a few frames (see GIFs). When I copy the same logic into a C# script, it works as expected.

My thought is that the bug must be in the compute shader. Specifically, I thought it might be a race condition when the shader updates the particle velocity. I have tried using GroupMemoryBarrier() and GroupMemoryBarrierWithGroupSync() in the shader (around line 49), but that did not work. My other thought was that it may be a floating point issue, as I know GPUs can handle floating point math differently than a CPU.

I am stuck on what the issue could be. What could I do to debug it better?

Here is an example of the behavior where each particle is set to repulse all particles in its range.

Bugged behavior (compute shader):
[GIF: Bugged-GPU.gif]

Correct behavior (CPU):
[GIF: Correct-CPU.gif]

Compute shader pseudocode. I replaced some sections with comments to make it more readable:

[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
  
    // if the current container is empty, skip it
    if (containers[id.z].length == 0)
    {
        return;
    }

    int containerIndex = id.z;

    // set of neighbors
    int neighbors[9];
  
    // SET UP NEIGHBORS

    // loop through the neighbors
    for (int i = 0; i < 9; i++)
    {
        int neighborIndex = neighbors[i];

        Container c1 = containers[containerIndex];
        Container c2 = containers[neighborIndex];

        if (c1.length == 0 || c2.length == 0)
        {
            continue;
        }
        if (c1.length <= id.x || c2.length <= id.y)
        {
            continue;
        }

        Particle p1 = particles[c1.offset + id.x];
        Particle p2 = particles[c2.offset + id.y];

        if (p1.objID == p2.objID || p1.objID == 0 || p2.objID == 0)
        {
            continue;
        }

        // GET DISTANCE SQUARED ACCOUNTING FOR SCREEN WRAP

        if (distanceSquared < 400*400)
        {
            // CALCULATE FORCE

            particles[c1.offset + id.x].velocity += forceVector * force;
        }
    }
}

The shader is called as follows:
computeShader.Dispatch(0, (numParticlesPerContainer - 1) / 8 + 1, (numParticlesPerContainer - 1) / 8 + 1, numContainers);

If you need any more info to help please ask.

This is a sync issue. You are looking at neighbors at the same time as those neighbors are being updated, so the results will be unpredictable.

The most straightforward way to solve this is via double-buffering. Have 2 compute buffers. Read from one, and write to the other, then swap them every frame. That way, you’re not reading from and writing to the same buffer at the same time.
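
A rough sketch of what a double-buffered kernel could look like. Everything named here is an assumption for illustration: the `Particle` layout, the buffer names `particlesIn` and `particlesOut`, `particleCount`, and `repulsionStrength` are hypothetical, and the sketch uses one thread per particle with a brute-force loop instead of the container/neighbor scheme from the question. Screen wrap is ignored.

```hlsl
#pragma kernel CSMain

// Hypothetical particle layout; the real struct is not shown in the thread.
struct Particle
{
    float2 position;
    float2 velocity;
    int objID;
};

// Last frame's state, read-only in this kernel.
StructuredBuffer<Particle> particlesIn;
// This frame's results; the script swaps the two buffers after each Dispatch.
RWStructuredBuffer<Particle> particlesOut;

uint particleCount;       // assumed to be set from the script
float repulsionStrength;  // assumed to be set from the script

[numthreads(64, 1, 1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    if (id.x >= particleCount)
    {
        return;
    }

    Particle self = particlesIn[id.x];
    float2 push = float2(0, 0);

    // Read every other particle from the input buffer only.
    for (uint j = 0; j < particleCount; j++)
    {
        Particle other = particlesIn[j];
        if (j == id.x || other.objID == 0 || other.objID == self.objID)
        {
            continue;
        }

        float2 d = self.position - other.position;
        float distanceSquared = dot(d, d);
        if (distanceSquared > 0 && distanceSquared < 400 * 400)
        {
            // Unit vector pointing away from the other particle.
            push += d * rsqrt(distanceSquared) * repulsionStrength;
        }
    }

    // Each thread writes exactly one element of the output buffer.
    self.velocity += push;
    particlesOut[id.x] = self;
}
```

On the script side, the two buffers would be bound with SetBuffer before each Dispatch and then exchanged, so that this frame's output becomes next frame's input.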

I thought of that, but after implementing it the behavior was the same. The only write operation is to the velocity property of the particles. However, I am never reading that property in the shader, so I’m not sure how that could cause a sync issue. Each particle will have its velocity written to multiple times though.

I wanted to use some function like interlockedAddFloat() for atomic addition, but there is no such function for floats.
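
For reference, a float atomic add is often emulated with a compare-exchange loop on the float’s bit pattern. A minimal sketch, assuming the values are stored as raw uint bits in a hypothetical buffer called `velocityBits` (one float component per element); the helper name `AtomicAddFloat` is made up here, and some shader compilers are picky about loops like this one.

```hlsl
// Hypothetical buffer: each element holds the bit pattern of one float.
RWStructuredBuffer<uint> velocityBits;

// Emulated atomic float add (hypothetical helper, not a built-in).
void AtomicAddFloat(uint index, float delta)
{
    uint expected = velocityBits[index];
    uint original;

    [allow_uav_condition]
    while (true)
    {
        // Compute the new value from the value we believe is stored.
        uint desired = asuint(asfloat(expected) + delta);

        // Store it only if nobody changed the element in the meantime.
        InterlockedCompareExchange(velocityBits[index], expected, desired, original);
        if (original == expected)
        {
            break;
        }

        // Another thread got there first; retry with what it stored.
        expected = original;
    }
}
```

No contribution is lost this way, but the order of the float additions still varies between runs, so results are not guaranteed to be bit-identical.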

Ah, I see. Specifically, this line can refer to the same address on different kernel invocations?

particles[c1.offset + id.x].velocity += forceVector * force;

Yes, that’s a race condition. The += is a read-modify-write: the shader has to fetch the velocity, do the add in the shader units, and write the result back. Note that on a GPU there can be over 800 cycles of latency when fetching from main memory, not counting any writes. To put that in perspective, you have maybe 50 cycles’ worth of math shown there, so to say memory latency can be a problem on a GPU is an understatement. The value is likely fetched long before the math runs and written back much later, and there may be different versions of that value in the caches on different execution units.
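
To make the read-modify-write explicit, here is roughly what the `+=` amounts to for each thread. The buffer name `velocities` and the helper `AddVelocityUnsafe` are hypothetical.

```hlsl
// Hypothetical buffer holding one velocity per particle.
RWStructuredBuffer<float2> velocities;

// Equivalent of `velocities[i] += delta;`, written out step by step.
void AddVelocityUnsafe(uint i, float2 delta)
{
    float2 v = velocities[i]; // 1. load the current value
    v += delta;               // 2. add in registers
    velocities[i] = v;        // 3. store the result
}
```

If two threads both complete step 1 for the same `i` before either reaches step 3, the later store overwrites the earlier one and one of the two contributions is lost.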

Is there a way to quantize the data to an integer? Maybe store int(round(velocity*4096))? That way you can use the InterlockedAdd functions. The documentation says they require Shader Model 5, though.
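
A minimal sketch of that idea, assuming a fixed-point scale of 4096 as suggested. The buffer names, the flat two-ints-per-particle layout, and the split into `Accumulate` and `Resolve` kernels are all assumptions for illustration, not the poster’s actual code.

```hlsl
#pragma kernel Accumulate
#pragma kernel Resolve

// Fixed-point scale from the suggestion: 1 unit = 1/4096.
static const float FIXED_SCALE = 4096.0;

// Hypothetical layout: two ints (x, y) per particle.
RWStructuredBuffer<int> velocityFixed;
// Hypothetical float view of the velocities for the rest of the simulation.
RWStructuredBuffer<float2> velocities;

// Hypothetical inputs: one force contribution per thread and its target particle.
StructuredBuffer<float2> deltas;
StructuredBuffer<uint> targets;
uint contributionCount;
uint particleCount;

[numthreads(64, 1, 1)]
void Accumulate (uint3 id : SV_DispatchThreadID)
{
    if (id.x >= contributionCount)
    {
        return;
    }

    uint p = targets[id.x];

    // Quantize the float contribution to fixed point.
    int2 q = int2(round(deltas[id.x] * FIXED_SCALE));

    // Integer atomics: concurrent adds to the same particle are all applied.
    InterlockedAdd(velocityFixed[2 * p + 0], q.x);
    InterlockedAdd(velocityFixed[2 * p + 1], q.y);
}

[numthreads(64, 1, 1)]
void Resolve (uint3 id : SV_DispatchThreadID)
{
    if (id.x >= particleCount)
    {
        return;
    }

    // Convert the accumulated fixed-point value back to float.
    velocities[id.x] = float2(velocityFixed[2 * id.x + 0],
                              velocityFixed[2 * id.x + 1]) / FIXED_SCALE;
}
```

With a 32-bit int and a scale of 4096, each component has a resolution of 1/4096 and overflows at a magnitude of about 524288, so the scale is a trade-off between precision and range.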

Quantizing the data to an int seems to have done the trick. I wish there was an InterlockedAddFloat function.
Thanks!