I am trying to fill a volume texture with noise values using a compute shader. The thread group size is 8,8,8 and the volume texture resolution is 128,64,64, so the idea is to dispatch the compute shader with 16,8,8 groups so that each thread is responsible for one voxel of the texture. However, it seems I can't do that with a single dispatch call: I get a black screen and then Unity goes fully white. I am forced to use a for loop in a C# script that calls Dispatch 4 times (so I get a dispatch of 16,8,2 per iteration), changing only the offset added to id.z each time to get the next chunk of the texture. I also tried a for loop inside the compute shader, so that with a single 16,8,2 dispatch each thread runs 3 extra times, but I get the same issue. It seems that if I put too much work into each thread, it crashes, and I don't understand why. By the way, the GPU is a 2060, in case I am using more threads than the device allows.
That's definitely not the source of the issue. I would need to see the code to say anything more.
float worley1(StructuredBuffer<float3> points, float3 samplePos)
{
    float minDist = 1000000.0;
    for (int i = 0; i < 216000; i++)
    {
        if (points[i].x > 1.0)
            continue;
        float3 posWorld = float3(samplePos.x * (-90.0), samplePos.y * 40.0, samplePos.z * (-60.0));
        float3 diff = posWorld - points[i];
        float dist = sqrt(dot(diff, diff));
        minDist = min(dist, minDist);
    }
    //return minDist;
    return min(abs(minDist), 2);
}
[numthreads(numThreads, numThreads, numThreads)]
void CSWorley(uint3 id : SV_DispatchThreadID)
{
    //float3 index = (float3)(id.x, id.y, id.z * (i+1));
    float3 pos = float3(id.x / (float)resolution.x, id.y / (float)resolution.y, (id.z + offset) / (float)resolution.z);
    float noiseSum = 0;
    noiseSum += worley1(points1, pos);
    noiseSum += worley1(points2, pos);
    noiseSum += worley1(points3, pos);
    noiseSum += worley1(points4, pos);
    noiseSum += worley1(points5, pos);
    noiseSum += worley1(points6, pos);
    noiseSum += worley1(points7, pos);
    noiseSum += worley1(points8, pos);
    noiseSum += worley1(points9, pos);
    noiseSum += worley1(points10, pos);
    noiseSum += worley1(points11, pos);
    noiseSum += worley1(points12, pos);
    noiseSum += worley1(points13, pos);
    noiseSum += worley1(points14, pos);
    noiseSum += worley1(points15, pos);
    noiseSum += worley1(points16, pos);
    float maxVal = 32.0;
    noiseSum /= maxVal;
    if (invertNoise) {
        noiseSum = 1 - noiseSum;
    }
    // keep track of min max (using int to support atomic operation)
    int val = (int)(noiseSum * minMaxAccuracy);
    InterlockedMin(minMax[0], val);
    InterlockedMax(minMax[1], val);
    // Store result in specified channel of texture
    Result[uint3(id.x, id.y, id.z + offset)] = Result[uint3(id.x, id.y, id.z + offset)] * (1 - channelMask) + noiseSum * channelMask;
}
The points# buffers store the positions of points in 3D space, which are used to calculate the distance (Worley noise).
You can also ignore the min and max. I commented them out and it didn't fix the problem. They are used by another kernel to normalize the values, but I don't dispatch that kernel, so don't worry about it.
for (int j = 0; j < 8; j++)
{
    noiseCompute.SetInt("offset", 8 * j);
    noiseCompute.Dispatch(0, 16, 8, 1); // I wanted one dispatch of 16,8,8 but it crashes
    minMaxBuffer.GetData(minMax);
}
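For reference, the group counts and the `8 * j` offsets used in the loop above can be derived from the texture resolution and thread group size. The following is a minimal sketch in plain C#; `DispatchPlan`, `GroupCount`, and `SliceOffsets` are hypothetical names introduced here for illustration, not part of the original script or of the Unity API.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not from the original script.
public static class DispatchPlan
{
    // Ceiling division: thread groups needed to cover `size` voxels
    // when each group covers `threads` voxels along that axis.
    public static int GroupCount(int size, int threads)
        => (size + threads - 1) / threads;

    // Voxel offsets along Z when dispatching `groupsPerDispatch`
    // layers of thread groups per Dispatch call.
    public static List<int> SliceOffsets(int sizeZ, int threadsZ, int groupsPerDispatch)
    {
        var offsets = new List<int>();
        int totalGroups = GroupCount(sizeZ, threadsZ);
        for (int g = 0; g < totalGroups; g += groupsPerDispatch)
            offsets.Add(g * threadsZ);
        return offsets;
    }
}
```

For a 128,64,64 texture with 8,8,8 threads per group, `GroupCount` gives 16, 8, and 8 groups, and `SliceOffsets(64, 8, 1)` gives the eight offsets 0, 8, ..., 56 that the loop passes to `SetInt("offset", ...)`.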
Um. Holy hell, no wonder it crashes the driver, that's way too much work. You need some kind of acceleration structure. You can't just brute-force check every pair of points, running 216000*16 loop iterations on each thread; you're losing all the benefits of parallelism there. Check Sebastian Lague's tutorial on rendering clouds, he mentions generating Worley noise in parallel there.
Instead of trying to write your own Worley noise version (unless you're specifically learning how to write noise on the GPU), I would recommend using an existing one. There are quite a few out there in GLSL and some in HLSL too, and it's pretty easy to port them to HLSL anyway. Those will most likely perform quite well.
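One common acceleration structure for this kind of nearest-point query is a uniform grid: points are bucketed by cell, and a query only inspects nearby cells instead of every point. Below is a minimal CPU-side sketch in plain C# (using `System.Numerics.Vector3` rather than Unity's `Vector3` so it runs standalone). The class `PointGrid` and the cell size are assumptions made for illustration; this is not the code from the tutorial mentioned above, and a GPU version would need the buckets flattened into buffers.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

// Hypothetical helper, not from the thread: buckets points into a uniform grid.
public sealed class PointGrid
{
    private readonly Dictionary<(int, int, int), List<Vector3>> cells =
        new Dictionary<(int, int, int), List<Vector3>>();
    private readonly float cellSize;
    private int pointCount;

    public PointGrid(IEnumerable<Vector3> points, float cellSize)
    {
        this.cellSize = cellSize;
        foreach (Vector3 p in points)
        {
            var key = CellOf(p);
            if (!cells.TryGetValue(key, out List<Vector3> bucket))
            {
                bucket = new List<Vector3>();
                cells[key] = bucket;
            }
            bucket.Add(p);
            pointCount++;
        }
    }

    // Integer cell coordinates of a position.
    private (int, int, int) CellOf(Vector3 p) =>
        ((int)MathF.Floor(p.X / cellSize),
         (int)MathF.Floor(p.Y / cellSize),
         (int)MathF.Floor(p.Z / cellSize));

    // Distance from q to the nearest stored point, searching outward in
    // cubic shells of cells around the cell that contains q.
    public float NearestDistance(Vector3 q)
    {
        if (pointCount == 0) return float.PositiveInfinity;
        var (cx, cy, cz) = CellOf(q);
        float best = float.PositiveInfinity;
        int visited = 0;
        for (int r = 0; visited < pointCount; r++)
        {
            for (int dx = -r; dx <= r; dx++)
            for (int dy = -r; dy <= r; dy++)
            for (int dz = -r; dz <= r; dz++)
            {
                // Visit only the outer shell; inner cells were handled by smaller r.
                if (Math.Max(Math.Abs(dx), Math.Max(Math.Abs(dy), Math.Abs(dz))) != r)
                    continue;
                if (!cells.TryGetValue((cx + dx, cy + dy, cz + dz), out List<Vector3> bucket))
                    continue;
                foreach (Vector3 p in bucket)
                {
                    best = MathF.Min(best, Vector3.Distance(p, q));
                    visited++;
                }
            }
            // Any point in a cell beyond shell r is at least r * cellSize away,
            // so the search can stop once the best distance is within that bound.
            if (best <= r * cellSize) break;
        }
        return best;
    }
}
```

The search does not assume a fixed number of points per cell, so it also works when the points come from data rather than from one random point per cell. How much it saves depends on the cell size relative to the point density.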
I have already seen his implementation; he uses the neighboring cells. However, he has a fixed cell size and knows beforehand how to find the neighbors with a simple subtraction and addition on the index of the current voxel. In my case I don't have any fixed size. I can maybe organize the points to accelerate it. I just did not know that you can't give the GPU too much work. I thought you could put on any load you want and would just wait longer. It also doesn't explain why it works with 8*Dispatch(16,8,1) and not Dispatch(16,8,8).
It will wait longer, but only up to a certain point. On Windows there is something called Timeout Detection and Recovery (TDR), which basically means that if something takes too long to execute on the GPU (over 2 seconds by default, I think), the system treats it as if the GPU were stuck in an infinite loop and resets the driver to recover. Each Dispatch call is timed on its own, which is why eight smaller dispatches can succeed where a single large one fails.
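To put numbers on this, here is a rough count of inner-loop iterations using the figures from the thread (16 point buffers, 216000 points each, 8x8x8 = 512 threads per group). `WorkEstimate` is a hypothetical helper in plain C#; it only counts loop iterations and says nothing about how long they take on a particular GPU.

```csharp
// Hypothetical helper, not from the thread: counts inner-loop iterations only.
public static class WorkEstimate
{
    // Iterations one thread runs: one full pass over every point buffer.
    public static long PerThread(long bufferCount, long pointsPerBuffer)
        => bufferCount * pointsPerBuffer;

    // Iterations issued by one Dispatch(groupsX, groupsY, groupsZ) call.
    public static long PerDispatch(long groupsX, long groupsY, long groupsZ,
                                   long threadsPerGroup, long perThread)
        => groupsX * groupsY * groupsZ * threadsPerGroup * perThread;
}
```

With these inputs each thread runs 3,456,000 iterations. A single Dispatch(16,8,8) issues 1,811,939,328,000 iterations in one go, while each Dispatch(16,8,1) issues 226,492,416,000, one eighth of that, so each smaller call has a much better chance of finishing before the timeout.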
OK, I didn't know that. Is there any way to bypass it, or should I not even bother changing it?
I am now working on how to distribute the points for the Worley noise. They are not totally chaotic or random; they are based on some data, so I can't say for sure yet, but the compute shader might eventually have to do much less work than the 16x216000 iterations.