I am experimenting with compute shaders and having a bit of trouble setting this up.
Quick requirements:
- This does not need to run within frame time.
- The shader is supposed to run lots of times, between 7,000 and 15,000.
- Each dispatch of the shader takes around 0.1 ms.
- GetData takes 13 ms, which balloons the total time immensely, since the calls are currently run sequentially (this is probably mistake number one).
The compute shader takes around 1 MB of input as a 3D texture (or 3D array) and outputs two arrays of points with around 3 MB to 4 MB.
Right now my approach is something like the following:
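public void Test()
{
    var compute = ....
    // 1,000,000 float3 elements per buffer, about 12 MB each (3 floats x 4 bytes)
    var bufferA = new ComputeBuffer(1000000, ComputeHelper.GetStride<float3>(), ComputeBufferType.Structured);
    var bufferB = new ComputeBuffer(1000000, ComputeHelper.GetStride<float3>(), ComputeBufferType.Structured);
    // Allocate the CPU-side arrays once instead of on every iteration
    var testDataA = new float3[1000000];
    var testDataB = new float3[1000000];
    for (int i = 0; i < 15000; i++) {
        RenderTexture input = GetInput(i); // binding of input to the shader not shown
        compute.SetBuffer(0, "outputa", bufferA);
        compute.SetBuffer(0, "outputb", bufferB);
        compute.DispatchThreads(Dim.X, Dim.Y, Dim.Z);
        // Each GetData call blocks until the GPU has finished and the data is copied back
        bufferA.GetData(testDataA);
        bufferB.GetData(testDataB);
        Process(testDataA, testDataB);
    }
    // Free the GPU buffers once all runs are done
    bufferA.Release();
    bufferB.Release();
}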
This is screaming at me that I am doing something wrong.
According to my tests, the bulk of the time is being spent on the GetData calls (15 ms to 23 ms per run). Everything else finishes in under 1 ms!
My question is, what is the right approach here? Can I dispatch multiple compute shaders in parallel and then somehow wait for the GetData of all of them in one go? Do I need multiple instances of the same compute shader to be able to do this?
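To make the idea concrete, here is a minimal, untested sketch of what I mean by waiting for several runs in one go. It assumes `AsyncGPUReadback.Request` and `AsyncGPUReadback.WaitAllRequests` from `UnityEngine.Rendering`, plus the plain `ComputeShader.Dispatch` call. `BatchSize`, the class name, the `groups` parameter and the `Process` stub are made up for illustration, and the input binding is left out. I do not know whether this is actually faster.

```csharp
using Unity.Collections;
using Unity.Mathematics;
using UnityEngine;
using UnityEngine.Rendering;

public static class BatchedReadbackSketch
{
    const int BatchSize = 8;              // hypothetical value, limited by GPU memory
    const int Count = 1000000;            // elements per output buffer, as in my test
    const int Stride = sizeof(float) * 3; // bytes per float3

    public static void Run(ComputeShader compute, int totalRuns, Vector3Int groups)
    {
        // One pair of output buffers per in-flight run, so runs do not overwrite each other
        var buffersA = new ComputeBuffer[BatchSize];
        var buffersB = new ComputeBuffer[BatchSize];
        for (int j = 0; j < BatchSize; j++)
        {
            buffersA[j] = new ComputeBuffer(Count, Stride);
            buffersB[j] = new ComputeBuffer(Count, Stride);
        }
        var requestsA = new AsyncGPUReadbackRequest[BatchSize];
        var requestsB = new AsyncGPUReadbackRequest[BatchSize];

        for (int start = 0; start < totalRuns; start += BatchSize)
        {
            int n = Mathf.Min(BatchSize, totalRuns - start);
            for (int j = 0; j < n; j++)
            {
                // Binding of the input for run (start + j) is omitted here
                compute.SetBuffer(0, "outputa", buffersA[j]);
                compute.SetBuffer(0, "outputb", buffersB[j]);
                compute.Dispatch(0, groups.x, groups.y, groups.z);
                // Queue the readbacks without blocking the CPU
                requestsA[j] = AsyncGPUReadback.Request(buffersA[j]);
                requestsB[j] = AsyncGPUReadback.Request(buffersB[j]);
            }

            // Block once per batch instead of once per GetData call
            AsyncGPUReadback.WaitAllRequests();

            for (int j = 0; j < n; j++)
            {
                if (requestsA[j].hasError || requestsB[j].hasError) continue;
                NativeArray<float3> a = requestsA[j].GetData<float3>();
                NativeArray<float3> b = requestsB[j].GetData<float3>();
                Process(a, b);
            }
        }

        for (int j = 0; j < BatchSize; j++)
        {
            buffersA[j].Release();
            buffersB[j].Release();
        }
    }

    // Hypothetical stand-in for my real CPU-side processing
    static void Process(NativeArray<float3> a, NativeArray<float3> b) { }
}
```

The trade-off in this sketch is memory: every in-flight run needs its own pair of output buffers, so the batch size cannot grow arbitrarily.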
Is the RequestReadAsync callback of any use in this case? I was reading a post that says you can only do 3-4 requests per frame (even though this is supposed to run in the editor most of the time).
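For reference, this is roughly how I understand the callback-based readback would look, as an untested sketch. It assumes `AsyncGPUReadback.Request` from `UnityEngine.Rendering` is the API in question. The class name, method name and `onData` parameter are hypothetical.

```csharp
using System;
using Unity.Collections;
using Unity.Mathematics;
using UnityEngine;
using UnityEngine.Rendering;

public static class CallbackReadbackSketch
{
    // Queues a non-blocking readback of one buffer and hands a managed copy to onData
    public static void ReadBack(ComputeBuffer buffer, Action<float3[]> onData)
    {
        AsyncGPUReadback.Request(buffer, request =>
        {
            if (request.hasError)
            {
                Debug.LogError("GPU readback failed");
                return;
            }
            // The native array is only valid for a short time after completion,
            // so copy it before handing it on
            NativeArray<float3> data = request.GetData<float3>();
            onData(data.ToArray());
        });
    }
}
```

What I cannot tell from this is when the callback fires while running in the editor outside play mode, or whether the per-frame limit mentioned in that post applies.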
I can't seem to find any good examples of doing something like this and I am quite new, so any pointers would be great.
The compute shader is not that expensive to run and is working on very minimal sets of data, but I do need this data on the CPU side for a bit.