Noob Compute Shader questions... GetData, and multiple calls to same compute.

I am trying to mess around with compute shaders, and am having a bit of trouble setting this up.

Quick requirements:

  • This is not something that needs to run in frame time.
  • The shader is supposed to be run lots of times, between 7,000 and 15,000.
  • Each call to the shader takes around 0.1 ms.
  • GetData takes 13 ms → this balloons the time immensely, since the calls are currently run sequentially (this is probably mistake number one).

The compute takes around 1 MB of input as a 3D texture (or 3D array) and outputs two arrays of points totalling around 3-4 MB.

Right now my way has been something like the following:

    public void Test()
    {
        var compute = ....
        var bufferA = new ComputeBuffer(1000000, ComputeHelper.GetStride<float3>(), ComputeBufferType.Structured);
        var bufferB = new ComputeBuffer(1000000, ComputeHelper.GetStride<float3>(), ComputeBufferType.Structured);
        for (int i = 0; i < 15000; i++) {
            RenderTexture input = GetInput(i); // input texture for this run
            compute.SetBuffer(0, "outputa", bufferA);
            compute.SetBuffer(0, "outputb", bufferB);
            compute.DispatchThreads(Dim.X, Dim.Y, Dim.Z);
            var testDataA = new float3[1000000];
            var testDataB = new float3[1000000];
            bufferA.GetData(testDataA); // blocks until the GPU has finished and the copy is done
            bufferB.GetData(testDataB);
            Process(testDataA, testDataB);
        }
        bufferA.Release();
        bufferB.Release();
    }

This is screaming at me that I am doing something wrong.

According to my tests, the bulk of the time is being spent on the GetData calls (15-23 ms per run). All the rest finishes in under 1 ms!

My question is, what is the right approach to do this? Can I dispatch multiple compute shaders in parallel and then somehow wait for the GetData for all of them in one go? Do I need multiple instances of the same compute shader to be able to do this?

Is the RequestReadAsync callback of any use in this case? I was reading a post that says you can only do 3-4 requests per frame (although this is supposed to be run in the editor most of the time).

I can't seem to find any good examples of doing something like this and I am quite new, so any pointers would be great.

The compute shader is not that expensive to run and is working on very minimal sets of data, but I do need this data on the CPU side for a bit.

Is there a way you could move Process into the compute shader as well and only read back the end result?
Even if you optimize the computation (including GetData) down to 1 ms per run, which seems unlikely to me given the array sizes and the GPU-to-CPU transfer cost (I am not an expert), you are still calling it 15k times.
What are you trying to do with the whole process? Maybe there is a less computationally heavy way of doing it.
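
To put rough numbers on that, using only the figures from the question (15,000 runs at the upper end, about 0.1 ms per dispatch, 13 ms per GetData) plus the optimistic 1 ms per run mentioned above:

```latex
\begin{align*}
15000 \times 0.1\ \text{ms} &= 1.5\ \text{s}  && \text{(dispatch only)} \\
15000 \times 1\ \text{ms}   &= 15\ \text{s}   && \text{(optimistic 1 ms per run)} \\
15000 \times 13\ \text{ms}  &= 195\ \text{s}  && \text{(blocking GetData on every run)}
\end{align*}
```

So almost all of the total time is the readback, not the shader itself.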

Rule #1: avoid bringing data from GPU memory back to CPU, and if you absolutely need to, don’t make your CPU wait for the data. Think of the GPU as a large office full of very efficient workers, but as soon as you need to bring something back from the office they send a crippled old guy that takes ages to deliver it.

In your case, the best approach is probably to move Process() to the GPU as well. This way, your data never ever leaves the GPU during processing.
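
If some of the data really does have to come back to the CPU, here is a rough sketch of a non-blocking readback using Unity's AsyncGPUReadback.Request. This is an untested sketch under several assumptions, not a drop-in solution: the class name, the groups field, the MaxInFlight cap of 4, and Process are all made up for illustration; only one of the two output buffers is shown; binding the input texture is left out; and it assumes Play mode, since it is driven by a coroutine.

```csharp
using System.Collections;
using System.Collections.Generic;
using Unity.Collections;
using UnityEngine;
using UnityEngine.Rendering;

// Sketch only: class name, field names, and Process() are placeholders.
public class AsyncReadbackSketch : MonoBehaviour
{
    public ComputeShader compute;    // assumed to be assigned in the Inspector
    public Vector3Int groups;        // assumed thread-group counts for Dispatch
    const int PointCount = 1000000;  // buffer size taken from the question
    const int MaxInFlight = 4;       // assumed cap on outstanding readbacks

    // Buffers that are not currently waiting on a readback.
    readonly Stack<ComputeBuffer> free = new Stack<ComputeBuffer>();

    IEnumerator Start()
    {
        for (int n = 0; n < MaxInFlight; n++)
            free.Push(new ComputeBuffer(PointCount, sizeof(float) * 3));

        for (int i = 0; i < 15000; i++)
        {
            // Yield frames instead of blocking while every buffer is in flight.
            while (free.Count == 0) yield return null;

            ComputeBuffer buffer = free.Pop();
            compute.SetBuffer(0, "outputa", buffer);
            compute.Dispatch(0, groups.x, groups.y, groups.z);

            // Queue the copy; the callback fires later, once the data has arrived.
            AsyncGPUReadback.Request(buffer, request =>
            {
                if (!request.hasError)
                    Process(request.GetData<Vector3>());
                free.Push(buffer); // this buffer can be reused now
            });
        }

        // Wait for the last readbacks to come back, then release the buffers.
        while (free.Count < MaxInFlight) yield return null;
        while (free.Count > 0) free.Pop().Release();
    }

    void Process(NativeArray<Vector3> points)
    {
        // Placeholder. The array is only valid for a short time after the
        // request completes, so copy out anything that must be kept.
    }
}
```

The idea is that the CPU keeps queuing dispatches while earlier results are still in transit, instead of stalling on every GetData. The total amount of data crossing the bus is unchanged, so moving Process to the GPU is still the bigger win if it is possible.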

Why use a compute shader then? Can’t you do this on the CPU (using jobs for multithreading, for instance)? GPUs are designed to operate on very large amounts of data.