Compute shader stops working with large ComputeBuffer

Hi everyone

I am doing some work involving procedural texture generation. The first part of my application runs on the CPU in parallel and writes data to multiple 2D arrays of a custom Pixel struct. This is already very performant and I want to keep it on the CPU as is.

Now I simply need to combine all of the data from the separate threads to create a single image. I am working with very large images (roughly 10000x10000 pixels). Combining the pixel arrays and writing to a Texture2D on the CPU is very slow, so I am trying to speed this part up with a compute shader. I have got something basic working with low-res images, but anything above 1024x1024 pixels just returns a black texture. I have a GTX 980 Ti (6 GB VRAM); surely it can do better than this?
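For context, the CPU path I am replacing essentially builds a Color array and uploads it with SetPixels/Apply, along these lines (a simplified sketch, not my exact code):

        // Hypothetical sketch of the slow CPU path
        Texture2D tex2D = new Texture2D(resolution, resolution);
        Color[] cols = new Color[resolution * resolution];
        // ... combine the per-thread Pixel arrays into cols here ...
        tex2D.SetPixels(cols);
        tex2D.Apply();   // re-uploads the entire texture to the GPU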

Please see the relevant code below. It's a bit messy and I've been playing around with it, but what I've got at the moment simply draws a white pixel where there is data. Any help will be much appreciated.

/// Draw to texture ///
        Stopwatch textureStopwatch = new Stopwatch();
        textureStopwatch.Start();

        int computeKernel = computeShader.FindKernel("CSMain");

        // Flatten the 2D pixel array into a 1D array of frequency values
        int[] tmpPixels = new int[renderJobsOutput[0].pixels.GetLength(0) * renderJobsOutput[0].pixels.GetLength(1)];
        for (int x = 0; x < resolution; x++) {
            for (int y = 0; y < resolution; y++) {
                tmpPixels[(y * resolution) + x] = renderJobsOutput[0].pixels[x, y].frequency;
            }
        }

        // Upload the frequencies (stride 4 bytes = one int per element)
        ComputeBuffer buffer = new ComputeBuffer(tmpPixels.Length, 4);
        buffer.SetData(tmpPixels);
        computeShader.SetBuffer(computeKernel, "dataBuffer", buffer);

        computeShader.SetTexture(computeKernel, "tex", outputTexture);
        // One 16x16 thread group per 16x16 pixel tile (assumes resolution % 16 == 0)
        computeShader.Dispatch(computeKernel, resolution / 16, resolution / 16, 1);

        buffer.Dispose();
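For context, outputTexture is a RenderTexture created with random write enabled, so the compute shader can write to it through the RWTexture2D binding:

        outputTexture = new RenderTexture(resolution, resolution, 24);
        outputTexture.enableRandomWrite = true;   // required for RWTexture2D writes
        outputTexture.Create();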
public struct Pixel {
        public int frequency;
        public Color col;

        public Pixel(int _frequency, Color _col) {
            frequency = _frequency;
            col = _col;
        }
    }
#pragma kernel CSMain

RWTexture2D<float4> tex;

struct GPUPixel {
    int frequency;
    float3 col;
};

StructuredBuffer<int> dataBuffer;

[numthreads(16,16,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    // NOTE: row stride is hard-coded; it must match the resolution in the C# script
    int bufferID = id.x + id.y * 1024;
    float val = ((float)dataBuffer[bufferID]) / 1;
    tex[id.xy] = float4(val, val, val, 1);
}

Hi Brandon

Just to be sure, you did change that constant 1024 in the compute shader when testing with bigger buffers?
I'm sure you did, just double-checking, as I can't see any obvious problems in the code.
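If you want to rule that out completely, you could upload the resolution from C# instead of hard-coding it. A quick sketch (the "resolution" property name is just an example):

// C# side, before the Dispatch call:
computeShader.SetInt("resolution", resolution);

// HLSL side, in place of the literal 1024:
int resolution;

[numthreads(16,16,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    int bufferID = id.x + id.y * resolution;
    float val = (float)dataBuffer[bufferID];
    tex[id.xy] = float4(val, val, val, 1);
}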

Have you tried taking a RenderDoc capture and
- checking what the compute buffer contents are there?
- checking what gets written to the output texture?

Hi Aleksi

Thanks for your help. I did change the resolution in the shader (Line 15) as well.

This is actually the first time I have heard of RenderDoc, but I will definitely try it out!

After a lot of research, I found that apparently the maximum array size you can send to the GPU is 1023 elements, and that it is also limited to 64 KB of data. I'm not sure whether this is correct; perhaps you know more about the limits of buffer sizes? I also thought that compute buffers were different and could handle more.

I spent a lot of time on this and eventually just gave up on the compute shader approach. Maybe it would have been better, but I am not good enough with shaders to do this at the moment. I am now using two separate render textures, Graphics.Blit and a fragment shader. I am uploading data to the GPU the same way, using a compute buffer (strangely, it works with larger buffer sizes here). However, I am now doing multiple passes and only writing a portion of the pixels at a time.

Below is my new code (which is working, but I am still making some improvements). Hopefully it'll help someone else out in the future.

/// Draw to texture ///
        Stopwatch textureStopwatch = new Stopwatch();
        textureStopwatch.Start();

        if (computeBuffer != null) {
            computeBuffer.Release();
        }


        renderMat.SetInt("imgWidth", resolution);
        renderMat.SetInt("imgHeight", resolution);
        renderMat.SetVector("backgroundCol", new Vector4(backgroundColour.r, backgroundColour.g, backgroundColour.b, 1));

        // Two render textures, ping-ponged between passes (a Blit cannot
        // read from and write to the same texture)
        RenderTexture renderTexA = new RenderTexture(resolution, resolution, 24);
        RenderTexture renderTexB = new RenderTexture(resolution, resolution, 24);
        renderTexA.enableRandomWrite = true;
        renderTexB.enableRandomWrite = true;
        renderTexA.autoGenerateMips = false;
        renderTexB.autoGenerateMips = false;
        renderTexA.useMipMap = true;
        renderTexB.useMipMap = true;
        renderMat.SetTexture("_MainTex", renderTexA);


        // resolution * resolution elements is enough; the indices used below never exceed that
        computeBuffer = new ComputeBuffer(resolution * resolution, 20);
        Pixel[] linearPixel = new Pixel[resolution * resolution];
        for (int x = 0; x < resolution; x++) {
            for (int y = 0; y < resolution; y++) {
                int f = 0;
                float r = 0;
                float g = 0;
                float b = 0;
                for (int i = 0; i < THREAD_COUNT; i++) {
                    try {
                        Pixel cPixel = renderJobsOutput[i].pixels[x, y];
                        f += cPixel.frequency;
                        r += cPixel.col.r;
                        g += cPixel.col.g;
                        b += cPixel.col.b;
                    } catch {
                        // if a job's output is missing, skip its contribution
                    }
                }
                // Average the colour contributions across all threads
                r /= THREAD_COUNT;
                g /= THREAD_COUNT;
                b /= THREAD_COUNT;
                linearPixel[x * resolution + y] = new Pixel(f, new Color(r, g, b));
            }
        }

        RenderTexture activeRenderTexture = renderTexA;
        RenderTexture otherRenderTex;



        int copyIndex = 0;
        // Pixels uploaded per pass: 4194304 = 2048 x 2048.
        // Update the matching constant in the shader as well.
        int transferSize = 4194304;
        if (transferSize > resolution * resolution)
            transferSize = resolution * resolution;
        Pixel[] subArray = new Pixel[transferSize];
        while (copyIndex < resolution * resolution) {
            int cLength = transferSize;
            if (copyIndex + cLength > resolution * resolution) {
                // clamp the final chunk to the pixels that remain
                cLength = (resolution * resolution) - copyIndex;
            }

            for (int i = 0; i < cLength; i++) {
                subArray[i] = linearPixel[copyIndex + i];
            }


            computeBuffer.SetData(subArray);
            renderMat.SetBuffer("pixels", computeBuffer);

            renderMat.SetInt("startIndex", copyIndex);

            // Ping-pong: read from the texture written last pass, write to the other
            otherRenderTex = (activeRenderTexture == renderTexA) ? renderTexB : renderTexA;

            Graphics.Blit(activeRenderTexture, otherRenderTex, renderMat);
            copyIndex += transferSize;

            renderMat.SetTexture("_MainTex", otherRenderTex);
            activeRenderTexture = otherRenderTex;

            outputImage.texture = activeRenderTexture;

            yield return null;
        }
        textureStopwatch.Stop();
        ///
Shader "Custom/RenderToTexture"
{
    Properties
    {
        _MainTex("InputTex", 2D) = "white" {}
    }
    SubShader
    {
        Pass
        {
            CGPROGRAM
            #pragma target 3.5

            #pragma vertex VSMain
            #pragma fragment PSMain

            struct GPUPixel {
                int frequency;
                float4 col;
            };

            sampler2D _MainTex;

            StructuredBuffer<GPUPixel> pixels;
            int imgWidth;
            int imgHeight;
            float4 backgroundCol;

            int startIndex;

            void VSMain(inout float4 vertex:POSITION,inout float2 uv : TEXCOORD0)
            {
                vertex = UnityObjectToClipPos(vertex);
            }

            float4 PSMain(float4 vertex:POSITION,float2 uv : TEXCOORD0) : SV_TARGET
            {
                int x = int(floor(uv.x*imgWidth));
                int y = int(floor(uv.y*imgHeight));
                int index = x * imgWidth + y;
                // 4194304 must match transferSize in the C# script
                if (index >= startIndex && index < startIndex + 4194304) {
                    float4 colOutput = pixels[x*imgWidth + y - startIndex].col;
                    float frequency = pixels[x*imgWidth + y - startIndex].frequency;
                    colOutput = lerp(backgroundCol, colOutput, clamp(frequency / 1, 0, 1));
                    return colOutput;
                } else {
                    return tex2Dlod(_MainTex, float4(uv.xy, 0, 0));
                }
            }
            ENDCG
        }
    }
}

Array size limit is not the same as the Structured Buffer limit (AFAIK). You can have millions of data items in a Structured Buffer.

I don't have time to read your code, but there's probably something sideways if you're getting all black. Some size or dispatch count doesn't match.
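As a sanity check on both points: a structured buffer with millions of elements is fine, but the Dispatch has to cover every 16x16 tile of the texture. A rough sketch, assuming the same kernel and variable names as in the code above:

// Millions of elements is fine for a structured buffer (stride 4 = one int)
ComputeBuffer buffer = new ComputeBuffer(4096 * 4096, 4);

// One 16x16 thread group per tile; round up so a partial edge tile
// is still covered when resolution is not a multiple of 16
int groups = Mathf.CeilToInt(resolution / 16f);
computeShader.Dispatch(computeKernel, groups, groups, 1);

buffer.Release();

(The shader then also needs a bounds check on id.xy so the extra threads don't write outside the texture.)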

Thanks Olmi. Good to know that it is at least possible with a compute shader.

Like I said, I'm quite new to compute shaders. I really have tried doing a lot of research but just can't work it out. I would really appreciate it if someone could point out where I went wrong.

@BrandonK I can write a quick example about similar case a bit later after I get back to my workstation.

I tested quickly that this works.

I'm not sure how you use the textures and/or what formats you use,
but this one has an integer structured buffer and it works correctly.
I made some modifications to the values to actually visualize them on the screen.
I tested buffer sizes up to 8192x8192; it seems to work correctly.

#pragma kernel CSMain
RWTexture2D<float4> tex;
struct GPUPixel {
    int frequency;
    float3 col;
};
StructuredBuffer<int> dataBuffer;
[numthreads(16,16,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    int bufferID = id.x + id.y * 4096;  // row stride = resolution in the C# script
    int intVal = dataBuffer[bufferID];
    // Unpack the coordinates packed on the C# side (x in the high 16 bits,
    // y in the low 16 bits) and visualize them as a red/green gradient
    float x = (float)(intVal >> 16);
    float y = (float)(intVal & 0xffff);
    tex[id.xy] = float4(x  / 4096.0f, y / 4096.0f, 0.0f, 1);
}
using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using System.Diagnostics;
using UnityEngine.Experimental.Rendering;

public class TestScript : MonoBehaviour
{
    public ComputeShader computeShader;
    public RenderTexture outputTexture;
    private RenderTexture renderTexture;

    public struct Pixel {
        public int frequency;
        public Color col;
        public Pixel(int _frequency, Color _col) {
            frequency = _frequency;
            col = _col;
        }
    }

    public const int resolution = 4096;
    public Pixel[,] pixels = new Pixel[resolution, resolution];

    // Start is called before the first frame update
    void Start()
    {
        renderTexture = new RenderTexture(resolution, resolution, 1, GraphicsFormat.R8G8B8A8_UNorm);
        renderTexture.enableRandomWrite = true;
        renderTexture.Create();
    }



    // Update is called once per frame
    void Update()
    {
        /// Draw to texture ///
        Stopwatch textureStopwatch = new Stopwatch();
        textureStopwatch.Start();
        int computeKernel = computeShader.FindKernel("CSMain");
        int[] tmpPixels = new int[pixels.GetLength(0) * pixels.GetLength(1)];
        for (int x = 0; x < resolution; x++) {
            for (int y = 0; y < resolution; y++)
            {
                // Pack x into the high 16 bits and y into the low 16 bits
                tmpPixels[(y * resolution) + x] = (x << 16) | y; //pixels[x, y].frequency;
            }
        }
        ComputeBuffer buffer = new ComputeBuffer(tmpPixels.Length, 4);
        buffer.SetData(tmpPixels);
        computeShader.SetBuffer(computeKernel, "dataBuffer", buffer);
        computeShader.SetTexture(computeKernel, "tex", renderTexture);
        computeShader.Dispatch(computeKernel, resolution / 16, resolution / 16, 1);
        buffer.Dispose();

        Graphics.Blit(renderTexture, outputTexture);
    }
}

[Attachment: compute_repro.jpg, screenshot of the test output]

Thank you @AleksiUnity so much for your efforts.

Very clever how you pass the xy coordinates to the compute shader! I have tried your code and it works fine on my computer as well. The problems seem to start when I actually sample from the 2D array. Perhaps it is an issue with how my data type is passed to the GPU: Pixel (C#) versus GPUPixel (HLSL)?
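One way to sanity-check that is to compare the managed struct size against the buffer stride. A quick sketch (Marshal.SizeOf is the relevant helper; the expected result is 20 bytes, matching int + float4 on the HLSL side):

using System.Runtime.InteropServices;

// Pixel = int frequency (4 bytes) + Color col (4 floats = 16 bytes) = 20 bytes.
// This must match both the ComputeBuffer stride and the HLSL GPUPixel layout.
int stride = Marshal.SizeOf(typeof(Pixel));
UnityEngine.Debug.Log("Pixel stride: " + stride + " bytes");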

I have spent the whole morning trying to fix this and still can't get it working. Once again it works with all resolutions up to and including 1024; anything above that and I just get a black image. Extremely frustrating. I am starting to think that perhaps it is a system issue. I have even tried changing registry settings at Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers to increase TdrDelay.

I just can't understand what changes between 1024x1024 and 2048x2048 that causes this issue. It still draws fine at 2048x2048 with the CPU code, so it must be an issue with the shader or the buffer.

Unfortunately, I cannot share my entire project, but I will share the relevant code again. renderJobsOutput[0].pixels is a 2D array of type Pixel. I can guarantee that it is correctly populated with data for all resolutions; the frequency variable creates something like a heightmap. I have attached a screenshot of a mask of the frequency variable.

/// Draw to texture ///
        Stopwatch textureStopwatch = new Stopwatch();
        textureStopwatch.Start();

        // Create new rendertexture if resolution is different
        if (resolution != lastRenderTextureResolution) {
            lastRenderTextureResolution = resolution;
            outputTexture = new RenderTexture(resolution, resolution, 24);
            outputTexture.enableRandomWrite = true;
            outputTexture.Create();
        }

        int computeKernel = computeShader.FindKernel("CSMain");

        // Pixel data: stride 20 = int frequency (4 bytes) + Color (4 floats, 16 bytes),
        // matching the HLSL GPUPixel layout (int + float4)
        ComputeBuffer dataBuffer = new ComputeBuffer(resolution * resolution, 20);
        dataBuffer.SetData(renderJobsOutput[0].pixels);
        computeShader.SetBuffer(computeKernel, "dataBuffer", dataBuffer);

        computeShader.SetTexture(computeKernel, "tex", outputTexture);
        computeShader.Dispatch(computeKernel, resolution / 16, resolution / 16, 1);

        dataBuffer.Dispose();
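One thing I am unsure about here is the linear order SetData uses for a 2D array: C# multidimensional arrays are stored row-major with the last index contiguous, so pixels[x, y] lands at linear index x * resolution + y, while the shader reads id.x + id.y * 1024. Flattening explicitly, like in my first version, would at least make the layout unambiguous (a sketch):

        // Explicit flatten so element [x, y] ends up at (y * resolution + x),
        // matching the shader's row-major "id.x + id.y * rowStride" indexing
        Pixel[] linear = new Pixel[resolution * resolution];
        for (int x = 0; x < resolution; x++) {
            for (int y = 0; y < resolution; y++) {
                linear[(y * resolution) + x] = renderJobsOutput[0].pixels[x, y];
            }
        }
        dataBuffer.SetData(linear);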
#pragma kernel CSMain

RWTexture2D<float4> tex;

struct GPUPixel {
    int frequency;
    float4 col;
};

StructuredBuffer<GPUPixel> dataBuffer;

[numthreads(16,16,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    // Change number below with resolution in C# script
    int bufferID = id.x + id.y * 1024;
    // Test output with simple mask of frequency
    float val = ((float)dataBuffer[bufferID].frequency) / 1;
    tex[id.xy] = float4(val, val, val, 1);
}

[Attachment: Capture.JPG, screenshot of the frequency mask]

Hi

I would still suggest taking that RenderDoc capture. It might sound intimidating, but it's really easy, and you can then verify all your assumptions about what happens on the GPU side.

https://docs.unity3d.com/Manual/RenderDocIntegration.html

The capture icon has changed to a camera in my version at least, so I suppose we need to update the manual.
If you don't have RenderDoc installed, you can find the latest version here: https://renderdoc.org/

Once you have the capture, you can find your compute shader dispatch near the beginning of the capture.

So check what you have in the structured buffer when you use sizes bigger than 1024,
and check what it writes to the RenderTexture.

This way you can at least pinpoint whether the issue is on the CPU or GPU side, and whether it's corrupt data in the structured buffer or a problem writing to the RenderTexture.

[Attachment: testcapture.png, RenderDoc capture screenshot]