ComputeBuffers working with Metal API !?

Hi, we are trying to get a tool that relies on compute shaders and currently works under Windows to run on Metal as well. The tool is a GPU lightmapper and uses render textures, texture arrays, and compute buffers. RenderTextures seem to work in our tests, but we cannot get ComputeBuffers to work properly.

What is the current support for compute shaders on Metal? Where can we read about limitations, maximum thread counts, etc.?

Are there any examples of compute shaders on Metal?

Here is a simple example which is supposed to initialize a compute buffer with red, after which a C# script uses the data from the buffer to color a texture. The example works on Windows, but not on Metal!?

Tried with 5.6 and 2017 as well!

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class TextComputeScript : MonoBehaviour
{
    public ComputeShader shader;

    public Texture2D tempTexture;

    ComputeBuffer   colorBuffer;
    Vector4[]       outColorBuffer;
    Color[]         colors;
  
    void Start()
    {     
        // SIMPLE EXAMPLE ON USING Compute Buffer

        tempTexture = new Texture2D(512, 512);

        // Buffer is with length of 512 x 512

        int kernel = shader.FindKernel("FillBuffer"); // safer than hard-coding kernel index 0
        int totalElements = 512 * 512;              
        colors          = new Color[totalElements];
        outColorBuffer  = new Vector4[totalElements];

        // Init elements - I read somewhere that initializing elements before calling SetData
        // may be important on Metal (not sure about that)!?
        for (int i = 0; i < totalElements; i++)
        {
            outColorBuffer[i] = new Vector4();
            colors[i] = new Color();
        }

        // Create the compute buffer
        colorBuffer = new ComputeBuffer(totalElements, 16);
        colorBuffer.SetData(outColorBuffer);
        shader.SetBuffer(kernel, "colorBuffer", colorBuffer);
       
        // Dispatch the kernel: 32 threads per group, so totalElements / 32 groups cover the buffer
        shader.Dispatch(kernel, totalElements / 32, 1, 1);

        // Read the data from the buffer back to array
        colorBuffer.GetData(outColorBuffer);

        for (int i = 0; i < totalElements; i++)
        {
            Vector4 c = outColorBuffer[i];
            colors[i] = new Color(c.x, c.y, c.z, c.w);
        }

        // The texture is not red on Mac but it is red on Windows!
        tempTexture.SetPixels(colors);
        tempTexture.Apply();         
    }

    private void OnDisable()
    {
        if (colorBuffer != null)
            colorBuffer.Release();
    }
}

And the simple shader

#pragma kernel FillBuffer

RWStructuredBuffer<float4>        colorBuffer     : register(u0);

[numthreads(32,1,1)]
void FillBuffer (uint3 DTid : SV_DispatchThreadID)
{

    uint elementID = DTid.x;
    colorBuffer[elementID] = float4(1.0, 0.0, 0.0, 1.0);

}

I guess I already replied to your colleague in a private conversation regarding the issues you’re facing. The code above seems to work just fine but the code your colleague was using had a small typo causing it to fail.

We have some info about writing cross-platform compute shaders here: https://docs.unity3d.com/Manual/ComputeShaders.html
More accurate Metal capability info can be found in Apple’s documentation: https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf
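For anyone checking support at runtime before dispatching anything, a minimal sketch (property availability depends on your Unity version):

```csharp
using UnityEngine;

public class ComputeSupportCheck : MonoBehaviour
{
    void Start()
    {
        // Compute shaders are only available on some graphics APIs/devices.
        if (!SystemInfo.supportsComputeShaders)
        {
            Debug.LogWarning("Compute shaders are not supported on this device.");
            return;
        }
        // On a Mac this should report Metal when the Metal renderer is active.
        Debug.Log("Graphics device: " + SystemInfo.graphicsDeviceType);
    }
}
```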


Yep, a silly mistake unfortunately. Thank you for your time!

Just a few things I’ve found out while working with compute shaders on Metal/OSX; maybe some of this will help.

There is a 256-threads-per-thread-group limit, and struct sizes in compute shaders cannot exceed 2048 bytes (remember that a float is 4 bytes, so nested structs for an RWStructuredBuffer can eat that up quickly).

Note that the link above mentions 1024 on OSX, but I get errors with anything over 256 threads on OSX High Sierra on a MacBook Pro, and the error message says the limit is 256.
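One way to avoid hard-coding strides, and to catch a struct that grows past a size limit like that, is to compute the size at runtime. A sketch, assuming a sequential-layout struct (`ParticleData` and `BufferUtil` are just illustrative names):

```csharp
using System.Runtime.InteropServices;
using UnityEngine;

[StructLayout(LayoutKind.Sequential)]
struct ParticleData
{
    public Vector4 position; // 16 bytes
    public Vector4 color;    // 16 bytes
}

public static class BufferUtil
{
    public static ComputeBuffer Create<T>(int count) where T : struct
    {
        // Marshal.SizeOf gives the marshaled size, which for simple
        // sequential structs matches the stride the GPU side expects.
        int stride = Marshal.SizeOf(typeof(T));
        Debug.Assert(stride <= 2048,
            "Struct exceeds the 2048-byte per-element limit mentioned above.");
        return new ComputeBuffer(count, stride);
    }
}
```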

Make sure your shader uses #pragma target 4.5, as 4.6-5.0 will not work; you’ll get an obscure “POSITION0 not found” error in your vertex shaders. I spent several hours trying to figure out why this one wasn’t running; it just needed to be changed from 5.0 to 4.5 in the vertex shader: GitHub - antoinefournier/XParticle: A really simple Unity3D/DirectX 11 particle system using compute shaders to handle millions of particles.
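For reference, the relevant pragmas in the vertex/fragment shader that consumes the compute output would look something like this (4.6-5.0 reportedly fail on Metal, as described above):

```
#pragma target 4.5   // 4.6-5.0 give "POSITION0 not found" on Metal, per the post above
#pragma vertex vert
#pragma fragment frag
```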

When using DrawProcedural, I pass my structs into the vertex shaders like so:

            struct VS_INPUT
            {
                uint vertex_id : SV_VertexID;
                uint instance_id : SV_InstanceID;
            };

That way I pass in both the vertex and instance IDs, and then pass the rest of the information, like color and normals, in a buffer. Standard stuff, but this was actually my first go at learning shaders at all: I started with compute shaders and worked my way down to vertex/frag. Still not sure whether you can use surface shaders with this at all; I got lots of errors trying.
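To make that concrete, a sketch of a matching vertex shader body that indexes a buffer with those IDs (the buffer name `quads` and the float4 layout are placeholders, not the author's actual data):

```
StructuredBuffer<float4> quads; // filled by the compute shader

struct VS_INPUT
{
    uint vertex_id   : SV_VertexID;
    uint instance_id : SV_InstanceID;
};

float4 vert (VS_INPUT input) : SV_POSITION
{
    // Index the buffer with the IDs DrawProcedural hands us;
    // a real shader would also transform into clip space here.
    return quads[input.vertex_id];
}
```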

One other thing: while creating a data-driven pie chart library, I wanted to render each pie chart in its own OnRenderObject call with a DrawProcedural, but that doesn’t seem to work if you use DrawProcedural on two different GameObjects that share the same compute shader and vertex/frag shader. So I put it all in a “ChartManager” script on the main camera, gather the values from my separate GameObjects, run the compute shader, and then do a single DrawProcedural in one OnPostRender call.

Still haven’t figured out the one about garbage collection warnings in ExecuteInEditMode. If I do something like this…

    private void OnRenderObject()
    {
        if (!Application.isPlaying && Application.isEditor)
        {
            PieMaterial.SetPass(0);
            PieMaterial.SetBuffer("quads", bufQuads);
            Graphics.DrawProcedural(MeshTopology.Triangles, data.VertexLengths.Sum() * 4 * 6, 1);
        }
    }

It will work and draw the objects in edit mode, which I need for my pie charts. However, it invokes the garbage collector automatically, so you’ll get a whole bunch of warnings in your console about ComputeBuffers not being released. I tried calling Release on all buffers after that DrawProcedural, but that actually messed everything up in my scene. If I could just suppress those warnings somehow, or find some better way of handling that situation, that would be great.
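A pattern that may help here (a sketch under my own assumptions, not a guaranteed fix for the edit-mode warnings) is to create the buffer lazily and release it in both OnDisable and OnDestroy, so edit-mode toggles and script reloads clean it up deterministically instead of leaving it to the garbage collector:

```csharp
using UnityEngine;

[ExecuteInEditMode]
public class PieChartRenderer : MonoBehaviour
{
    ComputeBuffer bufQuads;

    ComputeBuffer GetQuadBuffer(int count, int stride)
    {
        // Recreate only when the requested size changes.
        if (bufQuads == null || bufQuads.count != count)
        {
            if (bufQuads != null) bufQuads.Release();
            bufQuads = new ComputeBuffer(count, stride);
        }
        return bufQuads;
    }

    void OnDisable() { ReleaseBuffer(); }
    void OnDestroy() { ReleaseBuffer(); }

    void ReleaseBuffer()
    {
        if (bufQuads != null)
        {
            bufQuads.Release();
            bufQuads = null;
        }
    }
}
```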

That should get you through most of the gotchas I’ve run into so far; hope some of that helps.


Hi, thanks, it is always good to have as much information as possible!

The 2048 bytes is the stride limit per struct instance, right, and not for the whole buffer size?
So I can still have an RWStructuredBuffer with hundreds of thousands of elements, as long as an element’s size does not exceed 2048 bytes?

It’s the stride limit per element, i.e., for the total size of each struct. You can allocate as many elements as you need, as long as memory allows.

I had one scenario where I wanted to pass a multidimensional array to an RWStructuredBuffer to make an Image2Ansi art program; that’s where I hit the struct limit and had to collapse things down to a smaller 1D array, then unpack it within the compute shader by doing all the offset calculations manually.

Just to explain the scenario of that little program: every 9x16 pixel block of the source image was compared to every 9x16 pixel block of a font-mapping image of the MS-DOS CP437 extended ASCII character set (so in a loop it’s 0…8 x, 0…15 y, 0…255 ascii, 0…8 j, 0…15 k, making 5 for loops) to find the closest match by Euclidean distance. So in this case I wanted a 9x16 struct of float3s or float4s to pass in, but remember that every float is 4 bytes, so that right there was 9 x 16 x 3 x 4 = 1728 bytes per struct for float3s, or 9 x 16 x 4 x 4 = 2304 bytes for float4s, the latter already over the per-element struct size limit for an RWStructuredBuffer. The compute shaders are quite fast though, so much faster than CPU: my calculations on that program ran over a billion Euclidean distances in a few seconds, versus over 10 minutes with CPU-based math. I could probably have squeezed the calculation time down even further with some effort.
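The flattening itself is just the usual row-major offset math; a sketch of how a 2D index collapses into the 1D buffer and is unpacked again inside a kernel (all names are illustrative):

```
RWStructuredBuffer<float3> pixels; // width * height elements, row-major

uint Flatten(uint x, uint y, uint width)
{
    // Row-major: each row of 'width' elements is laid out contiguously.
    return y * width + x;
}

[numthreads(8,8,1)]
void Unpack (uint3 id : SV_DispatchThreadID)
{
    uint width = 512; // would normally come from a constant set by C#
    float3 p = pixels[Flatten(id.x, id.y, width)];
    // ... use p ...
}
```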

So as long as you’re mindful of the size per struct then you’re good.

Nice, the strides of the buffers we are using are up to 128 bytes anyway, which is great, but I will keep the stride limit in mind!

2048 seems quite a lot, but apparently not for some fancy things :)