Sending Data to GPU, ComputeShader/Buffer, MaterialProperty and Hlsl Script

I currently have some questions about data structures, how I can send only the minimum I need to the GPU, and what happens with the rest. (I'm using the URP)

1. I need to send 2 values, each between 0 and 255.

So there are small types:
byte = 1
short = 2
int = 4
uint = 4

But a byte can't be sent to the GPU, right? So the minimum I can use is a short?

2. I'm using DrawMeshInstancedProcedural and have some questions about it.
Mesh as quad, material: all OK.

Submeshes: I don't need them. If I set the submesh index to 0, is that okay, or is there a way to remove them completely?

Bounds: whether I change them to zero or any other value, nothing changes when I hit Play?

Buffer count: if I send a larger buffer but only fill half of it, what is used?

MaterialPropertyBlock: I only use it for SetBuffer. If I use one StructuredBuffer for all my things, is there a way to send it directly without a MaterialPropertyBlock?

3. There are 2 files:
An HLSL file that can be used to implement some code into Shader Graph, right? (.hlsl file)
A compute shader to let the GPU calculate some things, with a kernel and numthreads (.compute file)

How do I get the perfect number of threads for a compute shader?

Now my question is: is it useful to use an HLSL file and a compute shader together with Shader Graph or a shader?
Is there some performance difference?

Once I'm clear about all this, I can write down my best option.
I hope someone can explain some things to me.
Thank you all :smile:

Edit:

2. Buffer count: I found out that using a large ComputeBuffer but filling it only partially impacts performance very heavily.

It depends on how you’re sending data. But generally you can’t guarantee the GPU will recognize anything but 32 bit types: int, uint, float. HLSL has no concept of byte or short. There are fixed and half variable types, but they’re defined as a signed floating point value that “can hold a value between -2 and 2 with a precision of at least 1/256” or is “at least 16 bits”, and a 32 bit float fulfills the requirements for both, so most GPUs just use that.

The easiest solution to passing two bytes to shaders is … don’t. Just pass two float values. You can define them as int or uint in the shader file, and there’s even a material.SetInt() function, but it’s a lie. Under the hood Unity casts that int to a float when you call SetInt(), then casts it back to an int or uint depending on what the shader wants.
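To see why that float detour is harmless for byte-sized values, note that a 32-bit float represents every integer up to 2^24 exactly. A standalone sketch (plain C#, outside Unity; the class and method names are mine):

```csharp
// Standalone sketch (no Unity APIs; names are mine). A 32-bit float stores
// every integer up to 2^24 exactly, which is why Unity can round-trip small
// integers like 0-255 through float without loss.
public static class FloatRoundTrip
{
    // True if casting to float and back preserves the integer exactly.
    public static bool SurvivesFloat(int value) => (int)(float)value == value;
}
```

`SurvivesFloat(255)` and `SurvivesFloat(1 << 24)` are true, while `SurvivesFloat((1 << 24) + 1)` is false: 16777217 is the smallest positive integer a float can't represent.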

However, if you're passing a lot of values via a compute buffer, you can take advantage of the fact C# does support the byte variable type, and that the compute buffer is passed to the GPU as raw bits that can be interpreted any way you want.

// c#
// create compute buffer
ComputeBuffer cb = new ComputeBuffer(numObjects, 2); // 2 bytes
// important: numObjects needs to be an even number

// struct of two bytes
public struct TwoBytes
{
    public byte a;
    public byte b;
}

// create array of bytes
TwoBytes[] data = new TwoBytes[numObjects];

// set the data in the array
for (int i=0; i<numObjects; i++)
{
    data[i].a = //object byte value A
    data[i].b = //object byte value B
}

// copy data in array into the compute buffer
cb.SetData(data);

// pass it to the shader calling SetBuffer() where appropriate
// shader code
StructuredBuffer<uint> _Data; // yes, a 32 bit uint, not a struct, not bytes

uint2 GetDataAtIndex(uint index)
{
    // real index is half of input index because shader is working with 32 bit uints and not bytes
    // this means the two bytes per index are packed into the first 16 and last 16 bits of the 32 bit uint
    uint realIndex = index / 2;
    uint packedData = _Data[realIndex];

    // bit shift over 16 bits if we're trying to get the odd index
    if (index % 2 == 1)
        packedData = (packedData >> 16);
   
    return uint2(
        (packedData >> 0) & 0xFF, // extract the first byte
        (packedData >> 8) & 0xFF // extract the second byte
        );
}
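As a sanity check of that layout outside the shader, here's a plain C# mirror (no Unity dependency; the `Pack` helper and names are mine, standing in for what SetData does with the struct's little-endian bytes):

```csharp
public static class BytePacking
{
    // CPU-side mirror of the shader's GetDataAtIndex: two bytes per element,
    // element i lives in uint i/2 (even i: low 16 bits, odd i: high 16 bits).
    public static (byte a, byte b) GetDataAtIndex(uint[] data, int index)
    {
        uint packed = data[index / 2];
        if (index % 2 == 1)
            packed >>= 16; // odd elements sit in the upper 16 bits

        return ((byte)(packed & 0xFF),         // first byte
                (byte)((packed >> 8) & 0xFF)); // second byte
    }

    // Pack a flat byte array (a0, b0, a1, b1, ...) into uints, matching the
    // little-endian raw bytes that ComputeBuffer.SetData would upload.
    public static uint[] Pack(byte[] bytes)
    {
        var result = new uint[(bytes.Length + 3) / 4];
        for (int i = 0; i < bytes.Length; i++)
            result[i / 4] |= (uint)bytes[i] << ((i % 4) * 8);
        return result;
    }
}
```

For example, `Pack(new byte[] { 10, 20, 30, 40 })` produces a single uint, and `GetDataAtIndex` on it returns (10, 20) for index 0 and (30, 40) for index 1.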

If you’re looking to pack this into an existing struct with other data in it, you’re likely best off just padding out the struct to keep it byte aligned to 32 bits.

// c# struct
public struct MyDataStruct
{
    public Vector3 position;
    public byte a;
    public byte b;
    public short padding;
} // sizeof(MyDataStruct) == 16

// hlsl struct
struct myDataStruct {
    float3 position;
    uint packedData;
}; // "get data" function just uses the last 2 lines to unpack

You always need at least 1 submesh. A mesh with zero submeshes is a mesh with no data.

Bounds are used by MeshRenderer components for CPU side frustum and occlusion culling. When you use DrawMeshInstancedProcedural(), and several of the similar functions, you're telling Unity to skip all of that because you're handling it yourself, especially since the position data you're passing in might not ever be known on the CPU side.

Junk data. Hopefully zeros, but I don’t know if it’s guaranteed.

That’s about as direct as you get. You could call SetBuffer() on the material directly, but if you’re rendering multiple sets of meshes with the same material you’ll want to use the property blocks.

I think we’d all wish we knew that answer.

Shader Graph is a shader generator. It spits out HLSL shader code that is otherwise nearly identical to what you could write by hand when writing a vertex fragment shader. The advantage of Shader Graph is it "just works" with the lighting systems without you having to do anything.

Writing a vertex fragment shader by hand may produce slightly more efficient / faster shader code as you can be very explicit about making sure the shader only does the things you need it to, but most of the time it won’t be a significant difference.

However you can't use a compute shader with Shader Graph, not directly. You can run a compute shader to generate data that you store in a compute buffer, then use that buffer with a Shader Graph that has a Custom Function node pointing at an HLSL file that accesses that buffer to extract the relevant data. But you can't include a compute shader in a Shader Graph. And at this time you can't create compute shaders using Shader Graph.


First of all, I thank you for the detailed explanation.
I tried something yesterday evening; first I tried to send only the int as an index to the GPU.
As seen here, I got some errors and a strange problem.
https://discussions.unity.com/t/863855

First I need to convert the index to a float4x4 for the matrix of the mesh. Once this is done, I can change from int to the byte script you posted here. Thank you for that, it is very useful in my case.

But I already put a "0" into that field and everything works fine?

Graphics.DrawMeshInstancedProcedural(mesh, 0, material, bounds, buffer.count, propertyBlock);

So I can ignore the bounds, because I'm only sending the data that I want to see?

I'm using only 1 type of mesh (quad), so I'll test out what the difference is. Thank you for that fact.

I found this on the forum; I think it can be a beginning.

Remember that the numbers you pass to Dispatch() are the amount of groups, not threads. If you want to process 4096 items and your kernel group size is (128, 1, 1), you need to call Dispatch(32, 1, 1).
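That quoted rule is just an integer ceiling division; a tiny helper (my own naming, not a Unity API) makes it explicit:

```csharp
public static class DispatchMath
{
    // Smallest group count such that groupCount * groupSize >= itemCount.
    // Equivalent to Mathf.CeilToInt((float)itemCount / groupSize) but stays
    // in integer math.
    public static int GroupCount(int itemCount, int groupSize)
        => (itemCount + groupSize - 1) / groupSize;
}
```

`GroupCount(4096, 128)` gives the 32 from the quote above; `GroupCount(4097, 128)` gives 33, with the last group partly idle, so the kernel should guard against out-of-range thread IDs.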

My way to do this is: create the data on the CPU, then the compute shader converts or does some math on that data, and this data is then used in the text file (.hlsl) inside Shader Graph to finally let the shader do its thing.

I’m a little smarter now than I was before.

Ah! I misunderstood the question! I was thinking about the settings on the mesh you passed to the DrawMeshInstancedProcedural(), not the actual parameters of that function!

Let’s do this again.
submeshIndex: You want it to be 0 because that’s the first submesh in the mesh. If you were using a mesh with multiple materials you’d need to call DrawMeshInstancedProcedural() multiple times, once for each submesh. The quad mesh just has one submesh.

bounds: This does need to be a position that’s in view of the camera. If the world origin is in view, a zero bounds will work. While the individual objects won’t get frustum culled automatically, the entire draw mesh call might be.

It's less about whether you're using one mesh and more about whether you're calling DrawMeshInstancedProcedural() multiple times per frame reusing the same material. Without property blocks, you'd need a unique material per DrawMeshInstancedProcedural() call.


I implemented it in my code but I got the error:
Invalid stride 2 for Compute Buffer - must be greater than 0, less or equal to 2048 and a multiple of 4.

I didn't know that the stride of a ComputeBuffer has to be a minimum of 4?

Ah, yeah. I guess Unity "knows" that the GPU can only interpret 32 bit variables. I was trying to sidestep that by using a stride of 2 and having you make sure you use an even numObjects count.

Just means you’ll have to deal with some of the logistics of the values actually being packed on the C# side as well.

You’d have to use a struct with 4 byte variables in it with “Object i+0” and “Object i+1” represented.

// c#
int bufferSize = Mathf.CeilToInt((float)numObjects / 2f);
// create compute buffer
ComputeBuffer cb = new ComputeBuffer(bufferSize, 4); // 4 bytes

// struct of two byte pairs (four bytes total)
public struct TwoTwoBytes
{
    public byte a0;
    public byte b0;

    public byte a1;
    public byte b1;
}

// create array of bytes
TwoTwoBytes[] data = new TwoTwoBytes[bufferSize];

// set the data in the array
for (int i = 0; i < numObjects; i += 2)
{
    data[i / 2].a0 = //object i byte value A
    data[i / 2].b0 = //object i byte value B

    data[i / 2].a1 = //object i+1 byte value A
    data[i / 2].b1 = //object i+1 byte value B
}

The shader code would be unchanged.


For now I get no errors, but DrawMesh isn't drawing anything?

    uint2 GetDataAtIndex(uint index) {
        // real index is half of input index because shader is working with 32 bit uints and not bytes
        // this means the two bytes per index are packed into the first 16 and last 16 bits of the 32 bit uint
        uint realIndex = index / 2;
        uint packedData = _Indexes[realIndex];

        // bit shift over 16 bits if we're trying to get the odd index
        if (index % 2 == 1)
            packedData = (packedData >> 16);

        return uint2(
            (packedData >> 0) & 0xFF, // extract the first byte
            (packedData >> 8) & 0xFF // extract the second byte
            );
    }

void ConfigureProcedural () {
    #if defined(UNITY_PROCEDURAL_INSTANCING_ENABLED)
        uint2 i2 = GetDataAtIndex(unity_InstanceID);
        int i = i2.x;
        int y = i / (128 * 128);
        int x = (i - y * 128 * 128) / 128;
        int z = i - y * 128 * 128 - x * 128;
        int d = (i % 6);
        float3 v = float3(x, y, z) + DirectionVector[d];
        float3x4 m = float3x4(rot1[d], rot2[d], rot3[d], v);
        //unity_ObjectToWorld = m;
    
        //float3x4 m = _Matrices[unity_InstanceID];
        unity_ObjectToWorld._m00_m01_m02_m03 = m._m00_m01_m02_m03;
        unity_ObjectToWorld._m10_m11_m12_m13 = m._m10_m11_m12_m13;
        unity_ObjectToWorld._m20_m21_m22_m23 = m._m20_m21_m22_m23;
        unity_ObjectToWorld._m30_m31_m32_m33 = float4(0.0, 0.0, 0.0, 1.0);
    #endif
}

Is this correct? I added a TwoBytes of 4 (a, b, a1, b1) like you posted above.
I'm currently leaving "b" and "a1, b1" out of the calculation. B is for the index of the texture atlas; that comes into the game later.

The buffer now has a smaller size than before, when I was sending a float3x4 with a stride of 48. That's great if I get it working xD

If I want to use the a1 and b1 values later for other things, how should I extract them inside the shader? Because A was inside the first 16 bits and B is inside the last 16 bits, but where are a1 and b1?

I got it working by changing the matrix.

Update [SOLVED]: I got it working right; I made a mistake in the matrix.

        unity_ObjectToWorld._m00_m01_m02 = rot1[d] / 1.0; //Size x
        unity_ObjectToWorld._m10_m11_m12 = rot2[d] / 1.0; //Size y
        unity_ObjectToWorld._m20_m21_m22 = rot3[d] / 1.0; //Rotation / Size z

        unity_ObjectToWorld._m03_m13_m23 = v; //Position
        unity_ObjectToWorld._m33 = 1.0; //Always 1.0 for projection

That’s the correct Matrix now.

A question about the packing of bytes: if I have 6 bytes and want to send them to the GPU, where is each byte packed?

I'm confused by the question, because you aren't giving enough information to be able to answer.

But think about it this way: the actual layout of the structs on the C# and HLSL sides does not actually matter that much. They certainly don't need to match. The CPU is passing a stream of bits to the GPU which is being interpreted in whatever way you want. That's how the "TwoTwoBytes" example worked. It was passing an array of 32 bits that on the CPU was a struct of 4 byte values, and on the GPU was an array of uint values.

You could pass an array of 6 byte value structs, and then interpret that as an array of uints still, where 4 values are packed into one uint, and 2 more are packed into the start / end of the next / previous uint.

Though depending on the data, you might start looking at other ways of packing, especially if the values don't use the full 0-255 range a byte gives you. For example, you're only using 6 values for the orientation. Assuming you're looking to separate those out from the position, that only needs 3 bits to store. So you could directly bit pack a uint on the CPU and potentially get all 6 values you need in that (depending on the precision you need for each).
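As a concrete sketch of that last suggestion (my own hypothetical layout, not from the thread: x, y, z each 0-127, i.e. 7 bits for a 128³ world, plus a 3-bit orientation 0-5, all in one uint):

```csharp
// Hypothetical layout: x, y, z in 0-127 (7 bits each, matching a 128^3
// world) plus orientation 0-5 (3 bits). 24 bits total, fits in one uint.
public static class VoxelPacking
{
    public static uint Pack(int x, int y, int z, int orientation)
        => (uint)(x | (y << 7) | (z << 14) | (orientation << 21));

    public static (int x, int y, int z, int orientation) Unpack(uint packed)
        => ((int)(packed & 0x7F),          // bits 0-6:   x
            (int)((packed >> 7) & 0x7F),   // bits 7-13:  y
            (int)((packed >> 14) & 0x7F),  // bits 14-20: z
            (int)((packed >> 21) & 0x7));  // bits 21-23: orientation
}
```

The HLSL side would apply the same shifts and masks to the uint from the StructuredBuffer.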

If this is my struct, each value can be 0 to 255.
I currently need something like this for testing; I hope I can avoid it.

    struct MoreBytes {
        byte a, b, c, d, e, f;
    }

My main problem is reloading the world each frame when the player position updates; that impacts my performance too much.

Generating and loading once gives me >300 FPS, but I need a fast way to get all voxels in render distance and put them into the buffer.

In my case: 1 voxel has 6 quads, and only the quads whose neighbor is air are shown. I got it working to precalculate the quads once, so now all data is precalculated once.

The best option for me would be a list of chunks where each chunk has an array of voxel data, but nested arrays are not allowed in Burst :confused:

Is there a good way to get all positions (rounded, like 0,0,1 or 10,2,3) at the player location and do a .Copy inside BeginWrite to the ComputeBuffer?

That sounds more like a data management problem rather than anything directly related to rendering.

And no, there’s no way to efficiently get only positions around you to copy out of a basic array. This is why things like Morton Z ordering are a thing, as well as data partitioning. The simplest solution is to break up your world into larger chunks, with each chunk having a pre-calculated AABB bounds. Test against those bounds, and then either render each chunk separately, or copy them into a larger flat array only when you need to change what you’re rendering.
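For reference, the Morton Z ordering mentioned above builds the flat index by interleaving coordinate bits. A common 10-bits-per-axis sketch (plain C#, using the standard bit-spreading constants):

```csharp
// Sketch of a 3D Morton (Z-order) index: interleave the bits of x, y, z so
// spatially nearby cells end up near each other in the flat array.
// Supports coordinates 0-1023 per axis (10 bits).
public static class Morton
{
    // Spread the low 10 bits of v so each bit is followed by two zero bits.
    static uint Part1By2(uint v)
    {
        v &= 0x000003FF;
        v = (v | (v << 16)) & 0x030000FF;
        v = (v | (v << 8))  & 0x0300F00F;
        v = (v | (v << 4))  & 0x030C30C3;
        v = (v | (v << 2))  & 0x09249249;
        return v;
    }

    public static uint Encode(uint x, uint y, uint z)
        => Part1By2(x) | (Part1By2(y) << 1) | (Part1By2(z) << 2);
}
```

`Encode(1,0,0)` is 1, `Encode(0,1,0)` is 2, `Encode(0,0,1)` is 4: each axis claims every third bit of the index.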

This looks like my problem solver, but how can I create a flattened 3D array with Z ordering?

Another solution that can work for me would be:

  1. I need all positions that the player sees, like a "field of view"
  2. Then I convert them to the indexes for each position

Now comes the point where I can’t get any further.

  1. Copy individual values into the ComputeBuffer (like indexes, but never copy a byte or int that is equal to 0)

Some pseudo code:

NativeArray<T>.Copy(data, firstIndex, bufferData, 0, amount) where i > 0;

If I get this working, I'm finished with all the generating and loading :smile:

There isn't really any solution to frustum culling that isn't going through the list one by one and adding the entries that pass the visibility test to a new list.

However you might want to look into doing it on the GPU instead of on the CPU.

I've got it working with a chunk system inside a NativeMultiHashMap<int, uint2>; the loading is very fast.
My current problem is:

I store the whole world inside the NativeMultiHashMap, but I got the message:

“Attempted to operate on {size} bytes of memory: nonsensical”

Where is the limit on capacity?