I’ve been working on a little project for a while: creating an area where you can spawn a bunch of grass. I overcame a major headache figuring out GPU instancing for the mesh, and the code looks like this:
public class Grass : MonoBehaviour
{
    [HideInInspector]
    public Mesh mesh;
    [HideInInspector]
    public List<Matrix4x4> matrices;
    [HideInInspector]
    public Material material;

    private void Update()
    {
        Graphics.DrawMeshInstanced(mesh, 0, material, matrices);
    }
}
Now that I’m done with this part, I want to optimize some things. I’ve never written a shader beyond following a few Shader Graph tutorials, so I’m asking for help writing a compute shader that would work with my code to do frustum culling.
Once you get past the initial weird terminology, compute shaders are actually really straightforward. I’d recommend getting comfortable with the basics first; you’ll then quickly see how to use them to perform culling (assuming the math part of it is not an issue).
This is a good starting point:
The gist of it is to spawn one thread per instance, check whether it intersects the camera’s frustum, and if it does, append it to a list. The append step can be done in a variety of ways: using an append buffer, a structured buffer plus an atomic increment on a counter, a structured buffer plus a parallel prefix sum, etc.
Also note that bringing data from the GPU back to the CPU is very costly, so once your GPU culling pass has produced the final list of instances to draw, avoid reading it back just to call DrawMeshInstanced. Instead, use one of the Indirect drawing method variants and pass your list of instance transforms as a ComputeBuffer to the shader that does the rendering. This avoids any round trips to the CPU.
Thank you for your kind response and for the article you sent; I will definitely look into it. Is there maybe one that covers the frustum culling calculation? I think Unity has some predefined variables containing the camera’s planes, but I’m not sure.
Another question that popped into my head after reading your answer: what is the exact difference between indirect and direct drawing? So DrawMeshInstanced vs. DrawMeshInstancedIndirect: aren’t both of those just methods for GPU instancing a mesh, thus reducing the work that needs to be done on the CPU side?
You simply need to pass the camera planes to the compute shader, then check instance bounding boxes (or bounding spheres, or any other bounding geometry you have) against all planes.
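As a rough illustration of the plane test described above, here is a CPU-side reference in plain C# (no Unity types) that mirrors what each compute shader thread would do. It assumes each plane is packed as four floats (normal.x, normal.y, normal.z, distance), that a point lies on the plane when dot(normal, point) + distance is zero, and that the normals point towards the inside of the frustum, which I believe is how GeometryUtility.CalculateFrustumPlanes returns them. The class and method names are made up for this sketch.

```csharp
using System;

public static class FrustumCullingReference
{
    // planes: 4 floats per plane -> (normal.x, normal.y, normal.z, distance).
    // Assumption: normals point into the frustum, so a negative signed
    // distance means "outside this plane".
    public static bool SphereInFrustum(float[] planes, float x, float y, float z, float radius)
    {
        for (int i = 0; i + 3 < planes.Length; i += 4)
        {
            float signedDistance =
                planes[i] * x + planes[i + 1] * y + planes[i + 2] * z + planes[i + 3];

            // The sphere is entirely behind this plane: cull it.
            if (signedDistance < -radius)
                return false;
        }

        // Not fully outside any plane: treat as visible (conservative test).
        return true;
    }
}
```

The test is conservative: a sphere near a frustum corner can pass even though it is slightly outside, which is usually acceptable for grass. The same loop translates almost line by line to HLSL, with the planes passed in as a float4 array.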
Indirect drawing means the data being drawn (instance transforms, number of instances, etc.) is not passed from the CPU to the GPU in the call: it’s already on the GPU, and is just used when needed.
Okay, I’ve managed to set up a script using the compute shader above and this C# script.
public class Grass : MonoBehaviour
{
    [HideInInspector]
    public Mesh mesh;
    [HideInInspector]
    public List<Matrix4x4> matrices;
    [HideInInspector]
    public Material material;
    public ComputeShader computeShader;
    private ComputeBuffer instanceBuffer;
    private ComputeBuffer visibleBuffer;
    private List<float4> instances = new();
    private float4[] visible;
    private Matrix4x4[] visibleMatrices;

    private void Start()
    {
        foreach (Matrix4x4 matrix in matrices)
        {
            instances.Add(matrix.GetColumn(3));
        }
        instanceBuffer = new ComputeBuffer(instances.Count, 16);
        instanceBuffer.SetData(instances);
        visible = new float4[instances.Count];
        visibleMatrices = new Matrix4x4[instances.Count];
    }

    private void Update()
    {
        float[] planeNormals;
        ConstructFrustumPlanes(Camera.main, out planeNormals);
        computeShader.SetFloats("_CameraFrustumPlanes", planeNormals);
        visibleBuffer = new ComputeBuffer(instances.Count, 16);
        computeShader.SetBuffer(0, "instanceBuffer", instanceBuffer);
        computeShader.SetBuffer(0, "visibleList", visibleBuffer);
        int jobs = Mathf.CeilToInt(instances.Count / 16f);
        computeShader.Dispatch(0, jobs, jobs, 1);
        RenderParams rp = new RenderParams(material)
        {
            shadowCastingMode = ShadowCastingMode.On
        };
        visibleBuffer.GetData(visible);
        for (var i = 0; i < visible.Length; i++)
        {
            var position = visible[i];
            Matrix4x4 matrix = Matrix4x4.Translate(new Vector3(position.x, position.y, position.z));
            visibleMatrices[i] = matrix;
        }
        // not rendering those outside but they're still present ?
        // also: not very performant
        Graphics.RenderMeshInstanced(rp, mesh, 0, visibleMatrices);
    }

    private void ConstructFrustumPlanes(Camera camera, out float[] planeNormals)
    {
        const int floatPerNormal = 4;
        Plane[] planes = GeometryUtility.CalculateFrustumPlanes(camera);
        planeNormals = new float[planes.Length * floatPerNormal];
        for (int i = 0; i < planes.Length; i++)
        {
            planeNormals[i * floatPerNormal + 0] = planes[i].normal.x;
            planeNormals[i * floatPerNormal + 1] = planes[i].normal.y;
            planeNormals[i * floatPerNormal + 2] = planes[i].normal.z;
            planeNormals[i * floatPerNormal + 3] = planes[i].distance;
        }
    }

    private void OnDestroy()
    {
        instanceBuffer.Release();
        visibleBuffer.Release();
    }
}
It’s kind of working? I don’t know. I’m still using DrawMeshInstanced for testing purposes, and it does cull meshes that are out of view.
But when I look deeply into the scene, the grass does get culled, yet the number of vertices and triangles doesn’t change and it’s not very performant. Is it because I’m not drawing indirectly, or because I’m putting too much load on the CPU with all those Matrix4x4 to float4 conversions and back? How can I modify it?
Your code just skips copying the instance position to the “visibles” array when the original position is outside the camera, but the actual number of instances doesn’t change, so it is not culling anything. It just moves some instances to 0,0,0, since that’s what an array of float4s is initialized to by default and therefore where instances get rendered when you don’t write their actual position to the array. It may seem like it’s working because those instances are all drawn at the same spot and overlap, but all of them are drawn nonetheless.
Also, what you’re doing is actually slower than performing culling on the CPU, since you’re stalling the pipeline by calling GetData(): this forces the CPU to wait until the GPU has finished copying all the data, which is really, really slow. On top of that, you’re then iterating through all instances again on the CPU just to convert positions to 4x4 matrices.
Things you should fix:
Build a list that contains the positions of only those instances that pass the culling test. You can use an atomic counter for this, or a parallel prefix sum if you don’t want to use atomics. I’d recommend the atomic counter approach as it is simpler.
Don’t call GetData. Your instance data is on the GPU and that’s where it should stay; use indirect drawing to draw it. Note that you should store the number of instances that remain “alive” after culling and pass that in the argument buffer of the indirect draw call.
If you’re not using rotations/scaling, no need to store 4x4 matrices: just the position is enough, you can simply displace the instance’s vertices in your vertex shader by adding the instance position to the vertex position. If you’re using pos/rot/scale, you can store matrices in a ComputeBuffer just fine too - no reason to restrict yourself to float4.
A small note: you’re using threadgroups with 16 threads in both X and Y, since you’re dealing with 1-dimensional data it’s simpler to just use 256 threads in the X dimension.
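To make the argument buffer part of the list above a bit more concrete, here is a small sketch of the five-uint layout that, as far as I know, Graphics.DrawMeshInstancedIndirect expects: index count per instance, instance count, start index location, base vertex location, start instance location. The class and method names are hypothetical, and the code is plain C# so it can be checked outside Unity.

```csharp
using System;

public static class IndirectArgsReference
{
    // Slot that the culling pass is expected to overwrite on the GPU.
    public const int InstanceCountSlot = 1;

    // Builds the initial contents of the indirect arguments buffer.
    // The instance count starts at zero; the compute shader fills it in.
    public static uint[] Build(uint indexCountPerInstance, uint startIndex, uint baseVertex)
    {
        return new uint[]
        {
            indexCountPerInstance, // [0] indices per instance
            0u,                    // [1] instance count (written by the culling pass)
            startIndex,            // [2] start index location
            baseVertex,            // [3] base vertex location
            0u                     // [4] start instance location
        };
    }
}
```

In Unity the inputs would typically come from mesh.GetIndexCount(0), mesh.GetIndexStart(0) and mesh.GetBaseVertex(0), and the array would be uploaded once into a ComputeBuffer created with ComputeBufferType.IndirectArguments. Only slot 1 needs to change per frame, and it can be reset and incremented entirely on the GPU.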
A counter is just a variable that you increment. Atomic means that it can safely be incremented from multiple threads.
In HLSL, you can use InterlockedAdd to atomically increment a variable.
So the idea is to have a ComputeBuffer with a single entry that acts as the counter, and every time you want to append a new instance to the “visible” list, you atomically increment the counter using InterlockedAdd and use the previous value of the counter as an index into the visibles array, writing the instance position there. This way you end up with a list that contains only visible instances, and a counter that tells you how many visible instances there are.
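The counter idea can be tried out on the CPU before touching any shader code. The sketch below is plain C#, not Unity or HLSL: Parallel.For stands in for the GPU threads and Interlocked.Increment stands in for InterlockedAdd. All names are invented for the example, and positions are reduced to a single float to keep it short.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class AtomicAppendReference
{
    // Copies every position flagged as visible to the front of 'result'
    // and returns how many were written. The output order is not
    // deterministic, just like on the GPU.
    public static int Compact(float[] positions, bool[] visible, float[] result)
    {
        int counter = 0; // plays the role of the single-entry counter buffer

        Parallel.For(0, positions.Length, i =>
        {
            if (!visible[i]) return;

            // Interlocked.Increment returns the NEW value, so subtract 1 to get
            // the previous value, which is this thread's private slot.
            // (HLSL's InterlockedAdd(dest, value, original) hands back the
            // previous value directly through its third argument.)
            int slot = Interlocked.Increment(ref counter) - 1;
            result[slot] = positions[i];
        });

        return counter;
    }
}
```

After the loop, the first `counter` entries of `result` hold only visible instances, and `counter` itself is the instance count an indirect draw would need.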
As far as I know, ShaderGraph doesn’t support indirect instancing as of now. You must use handwritten vertex/fragment shaders; Unity’s documentation for the indirect drawing methods usually contains a sample shader that can be used with that particular method.
Okay, I’ve managed to get it working with some help from the internet, and actually figured out that Shader Graph does support indirect instancing, but you need to use custom functions with some HLSL code and a MaterialPropertyBlock in the C# code. It’s working but behaves very weirdly and I can’t seem to figure it out. Do you know what might be wrong with it?
C# Script:
private Camera camera;

struct DrawData
{
    public Vector3 position;
    public Quaternion rotation;
    public Vector3 scale;
}

[HideInInspector]
public Mesh mesh;
[HideInInspector]
public List<Matrix4x4> matrices;
[HideInInspector]
public Material material;
[Range(0, 1000f)]
public float distanceCutoff;
private List<DrawData> instances;
private ComputeShader cullShader;
private ComputeBuffer drawDataBuffer, argsBuffer, voteBuffer, scanBuffer, groupSumArrayBuffer, scannedGroupSumBuffer, resultBuffer;
private int numThreadGroups, numGroupScanThreadGroups;
private uint[] args = new uint[5];
private MaterialPropertyBlock mpb;

private void Awake()
{
    camera = Camera.main;
    mpb = new MaterialPropertyBlock();
    instances = new List<DrawData>();
    LoadInstances();
    drawDataBuffer = new ComputeBuffer(instances.Count, Marshal.SizeOf<DrawData>());
    drawDataBuffer.SetData(instances);
    numThreadGroups = Mathf.CeilToInt(instances.Count / 128.0f);
    numGroupScanThreadGroups = Mathf.CeilToInt(instances.Count / 1024.0f);
    cullShader = Resources.Load<ComputeShader>("ComputeShaders/Cull");
    voteBuffer = new ComputeBuffer(instances.Count, 4);
    scanBuffer = new ComputeBuffer(instances.Count, 4);
    groupSumArrayBuffer = new ComputeBuffer(instances.Count, 4);
    scannedGroupSumBuffer = new ComputeBuffer(instances.Count, 4);
    resultBuffer = new ComputeBuffer(instances.Count, Marshal.SizeOf<DrawData>());
    argsBuffer = new ComputeBuffer(5, sizeof(uint), ComputeBufferType.IndirectArguments);
    args[0] = mesh.GetIndexCount(0);
    args[1] = (uint)instances.Count;
    args[2] = (uint)mesh.GetIndexStart(0);
    args[3] = (uint)mesh.GetBaseVertex(0);
    mpb.SetBuffer("_DrawData", resultBuffer);
}

private void CullGrass(Matrix4x4 VP)
{
    argsBuffer.SetData(args);
    // Vote
    cullShader.SetMatrix("MATRIX_VP", VP);
    cullShader.SetBuffer(0, "drawDataBuffer", drawDataBuffer);
    cullShader.SetBuffer(0, "voteBuffer", voteBuffer);
    cullShader.SetVector("cameraPosition", camera.transform.position);
    cullShader.SetFloat("_distance", distanceCutoff);
    cullShader.Dispatch(0, numThreadGroups, 1, 1);
    // Scan Instances
    cullShader.SetBuffer(1, "voteBuffer", voteBuffer);
    cullShader.SetBuffer(1, "scanBuffer", scanBuffer);
    cullShader.SetBuffer(1, "groupSumArray", groupSumArrayBuffer);
    cullShader.Dispatch(1, numThreadGroups, 1, 1);
    // Scan Groups
    cullShader.SetInt("numOfGroups", numThreadGroups);
    cullShader.SetBuffer(2, "groupSumArrayIn", groupSumArrayBuffer);
    cullShader.SetBuffer(2, "groupSumArrayOut", scannedGroupSumBuffer);
    cullShader.Dispatch(2, numGroupScanThreadGroups, 1, 1);
    // Compact
    cullShader.SetBuffer(3, "drawDataBuffer", drawDataBuffer);
    cullShader.SetBuffer(3, "voteBuffer", voteBuffer);
    cullShader.SetBuffer(3, "scanBuffer", scanBuffer);
    cullShader.SetBuffer(3, "argsBuffer", argsBuffer);
    cullShader.SetBuffer(3, "resultBuffer", resultBuffer);
    cullShader.SetBuffer(3, "groupSumArray", scannedGroupSumBuffer);
    cullShader.Dispatch(3, numThreadGroups, 1, 1);
}

private void Update()
{
    Matrix4x4 P = camera.projectionMatrix;
    Matrix4x4 V = camera.transform.worldToLocalMatrix;
    Matrix4x4 VP = P * V;
    CullGrass(VP);
    Graphics.DrawMeshInstancedIndirect(mesh, 0, material, new Bounds(Vector3.zero, Vector3.one * 100.0f), argsBuffer, 0, mpb);
}

private void LoadInstances()
{
    instances.Clear();
    foreach (var matrix in matrices)
    {
        instances.Add(new DrawData()
        {
            position = GetPositionFromMatrix(matrix),
            rotation = GetRotationFromMatrix(matrix),
            scale = GetScaleFromMatrix(matrix)
        });
    }
    Debug.Log($"Initialized {instances.Count} instances of grass.");
}

private Vector3 GetPositionFromMatrix(Matrix4x4 matrix)
{
    return matrix.GetColumn(3);
}

private Vector3 GetScaleFromMatrix(Matrix4x4 matrix)
{
    return new Vector3(matrix.GetColumn(0).magnitude, matrix.GetColumn(1).magnitude, matrix.GetColumn(2).magnitude);
}

private Quaternion GetRotationFromMatrix(Matrix4x4 matrix)
{
    float w = Mathf.Sqrt(1 + matrix.m00 + matrix.m11 + matrix.m22) / 2f;
    float x = (matrix.m21 - matrix.m12) / (w * 4);
    float y = (matrix.m02 - matrix.m20) / (w * 4);
    float z = (matrix.m10 - matrix.m01) / (w * 4);
    return new Quaternion(x, y, z, w);
}

private void OnDisable()
{
    argsBuffer?.Release();
    drawDataBuffer?.Release();
    voteBuffer?.Release();
    scanBuffer?.Release();
    groupSumArrayBuffer?.Release();
    scannedGroupSumBuffer?.Release();
    resultBuffer?.Release();
}
Compute Shader:
#pragma kernel Vote
#pragma kernel Scan
#pragma kernel ScanGroupSums
#pragma kernel Compact
#pragma kernel ResetArgs

#define NUM_THREAD_GROUPS_X 64

struct DrawData
{
    float3 position;
    float4 rotation;
    float3 scale;
};

RWStructuredBuffer<uint> argsBuffer;
RWStructuredBuffer<DrawData> drawDataBuffer;
RWStructuredBuffer<uint> voteBuffer;
RWStructuredBuffer<uint> scanBuffer;
RWStructuredBuffer<uint> groupSumArray;
RWStructuredBuffer<uint> groupSumArrayIn;
RWStructuredBuffer<uint> groupSumArrayOut;
RWStructuredBuffer<DrawData> resultBuffer;
float4x4 MATRIX_VP;
int numOfGroups;
groupshared uint temp[2 * NUM_THREAD_GROUPS_X];
groupshared uint grouptemp[2 * 1024];
float _distance;
float3 cameraPosition;

[numthreads(128, 1, 1)]
void Vote(uint3 id : SV_DispatchThreadID)
{
    float4 position = float4(drawDataBuffer[id.x].position, 1.0f);
    float4 viewspace = mul(MATRIX_VP, position);
    float3 clipspace = viewspace.xyz;
    clipspace /= -viewspace.w;
    clipspace.x = clipspace.x / 2.0f + 0.5f;
    clipspace.y = clipspace.y / 2.0f + 0.5f;
    clipspace.z = -viewspace.w;
    bool inView = clipspace.x < -0.2f || clipspace.x > 1.2f || clipspace.z <= -0.1f ? 0 : 1;
    bool withinDistance = distance(cameraPosition, position.xyz) < _distance;
    voteBuffer[id.x] = inView * withinDistance;
}

[numthreads(NUM_THREAD_GROUPS_X, 1, 1)]
void Scan(uint3 id : SV_DISPATCHTHREADID, uint groupIndex : SV_GROUPINDEX, uint3 _groupID : SV_GROUPID,
    uint3 groupThreadID : SV_GROUPTHREADID)
{
    int tid = (int)id.x;
    int groupTID = (int)groupThreadID.x;
    int groupID = (int)_groupID.x;
    int offset = 1;
    temp[2 * groupTID] = voteBuffer[2 * tid];
    temp[2 * groupTID + 1] = voteBuffer[2 * tid + 1];
    int d;
    int numElements = 2 * NUM_THREAD_GROUPS_X;
    for (d = numElements >> 1; d > 0; d >>= 1)
    {
        GroupMemoryBarrierWithGroupSync();
        if (groupTID < d)
        {
            int ai = offset * (2 * groupTID + 1) - 1;
            int bi = offset * (2 * groupTID + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }
    if (groupTID == 0)
    {
        groupSumArray[_groupID.x] = temp[numElements - 1];
        temp[numElements - 1] = 0;
    }
    for (d = 1; d < numElements; d *= 2)
    {
        offset >>= 1;
        GroupMemoryBarrierWithGroupSync();
        if (groupTID < d)
        {
            int ai = offset * (2 * groupTID + 1) - 1;
            int bi = offset * (2 * groupTID + 2) - 1;
            int t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    GroupMemoryBarrierWithGroupSync();
    scanBuffer[2 * tid] = temp[2 * groupTID];
    scanBuffer[2 * tid + 1] = temp[2 * groupTID + 1];
}

[numthreads(1024, 1, 1)]
void ScanGroupSums(uint3 id : SV_DISPATCHTHREADID, uint groupIndex : SV_GROUPINDEX, uint3 _groupID : SV_GROUPID,
    uint3 groupThreadID : SV_GROUPTHREADID)
{
    int tid = (int)id.x;
    int groupTID = (int)groupThreadID.x;
    int groupID = (int)_groupID.x;
    int offset = 1;
    grouptemp[2 * groupTID] = groupSumArrayIn[2 * tid];
    grouptemp[2 * groupTID + 1] = groupSumArrayIn[2 * tid + 1];
    int d;
    for (d = numOfGroups >> 1; d > 0; d >>= 1)
    {
        GroupMemoryBarrierWithGroupSync();
        if (groupTID < d)
        {
            int ai = offset * (2 * groupTID + 1) - 1;
            int bi = offset * (2 * groupTID + 2) - 1;
            grouptemp[bi] += grouptemp[ai];
        }
        offset *= 2;
    }
    if (tid == 0)
        grouptemp[numOfGroups - 1] = 0;
    for (d = 1; d < numOfGroups; d *= 2)
    {
        offset >>= 1;
        GroupMemoryBarrierWithGroupSync();
        if (tid < d)
        {
            int ai = offset * (2 * groupTID + 1) - 1;
            int bi = offset * (2 * groupTID + 2) - 1;
            int t = grouptemp[ai];
            grouptemp[ai] = grouptemp[bi];
            grouptemp[bi] += t;
        }
    }
    GroupMemoryBarrierWithGroupSync();
    groupSumArrayOut[2 * tid] = grouptemp[2 * tid];
    groupSumArrayOut[2 * tid + 1] = grouptemp[2 * tid + 1];
}

[numthreads(128, 1, 1)]
void Compact(uint3 id : SV_DISPATCHTHREADID, uint groupIndex : SV_GROUPINDEX, uint3 _groupID : SV_GROUPID,
    uint3 groupThreadID : SV_GROUPTHREADID)
{
    uint tid = id.x;
    uint groupID = _groupID.x;
    uint groupSum = groupID.x > 0 ? groupSumArray[groupID.x] : 0;
    bool inCamera = voteBuffer[id.x];
    if (inCamera == 1)
    {
        InterlockedAdd(argsBuffer[1], 1);
        resultBuffer[scanBuffer[tid] + groupSum] = drawDataBuffer[tid];
    }
}

[numthreads(1, 1, 1)]
void ResetArgs(uint3 id : SV_DISPATCHTHREADID)
{
    argsBuffer[1] = (uint)0;
}
So you went the prefix sum route, I see. This is considerably harder to get right than the counter approach.
Your best bet to debug this is to isolate each step (isolate the culling code, isolate the parallel scan, etc.) and use GetData to visualize what the output of each stage is, making sure it is correct.
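One way to check a stage is to compare what GetData returns against a trivially correct CPU version. Below is a minimal reference for the scan stage in plain C#: a sequential exclusive prefix sum over the vote array, plus a helper that reports where two arrays first differ. The names are hypothetical, and it assumes the GPU value you compare against is the global offset, i.e. scanBuffer[i] plus the scanned sum of that element's group.

```csharp
using System;

public static class ScanReference
{
    // Exclusive prefix sum: output[i] is the number of set votes before index i,
    // which is exactly the slot a visible instance i should be written to.
    public static uint[] ExclusiveScan(uint[] votes)
    {
        var result = new uint[votes.Length];
        uint running = 0;
        for (int i = 0; i < votes.Length; i++)
        {
            result[i] = running;
            running += votes[i];
        }
        return result;
    }

    // Returns the index of the first difference between the two arrays,
    // or -1 when they match. A length mismatch is reported at the shorter length.
    public static int FirstMismatch(uint[] gpu, uint[] cpu)
    {
        if (gpu.Length != cpu.Length)
            return Math.Min(gpu.Length, cpu.Length);

        for (int i = 0; i < cpu.Length; i++)
        {
            if (gpu[i] != cpu[i])
                return i;
        }
        return -1;
    }
}
```

Running this on a few dozen instances and logging the first mismatching index usually narrows the problem down to a single kernel. Readbacks like this are only for debugging; they should not stay in the per-frame path.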
About indirect drawing in ShaderGraph: it is not supported. There are hacky workarounds for it, but they only work somewhat reliably in DX11. They break in DX12 and Vulkan, and sometimes work in Metal (I speak from experience: I develop an asset for the store that uses this and it’s been a pain in the arse). Depending on what your goal is you might be fine, but I wouldn’t rely on this in a production environment. This feature is on Unity’s roadmap, but yet to be considered (vote for it if you want it!): https://portal.productboard.com/unity/1-unity-platform-rendering-visual-effects/c/61-support-for-drawindirect-drawprocedural
Imho the lack of official support for indirect drawing in ShaderGraph plus the practical inability to hand-write shaders for HDRP means there’s no sane way to indirectly draw anything in HDRP, which is laughable considering it is supposed to be a cutting-edge, high fidelity pipeline.
So the MaterialPropertyBlock route is just a workaround and might not work in DX12?
What step do you think breaks? It looks like only half of the grass is visible to the camera, the rest is just glitching out… Maybe I made a mistake in calculating the clip space?
Well, I started debugging with a small grass count and started logging the resultBuffer, and indeed it has some weird numbers. Sometimes (actually a lot of the time) the position is (-431602100.00, -431602100.00,-431602100.00), the rotation is (0, 90, 90), and the scale is the same as the position.
I also noticed that the grass is sometimes not visible to the scene camera from some angles…
ComputeBuffer memory is uninitialized by default, so these weird numbers don’t really indicate that anything’s wrong as long as you’re not reading or using them.
Note that in all of your kernels you’re reading id.x directly without first checking whether it’s within the range of valid data. E.g. if you have 238 instances and your thread group size is 128, you’ll dispatch 2 thread groups for a total of 256 threads, but only the first 238 threads should do any work.
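The arithmetic behind this is easy to check on the CPU. The sketch below, in plain C# with made-up names, computes the group count for a 1D dispatch and simulates the range guard each thread would need. In the kernel itself the guard would be something like `if (id.x >= instanceCount) return;`, where instanceCount is a hypothetical uniform you would set from C# with ComputeShader.SetInt.

```csharp
using System;

public static class DispatchReference
{
    // Number of thread groups needed to cover itemCount items (integer ceil).
    public static int GroupCount(int itemCount, int groupSize)
    {
        return (itemCount + groupSize - 1) / groupSize;
    }

    // Simulates a 1D dispatch and returns how many threads get past the guard.
    public static int CountGuardedThreads(int itemCount, int groupSize)
    {
        int totalThreads = GroupCount(itemCount, groupSize) * groupSize;
        int active = 0;

        for (int id = 0; id < totalThreads; id++)
        {
            if (id >= itemCount)
                continue; // same early-out the kernel needs

            active++;
        }
        return active;
    }
}
```

With 238 instances and the 128-thread kernels, this gives 2 groups and 256 threads, of which 18 must be skipped. In kernels that call GroupMemoryBarrierWithGroupSync, it is probably safer to treat out-of-range elements as zero instead of returning early, so that every thread in the group still reaches the barrier.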