Using compute shaders for GPU multi-threading & Physics calculations

Hello,

In a recent effort to gain performance when moving 5,000 enemy units (GameObjects, RTS-like game) through one manager script (DOD style), I tried different approaches, including the Job System + Burst and GPU multithreading with HLSL compute shaders.

Trying both options, I found the compute shader approach pretty interesting. To me, it was not only far easier to implement than the Job System (which I still struggle with), but it also worked right out of the box and keeps the code very clean and small. So with that, I have already successfully offloaded my position calculations to different GPU cores.

GPU multithreading via HLSL compute shaders seems like a powerful alternative to CPU multithreading!

…

Problem:
So in the current state, there is the DOD manager script, which passes all current unit positions in one batch into the compute shader and uses multiple GPU cores to calculate the new unit positions. It then passes the new positions back to the C# manager (CPU buffer?). People said this eats up the performance gains again, but currently I only see one problem, which does cause all the performance gains to be lost: I still have to apply the new positions to the GameObjects in a for loop, iterating over the array of new positions and applying each one to its unit's transform. So in this loop we have 5,000 × `unit.transform.position = newPos`, which eats about 30 fps.
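For illustration, the current pattern looks roughly like this (names like `unitShader`, `CSMain`, and `units` are placeholders, not my exact code); the final loop is the part that costs the frames:

```csharp
using UnityEngine;

public class UnitManager : MonoBehaviour
{
    public ComputeShader unitShader;   // placeholder; kernel named "CSMain"
    public Transform[] units;          // the 5,000 unit transforms

    ComputeBuffer positionBuffer;
    Vector3[] positions;
    int kernel;

    void Start()
    {
        positions = new Vector3[units.Length];
        positionBuffer = new ComputeBuffer(units.Length, sizeof(float) * 3);
        kernel = unitShader.FindKernel("CSMain");
        unitShader.SetBuffer(kernel, "positions", positionBuffer);
    }

    void Update()
    {
        // Upload current positions, run the kernel, read the results back.
        for (int i = 0; i < units.Length; i++) positions[i] = units[i].position;
        positionBuffer.SetData(positions);
        unitShader.Dispatch(kernel, Mathf.CeilToInt(units.Length / 64f), 1, 1);
        positionBuffer.GetData(positions);   // synchronous GPU -> CPU readback

        // This loop is where the ~30 fps go: one native transform call
        // per unit, every frame.
        for (int i = 0; i < units.Length; i++)
            units[i].position = positions[i];
    }

    void OnDestroy() => positionBuffer?.Release();
}
```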

Question:
I know this is where the Job System would shine. But isn't there a way to move the units through the GPU, so that the compute shader does not even have to pass the new positions back to the C# unit manager?

Demonstration of what GPU power looks like:
(UEBS2, shows millions of 3D-animated units in Unity): https://youtu.be/kpojDPlIjdQ?t=56 (with time marker)
This proves that there must be a way to move units via the GPU. Maybe they are not GameObjects, but they must have something like a transform, which the developer of the video linked above seems to have figured out.

EDIT: after posting, I answered the following questions with ChatGPT. The answers are marked "EDIT" below. Still interested if anyone has additional input.

Final thoughts:

  • How do I get a "bridge" from C# Mono GameObjects to a GPU compute shader (HLSL)?
    EDIT: the answer here is Unity's ComputeBuffer API, which I have been using.

  • Is there even a way for HLSL compute shaders or the GPU to 'talk' to MonoWorld GameObjects? EDIT: again, Unity's ComputeBuffer API.

  • Currently, I think the only way may be to have the units fully managed on the GPU, so that no Mono objects exist, as the GPU can't access regular (CPU) RAM without buffering back. Or is there a way to get around the drawback of buffering back from GPU to CPU?
    EDIT: This approach seems right (managing units fully on the GPU and using GPU instancing); see the sketch after this list. For talkback from the GPU to C#, one can use Unity's ComputeBuffer API.

  • This approach seems extremely powerful, so why has Unity not looked further into this yet? We already have compute shaders, which (if I understood correctly) are made exactly for such calculations (physics, for example). If Unity could provide a bridge (API) to access GameObjects in MonoWorld from within the compute shader, that would open a massive gate for new architectural options! EDIT: here again, the answer is Unity's ComputeBuffer API.
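EDIT: for anyone curious, here is a rough sketch of what "fully managed on the GPU" could look like (all names are placeholders; this is just the shape of the idea, not tested code): create the buffer once, dispatch every frame, and never call GetData, so the positions never leave the GPU. Rendering then has to read the same buffer, e.g. via instancing (see the replies below).

```csharp
using UnityEngine;

public class GpuResidentUnits : MonoBehaviour
{
    public ComputeShader unitShader;   // placeholder; kernel named "CSMain"
    public int unitCount = 5000;

    ComputeBuffer positionBuffer;
    int kernel;

    void Start()
    {
        positionBuffer = new ComputeBuffer(unitCount, sizeof(float) * 3);
        positionBuffer.SetData(new Vector3[unitCount]);  // one-time upload
        kernel = unitShader.FindKernel("CSMain");
        unitShader.SetBuffer(kernel, "positions", positionBuffer);
    }

    void Update()
    {
        // Positions are updated in place on the GPU: no GetData, no
        // per-unit transform writes, so the CPU cost stays flat.
        unitShader.SetFloat("deltaTime", Time.deltaTime);
        unitShader.Dispatch(kernel, Mathf.CeilToInt(unitCount / 64f), 1, 1);
    }

    void OnDestroy() => positionBuffer?.Release();
}
```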

I am interested if some of you guys have stumbled upon the same problem, or if someone maybe has found a solution to this.

Thanks,
Regards!

Don't you just love how you rarely get an answer to your questions, if at all? Yeah, I've had the same experience, so I've ended up having to figure a lot of things out myself.

Recently, I've been trying to find a way to offload physics calculations onto the GPU, as I want to create games with fully explorable and fully destructible buildings. The Jobs system was a nightmare to work with at first, but it becomes quite nice once you get the hang of it, despite the lack of documentation on how to do stuff. I started an implementation of my game ideas by creating a multithreaded destruction system using DOTS, but the performance from Havok Physics plus my optimizations, while much better than PhysX from past versions of Unity, still isn't performant enough for what I want to achieve. I need an extremely high rigidbody budget (in the hundreds of thousands) with good performance.

My guess is that to achieve what you want, you’d have to do something similar to how this dude moves the positions of rigidbodies directly on the GPU: GitHub - jknightdoeswork/gpu-physics-unity: Through this configuration, no per voxel data is transferred between the GPU and the CPU at runtime.


Because this isn't how hardware works. The GPU can't really read/write directly to system RAM and must pass data back to the CPU for that.
And trying to make an easily customizable architecture around using compute shaders for that would be a mess due to platform and hardware differences, and the functional limitations of GPU compute in general.

You've answered your own question here as to what would be the correct approach… use the Job system (more specifically, DOTS), because it gives you a much more performant alternative to MonoBehaviours, which are where the biggest slowdown comes from when modifying lots of units. It gets rid of them entirely, instead using extremely efficient, vectorized, well-organized memory collections that the CPU can rip through blazingly fast. So I'm not sure why you didn't just explore this avenue. Computing the collisions with this system would also have given you a massive speed-up, without having to transfer huge chunks of data to the GPU each frame, which is going to have an even more noticeable impact on lower-bandwidth platforms.
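For the exact bottleneck described in the original post (writing 5,000 positions back to transforms), the Job system even has a dedicated path: an IJobParallelForTransform scheduled over a TransformAccessArray writes to transforms from worker threads instead of one by one on the main thread. A rough sketch, with illustrative names:

```csharp
using Unity.Burst;
using Unity.Collections;
using UnityEngine;
using UnityEngine.Jobs;

[BurstCompile]
struct ApplyPositionsJob : IJobParallelForTransform
{
    [ReadOnly] public NativeArray<Vector3> newPositions;

    public void Execute(int index, TransformAccess transform)
    {
        transform.position = newPositions[index];
    }
}

public class UnitMover : MonoBehaviour
{
    public Transform[] units;          // the 5,000 unit transforms
    TransformAccessArray accessArray;
    NativeArray<Vector3> newPositions;

    void Start()
    {
        accessArray = new TransformAccessArray(units);
        newPositions = new NativeArray<Vector3>(units.Length, Allocator.Persistent);
    }

    void Update()
    {
        // Fill newPositions first (simulation, readback, etc.), then
        // spread the transform writes across worker threads.
        new ApplyPositionsJob { newPositions = newPositions }
            .Schedule(accessArray)
            .Complete();
    }

    void OnDestroy()
    {
        accessArray.Dispose();
        newPositions.Dispose();
    }
}
```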

We shouldn't try to shove EVERYTHING onto the GPU; that's wasteful. You'll make your game heavily GPU-bound and let a valuable hardware resource (the CPU) go to waste without much work to do. It's about finding a good balance of workload between the two; that is where maximum performance is found. But that also means doing things efficiently on the CPU (by using DOTS) instead of just saying "the default isn't working well, so I'll use the GPU instead".

The issue is the extremely slow speed of memory transfers from GPU to CPU. The CPU and GPU each have their own RAM. In a heterogeneous scenario, where game objects are maintained by both processors, a synchronized copy of the data would have to be maintained in both processors' memory. This isn't impossible, but it would be very difficult, because you would always have to work with the GPU-to-CPU bandwidth bottleneck in mind.

See MJP’s post: GPU Memory Pools in D3D12
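If data does need to come back to the CPU, one partial workaround (a Unity-side option, not something from MJP's post) is AsyncGPUReadback: it avoids stalling the main thread, at the cost of receiving the data a few frames late. Something like:

```csharp
using UnityEngine;
using UnityEngine.Rendering;

public class NonBlockingReadback : MonoBehaviour
{
    public ComputeBuffer positionBuffer;   // filled by a compute shader elsewhere
    Vector3[] latestPositions;

    void Update()
    {
        // Request a copy; the callback fires a few frames later instead of
        // blocking the CPU the way GetData() does.
        AsyncGPUReadback.Request(positionBuffer, OnReadback);
    }

    void OnReadback(AsyncGPUReadbackRequest request)
    {
        if (request.hasError) return;
        latestPositions = request.GetData<Vector3>().ToArray();
        // latestPositions is now a few frames stale, which is often
        // acceptable for gameplay logic that only samples positions.
    }
}
```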


It would indeed be really nice to have more "batch operation" APIs for doing this kind of stuff, in this case something like:
Transform.SetPositions(Span<Transform> transforms, Span<Vector3> positions)

There is a huge overhead in doing a large number of native calls, and such batch operations also usually open optimization opportunities on the native side. I wouldn't be surprised if the majority of your loop time is directly caused by that overhead. There have been moves toward adding such APIs to address this specific issue, but unfortunately they were quite limited in scope; see https://discussions.unity.com/t/900985

GameObjects and Transforms aren't strictly required; you should look into CommandBuffer and the DrawMesh* methods.
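For example, something like Graphics.DrawMeshInstancedIndirect can render every unit in one draw call while reading per-unit positions straight from the same ComputeBuffer the simulation writes, so nothing ever round-trips through transforms. A rough sketch (assumes a material whose shader indexes `positionBuffer` by instance ID; all names here are made up):

```csharp
using UnityEngine;

public class GpuInstancedRenderer : MonoBehaviour
{
    public Mesh unitMesh;                  // assumed assets; the material's
    public Material unitMaterial;          // shader must read positionBuffer
    public ComputeBuffer positionBuffer;   // written by the simulation kernel
    public int unitCount = 5000;

    ComputeBuffer argsBuffer;

    void Start()
    {
        // Indirect draw arguments: index count, instance count, offsets.
        var args = new uint[] { unitMesh.GetIndexCount(0), (uint)unitCount, 0, 0, 0 };
        argsBuffer = new ComputeBuffer(1, args.Length * sizeof(uint),
                                       ComputeBufferType.IndirectArguments);
        argsBuffer.SetData(args);
        unitMaterial.SetBuffer("positionBuffer", positionBuffer);
    }

    void Update()
    {
        // One draw call for all units; positions never leave the GPU.
        Graphics.DrawMeshInstancedIndirect(
            unitMesh, 0, unitMaterial,
            new Bounds(Vector3.zero, Vector3.one * 1000f), argsBuffer);
    }

    void OnDestroy() => argsBuffer?.Release();
}
```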