Processing many billboards (or processing many anything!)

Hi there! I know Unity makes several tools available, such as compute shaders, the C# Job System, DOTS, etc., but my knowledge here is a bit limited and I was hoping someone might be able to give me a leg up.

I’ve got a fairly standard Unity project with the built-in render pipeline. Sometimes I come across a problem where I’ve got a lot of simple things I’d like to process and I’d like to… I dunno, parallelizify them or whatever you real programmers do.

For example, I’ve got a tree here with a lot of billboard cards, each slightly different because they’ve got custom normals applied. I’d like to have all of them billboard, but I don’t want to add 30 billboard MonoBehaviours per tree. That seems insanely stupid. Even having a master script to process them all seems stupid. Sending them to the GPU or something like that seems logical.

What’s the best way to proceed in times like this? I’m going to need to learn something new and was hoping someone might give me a recommendation for a first step. Compute, perhaps? Is the job system better for some reason? Can DOTS be half-implemented in a project or is it all-or-nothing? What kinds of options exist for parallelizifimogrification? Any strong opinions about where one should start treading, or anything to avoid? Anything that works particularly well with Unity’s built-in systems?

Thanks!

For your particular scenario you want something like GPU instancing.

However, that’s not really processing; it’s more like rendering. Obviously it’s a wide topic, ranging from multi-threading to compute shaders, with all kinds of acronyms such as SIMD, ECS, and DOTS. Each technology has its particular domain, associated hardware, and development and runtime costs.


Hey, thanks so much for taking the time to comment. Sounds like your answer is “yeah, there are a lot of solutions, but no simple rubric for evaluating which one is appropriate for a given task. You pretty much need to understand them all, good luck bruh.”

And fair enough. I do feel that a “which parallel solution is right for me” flow chart would be a good resource. For example, I’ve heard that getting data back from the GPU is expensive. If that’s not needed, it might change which solution is used (like in the case of billboards). Maybe I should take some time to research the different options (CUDA, C# jobs, etc.) and try to generate a simple rubric for selecting them, based on the overhead cost, given the task.

OH PS I literally just learned that you can do billboarding in a shader, so… derp. Pretty sure that’s a better idea than the MonoBehaviour shit I was using.

Hey this is useful:

What I should do is just amass a list of various techs and do some searches for X vs Y.

It does seem like you are getting into the deeper topics of software development, but are you forced to?
Do not fall into the trap of “premature optimization”. Does your game contain many trees? A couple hundred MonoBehaviours are still fine, and you can recover some performance via static batching. Unity has a topic on that:

You could read the parent topic of that as well.

First of all it seems you have one misconception:
Unfortunately, when it comes to rendering, you cannot have a “DoRenderALeaf(leaf_info)” method, call it in parallel for all leaves, and expect a performance bonus over separate MonoBehaviours.

The reason is that not only is reading data back from the GPU slow, sending data to it is slow as well. Every “DoRender” call would be a so-called “draw call”, and you should not have more than maybe a few thousand draw calls on PC.
Instead, you want to send all the information on what to render to the GPU at once. Usually that is not that easy. Unity’s draw call batching does help; the link above will tell you more.
Alternatively, there is Graphics.DrawMeshInstanced (see the Unity Scripting API), but you’ll quickly see that it demands more from the developer.
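
To make the “send everything at once” idea a bit more concrete, here is a rough sketch in Python (not Unity code). It only shows the bookkeeping: per-leaf transforms are grouped into a few large batches, and each batch would correspond to one instanced draw call. The function name batch_instances and the leaf data are made up for illustration, and the 1023 limit is what I recall from the DrawMeshInstanced documentation, so treat it as an assumption and check the docs.

```python
# Assumption: 1023 is the per-call instance limit documented for
# Graphics.DrawMeshInstanced; verify against the Unity docs for your version.
MAX_INSTANCES_PER_CALL = 1023


def batch_instances(transforms, batch_size=MAX_INSTANCES_PER_CALL):
    """Split a flat list of per-instance transforms into draw-call-sized batches."""
    return [
        transforms[i:i + batch_size]
        for i in range(0, len(transforms), batch_size)
    ]


# Hypothetical data: 3000 leaves, each represented by a position tuple.
leaves = [(float(i), 0.0, 0.0) for i in range(3000)]
batches = batch_instances(leaves)

# Each batch stands in for ONE instanced draw call instead of one call per leaf.
print(len(leaves), "leaves ->", len(batches), "draw calls")
```

The point is only the ratio: thousands of objects collapse into a handful of submissions, which is where the win over one draw call per object comes from.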

Another way, which I personally use to populate an underwater world with corals, is to use the particle system and have all leaves be particles (which you set by code) with infinite lifetime. The particle system also batches them into a single draw call (and it does so in a C++ implementation, so it is really fast). Of course, this is only really meaningful if your leaves are already created via an algorithm and you don’t need to place them by hand in the editor.

Here is a quick overview of the parallelism techniques as far as I know them:

Compute shaders:
A way to do massively parallel computations on the GPU. Working with them is tricky because there’s usually no debugger or print method available for code that executes on the GPU. When you harness that power you can do amazing things like these:

Every frame, a formula is executed for every single pixel of the fluid texture!
This is only really useful for very specific use cases in game development, though, and as you already know, retrieving the data from the GPU is an additional hurdle. That’s why it’s most commonly used for visual effects where all information remains on the GPU. In the case of the fluid, the newly computed fluid particle positions are used directly in the next frame, so no data needs to be sent to the CPU.

Vertex manipulation in shaders:
The enemy of batching draw calls is changing data. E.g. if you want the leaves to move in the wind, all batching is thrown out if you need to change the coordinates, angles etc. every frame. Normal, regular shaders come to the rescue. You can access a time value in a shader and manipulate the vertices as well as colors etc. to achieve the desired effect very efficiently. Randomly varying the orientation of the leaves can also be done this way.
This is less involved than compute shaders and you’ll find many tutorials.
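
Since billboarding was the original question: the per-vertex math a billboard shader does is small. Below is a minimal sketch of it in Python rather than shader code, assuming you already have the camera’s right and up vectors in world space. The names billboard_corner and billboard_quad are hypothetical; a real shader would do the same arithmetic once per vertex on the GPU.

```python
def billboard_corner(center, corner_offset, cam_right, cam_up):
    """Place one quad corner so the quad faces the camera.

    center:        world-space position of the card (x, y, z)
    corner_offset: 2D offset of this corner within the card (ox, oy)
    cam_right/up:  the camera's right and up axes in world space
    """
    ox, oy = corner_offset
    return tuple(c + ox * r + oy * u for c, r, u in zip(center, cam_right, cam_up))


def billboard_quad(center, half_size, cam_right, cam_up):
    """Return the four camera-facing corners of a square card."""
    offsets = [
        (-half_size, -half_size),
        (half_size, -half_size),
        (half_size, half_size),
        (-half_size, half_size),
    ]
    return [billboard_corner(center, o, cam_right, cam_up) for o in offsets]


# Camera looking along Z: the card spans the X/Y plane.
front = billboard_quad((0.0, 0.0, 5.0), 1.0, (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))

# Camera turned 90 degrees around Y: the same card now spans the Z/Y plane.
side = billboard_quad((0.0, 0.0, 5.0), 1.0, (0.0, 0.0, 1.0), (0.0, 1.0, 0.0))
```

Because the only per-frame input is the camera orientation, the mesh data itself never changes, which is why this plays nicely with batching.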

The topics above are interesting for rendering-related things but there can be other times you have high performance requirements. Then your options are:

Threads:
Those are almost an honorable mention because they are not very common in Unity.
In principle, however, Unity is just a C# application, so you can spawn threads and even processes. The infamous “race conditions” are the typical result, though, and Unity has put a hard stop to those: you are not allowed to access pretty much any Unity API from any thread other than the “main thread” (the one calling FixedUpdate, Update etc.). You cannot even set the transform.position of a game object.
Furthermore, spawning threads is slow. Creating and closing a thread every frame will most likely wipe out any gains from parallelization.
Thread pools help with that, but they get tricky.
Personally I only use threads to monitor a custom-written DLL.
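
As a rough illustration of the two rules above (create threads once, and let only the main thread touch shared state), here is a small Python sketch. It is not Unity code: the game_state dict stands in for anything that only the main thread may modify, and the worker function and task format are invented for the example.

```python
import queue
import threading


def worker(tasks, results):
    """Long-lived worker: pull tasks until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            break
        object_id, n = item
        # Pure computation only; the worker never touches game_state.
        results.put((object_id, sum(i * i for i in range(n))))


tasks = queue.Queue()
results = queue.Queue()

# Threads are created ONCE, not once per frame.
threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(2)]
for t in threads:
    t.start()

game_state = {}  # stand-in for state only the main thread may change

for frame in range(3):
    for object_id in range(4):
        tasks.put((object_id, 10 + frame))
    # Main thread collects the four results and applies them itself.
    for _ in range(4):
        object_id, value = results.get()
        game_state[object_id] = value

# Shut the workers down cleanly.
for _ in threads:
    tasks.put(None)
for t in threads:
    t.join()
```

The queues do the synchronization here, so there are no explicit locks, but you can already see why this gets tricky once tasks depend on each other.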

Job System:
Instead of threads that are managed by the OS, Unity has its own form of micro threads, the “jobs”. Using them is somewhat involved, but it can be learned and the benefits are awesome. You need a more data-oriented way of thinking (as opposed to object-oriented), but in return Unity can guarantee that there are no race conditions without you having to deal with semaphores, mutexes and similar constructs. Despite running on the CPU, you can easily start hundreds of jobs and they will be executed with very little overhead. Unity uses them internally for several systems as well.
API-wise, jobs work a tiny bit like compute shaders: you provide the data, launch the job (via Schedule()) and later (ideally after 1-4 frames), when you need the result, you enforce completion with Complete() (in case it hasn’t finished yet). Since it all happens on the CPU, there’s no extra time needed for data retrieval.
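
The schedule-now, complete-later pattern can be sketched with Python’s standard library. This is only an analogy for the shape of the API, not how Unity’s Job System is implemented: the schedule helper and move_positions function are hypothetical, and a Python Future plays the role of the job handle.

```python
from concurrent.futures import ThreadPoolExecutor


def move_positions(positions, velocity, dt):
    """The 'job': a pure function over the data it was handed."""
    return [p + velocity * dt for p in positions]


pool = ThreadPoolExecutor(max_workers=2)


def schedule(positions, velocity, dt):
    """Hand the job its own copy of the input and return a handle immediately."""
    return pool.submit(move_positions, list(positions), velocity, dt)


# "Schedule": returns right away, the work runs in the background.
handle = schedule([0.0, 1.0, 2.0], velocity=2.0, dt=0.5)

# The main thread is free to do unrelated work in the meantime.
other_work = sum(range(100))

# "Complete": blocks only if the job has not finished yet.
new_positions = handle.result()

pool.shutdown()
```

Copying the input before submitting is what keeps this free of races in the sketch; as far as I understand it, Unity’s native containers and safety checks exist to give you the same guarantee without the copy.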

Finally, jobs work in tandem with the Burst compiler. As a programmer you do not need to do much besides adding it to the project and marking your job structs accordingly. It can result in a massive speed boost compared to regular C#. I’ve witnessed 40x.

ECS/DOTS:
The Entity Component System, which is part of Unity’s Data-Oriented Technology Stack, goes another step further.
Parallelizing code execution is sometimes not enough, because even if you tell your CPU to compute X things in parallel and it has the cores to do so, memory is often the final bottleneck.
That’s why, to achieve the absolute maximum, you need to think about how your data is stored in memory so it is available as quickly as possible. That usually works by keeping data that’s needed at the same time “nearby”, so there is less random memory access.
ECS is a framework for exactly that.
It builds upon Jobs and Burst but tries to have few ties to MonoBehaviour instances, because those are stored in those undesirable arbitrary memory locations. Instead you have ordered “Entities” with their data stored in lists.
It is a different way of programming. However, you do not need to dig into it right away unless you intend to develop an RTS game :)
It’s also still in development.
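
The difference in data layout can be shown without any Unity API. The Python sketch below contrasts one object per leaf with one contiguous array per field. The class and function names are made up, and Python itself will not show the cache benefit; the sketch is only meant to show what “data stored in lists” looks like compared to scattered objects.

```python
from array import array


class LeafObject:
    """Object-oriented layout: each leaf is its own heap object."""

    def __init__(self, x, speed):
        self.x = x
        self.speed = speed


def update_objects(leaves, dt):
    # Walks a list of references; the objects may live anywhere in memory.
    for leaf in leaves:
        leaf.x += leaf.speed * dt


def update_columns(xs, speeds, dt):
    # Data-oriented layout: all x values sit next to each other, as do all speeds.
    for i in range(len(xs)):
        xs[i] += speeds[i] * dt


leaves = [LeafObject(float(i), 2.0) for i in range(4)]
update_objects(leaves, 0.5)

xs = array("d", [0.0, 1.0, 2.0, 3.0])   # one contiguous block of doubles
speeds = array("d", [2.0, 2.0, 2.0, 2.0])
update_columns(xs, speeds, 0.5)
```

Both versions compute the same result; the second one simply hands the CPU (and, in Unity’s case, Burst) a layout it can stream through.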

Coroutines:
Another honorable mention: despite their name, which may confuse you if you come from other programming languages, they are not “real” parallelism. They are only a concise way to tell Unity to execute something over the next X frames (or until a condition is met etc.). They are executed on the main thread, so there is no direct performance gain. They can give a small gain if the alternative is many “if” checks in an Update() method, which would be evaluated every frame for the whole game, while the coroutine might only run occasionally, say once per minute.
They are easy to learn and lead to more readable code, so they are worth learning early.
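
If it helps to see why coroutines are not parallelism, here is a tiny frame loop in Python that mimics the idea with generators. It is a sketch of the concept, not Unity’s implementation: run, blink and wait_frames are invented names, and everything runs on a single thread, one step per “frame”.

```python
def wait_frames(n):
    """Pause the calling coroutine for n frames."""
    for _ in range(n):
        yield


def blink(log):
    """A coroutine: do something, wait two frames, do something else."""
    log.append("on")
    yield from wait_frames(2)
    log.append("off")


def run(coroutines, frames):
    """Single-threaded 'engine loop': advance each coroutine once per frame."""
    active = list(coroutines)
    for _ in range(frames):
        still_running = []
        for co in active:
            try:
                next(co)  # run until the coroutine's next yield
                still_running.append(co)
            except StopIteration:
                pass  # coroutine finished; drop it
        active = still_running
    return active


log = []
remaining = run([blink(log)], frames=5)
```

Nothing here ever runs at the same time as anything else; the coroutine just gives control back to the loop at each yield and gets resumed on a later frame.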

Hope this gives you a head start :)

P.S. Before you employ any of the parallelism solutions: use the Profiler and Frame Debugger to verify that you actually have a performance problem in the code you are trying to rework.


DAMN! That’s an epic response! Thank you so much for taking the time to write all of this! It’s very helpful. I love that you listed them all and offered some pros and cons that you saw. Really quite lovely.

In response to your first question - yeah, don’t worry, I’m in no hurry to apply undue optimization. This is a general question that has come up over time in development when I’ve discovered I’ve had to optimize some poorly made system written by some idiot (me) and I’ve gotten intimations like “I’m pretty sure that X, Y, or Z system would help me here if I knew them.” Here’s an example of something I built. Every frame it needs to process billboarding and setting the animation frame, and all that happens on the main thread. In a large scene that starts to add up.