Unity DOTS vs Compute Shader

I was curious which is faster for large amounts of calculation: DOTS (Jobs + Burst + Collections) or compute shaders. So I put together this speed test. Basically I create a huge array and tell an IJobFor and a compute shader to fill it with the result of a difficult math function. I schedule each one and wait for it to complete in the background by checking each frame from a coroutine whether it has completed. Then I log the time. In my own results the DOTS job usually wins. Transferring the array back from the GPU takes a while. Turning off Safety Checks in the Burst menu sometimes makes the jobs run faster and sometimes not.
I have an Intel i7-8700 6-core CPU and an RTX 2070 GPU.
My conclusion is that DOTS is a bit faster. Which you should use depends on which one is easier for you to use and whether you need the data on the GPU for rendering and don't need it transferred back to CPU land.
I'm curious what other people's results are and whether you think my test is even valid. I'd love to hear other people's thoughts/knowledge on this topic.

Here are the scripts:

using System.Collections;
using UnityEngine;
using Unity.Jobs;
using Unity.Collections;
using Unity.Burst;
using Unity.Mathematics;


public class Dispatcher : MonoBehaviour
{
    [SerializeField] int arrayLength = 4_000_000;
    [SerializeField] int jobBatchSize = 32;
    [SerializeField] ComputeShader shader = null;
    [SerializeField] bool fetchWholeArrayFromShader = false;
    int randomIndex;

    public void RunTest()
    {
        Debug.Log("--------");
        randomIndex = UnityEngine.Random.Range(0, arrayLength);
        Debug.Log($"index {randomIndex}");
        RunDummyJobToForceEarlyCompilation();
        StartCoroutine(StartJob());
    }

    void RunDummyJobToForceEarlyCompilation()
    {
        float startT = Time.realtimeSinceStartup;
        var dummyArray = new NativeArray<float>(1, Allocator.TempJob);
        var dummyJob = new Job() { results = dummyArray };
        var dummyHandle = dummyJob.Schedule(1, new JobHandle());
        dummyHandle.Complete();
        Log(startT, "dummyJob", "");
        dummyArray.Dispose();
    }

    IEnumerator StartJob()
    {
        float startTime = Time.realtimeSinceStartup;
        var results = new NativeArray<float>(arrayLength, Allocator.TempJob);
        var job = new Job() { results = results };
        var handle = job.ScheduleParallel(arrayLength, jobBatchSize, new JobHandle());
        while (!handle.IsCompleted)
        {
            yield return null;
        }
        handle.Complete();
        float sampleValue = results[randomIndex];
        results.Dispose();
        Log(startTime, "Job", sampleValue.ToString());

        StartCoroutine(StartComputeShader());
    }

    IEnumerator StartComputeShader()
    {
        float startTime = Time.realtimeSinceStartup;
        int kernel = shader.FindKernel("CSMain");
        ComputeBuffer buffer = new ComputeBuffer(arrayLength, sizeof(float));
        shader.SetBuffer(kernel, "results", buffer);
        uint x, y, z;
        shader.GetKernelThreadGroupSizes(kernel, out x, out y, out z);
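        int groupSize = (int)(x * y * z);
        // Note: integer division truncates, so if arrayLength is not a multiple of
        // groupSize the last (arrayLength % groupSize) elements are never written.
        shader.Dispatch(kernel, arrayLength / groupSize, 1, 1);
        // The async request copies the whole buffer back; the GetData calls below
        // then read the buffer a second time, synchronously.
        var request = UnityEngine.Rendering.AsyncGPUReadback.Request(buffer);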
        while (!request.done)
        {
            yield return null;
        }
        float sampleValue;
        if (fetchWholeArrayFromShader)
        {
            float[] results = new float[arrayLength];
            buffer.GetData(results);
            sampleValue = results[randomIndex];
        }
        else
        {
            float[] results = new float[1];
            buffer.GetData(results, 0, randomIndex, 1);
            sampleValue = results[0];
        }
        buffer.Release();
        Log(startTime, "Compute Shader", sampleValue.ToString());
    }

    void Log(float startTime, string workName, string sampleValue)
    {
        Debug.Log($"{(Time.realtimeSinceStartup - startTime) * 1000} ms {workName}, sample value {sampleValue}");
    }
}

[BurstCompile(CompileSynchronously = true)]
struct Job : IJobFor
{
    public NativeArray<float> results;

    public void Execute(int index)
    {
        results[index] = math.sin(math.sin(3));
    }
}

#if UNITY_EDITOR
[UnityEditor.CustomEditor(typeof(Dispatcher))]
public class Dispatcher_Editor : UnityEditor.Editor
{
    public override void OnInspectorGUI()
    {
        if (GUILayout.Button("Run Test"))
        {
            (target as Dispatcher).RunTest();
        }
        base.OnInspectorGUI();
    }
}
#endif

Compute shader:

#pragma kernel CSMain

RWStructuredBuffer<float> results;

[numthreads(1024, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    results[id.x] = sin(sin(3));
}

I think your test is wrong: sin(sin(3)) is constant.
You need at least an input array of floats that will be passed to sin, so it becomes results[index] = sin(sin(input[index])).
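
A minimal sketch of that suggestion, assuming an extra input array filled on the main thread. SinJob, SinJobExample, the seed, and the 0..10 value range are illustrative choices, not part of the original scripts. Since every element now depends on data the compiler cannot know in advance, the expression should no longer fold into a constant:

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

// Hypothetical variant of the original Job: reads a per-element input
// so that sin(sin(x)) has to be evaluated at runtime for every index.
[BurstCompile(CompileSynchronously = true)]
struct SinJob : IJobFor
{
    [ReadOnly] public NativeArray<float> input;
    [WriteOnly] public NativeArray<float> results;

    public void Execute(int index)
    {
        results[index] = math.sin(math.sin(input[index]));
    }
}

static class SinJobExample
{
    // Fills the input with pseudo-random values, runs the job to completion,
    // and returns a single sample from the results.
    public static float Run(int arrayLength, int batchSize, int sampleIndex)
    {
        var input = new NativeArray<float>(arrayLength, Allocator.TempJob);
        var results = new NativeArray<float>(arrayLength, Allocator.TempJob);

        // Arbitrary non-zero seed (assumption; any non-zero value works).
        var random = new Unity.Mathematics.Random(12345u);
        for (int i = 0; i < arrayLength; i++)
            input[i] = random.NextFloat(0f, 10f);

        var job = new SinJob { input = input, results = results };
        job.ScheduleParallel(arrayLength, batchSize, default).Complete();

        float sample = results[sampleIndex];
        input.Dispose();
        results.Dispose();
        return sample;
    }
}
```

For a fair comparison the compute shader would need the same values uploaded into a second buffer (for example with ComputeBuffer.SetData), and that upload is part of the GPU cost.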

Also, you are doing very little work, so most of the time is spent outside the actual calculations, since reading and writing data to the GPU is the expensive part.

My view on the subject is the following.
GPU vs CPU.
Using DOTS you can easily extend functionality. But the main question is: do I want to use the GPU for calculations, while the CPU sits idle and the GPU does heavy rendering? Or do I use the CPU to offload the GPU, so I can have fancier effects on the GPU?
Many games underutilise the CPU. So why not take advantage of it with DOTS?

I think it is going to vary a lot depending on what your task is, and what other tasks you are doing. And also hardware, of course.

I did a boids comparison last year. The code did not use space partitioning but was otherwise optimized. I was able to get 7000 boids at 60fps with just burst jobs and Graphics.DrawMeshInstanced - no ECS. That was my best version. The compute shader approach (with asynchronous retrieval of data and Graphics.DrawMeshInstanced) got me about 4500 boids, and the ECS version (with hybrid renderer) got me about 3500 boids at 60 fps. ECS went up to 5000 and 6000 if I updated a half or third of the boids each frame, in a way that fit the asynchronous compute shader's overall rate a little closer.
Burst jobs alone are very, very good. Everyone should be using them all the time. Compute shaders will likely be better when you can send the data to the GPU and keep it there. Reading from the GPU is the slowest part. I think ECS is going to scale up well as you have more and more tasks in your game / program. I don't think there is going to be one right answer. You will have to test for your specific situation.

FYI, Dyson Sphere Program uses the GPU for logic extensively. I've read the solar sail logic is implemented on the GPU.
Maybe even the conveyor logic is on the GPU as well: when you change a conveyor, you can see all the other conveyors stutter a bit. I am guessing it's updating the compute buffer.
Not to mention all the animations are GPU vertex animation.
This is one game that takes performance to the extreme.

https://www.zhihu.com/question/442555442

I suggest you check out the Unity DOTS samples and boids. There is a fish school sample using boids. I can render 50k fish using this approach on my 5-year-old rig, in the editor. The sample would probably run even better now.

https://www.youtube.com/watch?v=Iv_ZktC865A

That's nothing.

https://www.youtube.com/watch?v=mNZq0RhM-98

You can download the code to try it yourself as well. I can get 400,000 at 30fps on my computer, and probably more if I optimised other stuff.

Thanks everyone for your replies! @JesOb and @HellGate94 you guys are probably right. Perhaps my test would show different results if I follow your advice. But I think for now I have enough info to go off of. What I was trying to figure out was if one or the other wins in a landslide. Before I ran this test I was expecting the GPU to massively outperform the CPU because the GPU has way more cores. But now I sort of view DOTS and compute shaders on roughly the same level. Now I agree with @Antypodish that it really depends on your use case. I was about ready to totally rewrite my project to use compute shaders instead of DOTS and maybe I would have done that if this test showed compute shaders to be far far superior, but now I will just stick with DOTS since the performance is in the same ballpark.

Yeah I've seen the Unity boids sample. I think the space partitioning gives a big boost to performance, though it's hard to speculate exactly since Unity's boids are very different from mine. They also use a few separate schools, which is an easy way to double the overall number, but that to me isn't the same. Still, excellent performance with a single Unity school due to the space partitioning. I never got around to implementing the space partitioning because my goal at the time was really just to compare different approaches (compute shader, MonoBehaviour + Burst jobs, and ECS), and by the time I had done that, I was pretty satisfied and ready to move on. I might pick it back up one day and try to push it a little farther, but for now it served its purpose of comparison.

Hi @joshrs926, good on you to test your assumptions before blindly rewriting your algorithms! I would add that it's equally important to measure what you're looking to optimize to make sure that's where you need to direct your optimization efforts.

Regarding your benchmark however, it's a bit flawed unfortunately:

  • It's measuring a roundtrip from a coroutine: creating a buffer, (asynchronously) executing a command on the GPU, asynchronously downloading the results from the GPU, and finally waiting for the next coroutine execution to stop the stopwatch. So there is orders of magnitude more "overhead" happening here than the GPU execution itself.
  • sin(sin(3)); is a constant operation (the compiler computes it). Doing that 1024 times is probably in total less than 10 cycles per "thread". This is a trivial amount of work, and just setting up the shader execution takes longer than the actual shader.
  • In the context of Unity, compute shaders are more useful for stuff that doesn't need to be downloaded back to the CPU (e.g. animating particles, processing a texture).
  • GPU execution time should be measured with some kind of GPU profiler. Since you're timing coroutines, both tests probably give around the same results simply because the coroutines are resumed once per frame, so at ~16 ms intervals at 60 fps.
  • some other more minor things ;)
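
To illustrate the point about coroutine timing, here is a rough sketch (not the original benchmark) that times only the job by blocking on Complete() with a System.Diagnostics.Stopwatch instead of polling once per frame. BlockingJobTimer and FillJob are made-up names, and the index-dependent input is an assumption so that the work is not a constant. It only covers the CPU side; the GPU side still needs a GPU profiler, as noted in the list above.

```csharp
using System.Diagnostics;
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;
using UnityEngine;

// Hypothetical component: times a Burst job without frame-interval quantization.
public class BlockingJobTimer : MonoBehaviour
{
    [SerializeField] int arrayLength = 4_000_000;
    [SerializeField] int jobBatchSize = 32;

    [BurstCompile(CompileSynchronously = true)]
    struct FillJob : IJobFor
    {
        public NativeArray<float> results;

        public void Execute(int index)
        {
            // Depends on index, so it cannot be reduced to a compile-time constant.
            results[index] = math.sin(math.sin((float)index));
        }
    }

    void Start()
    {
        var results = new NativeArray<float>(arrayLength, Allocator.TempJob);
        var job = new FillJob { results = results };

        // Warm-up run so one-time costs (such as Burst compilation) are not timed.
        job.ScheduleParallel(arrayLength, jobBatchSize, default).Complete();

        // Timed run: Complete() blocks the main thread until the job has finished,
        // so the elapsed time is not rounded up to the next frame.
        var stopwatch = Stopwatch.StartNew();
        job.ScheduleParallel(arrayLength, jobBatchSize, default).Complete();
        stopwatch.Stop();

        // UnityEngine.Debug is fully qualified because System.Diagnostics also has a Debug class.
        UnityEngine.Debug.Log($"Job: {stopwatch.Elapsed.TotalMilliseconds:F3} ms, sample {results[0]}");
        results.Dispose();
    }
}
```

Blocking the main thread like this is only reasonable inside a benchmark; in a game you would normally let the job run in the background as the original script does.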

Don't feel bad though because benchmarks are hard to get right. Even when they are technically correct, it's really easy to unknowingly measure a use-case that differs from what we actually wanted to measure.

As fast as DOTS can be, compute shaders can absolutely blow it out of the water for certain classes of workloads.

Despite all this, I agree with your conclusion based on the test you did. You don't seem to have enough of a workload to warrant using compute shaders and the added complexity will just slow you down at this point.

I would really love to see the results of a test: AMD (32/64 cores, provided the OS/Unity can utilize all cores) with DOTS vs RTX 3090 with CS. I'm sure CS will win (I guess in most tests, but of course not all), but I'm not sure it will own DOTS.

There are a lot of people doing cool stuff with the gpu. But I think most are rather clueless about where to start. I was until I did more of a deep dive into this area.

Focus on rendering first. Engines, especially Unity due to some limited APIs, barely scratch the surface in this area. This is highly likely where your biggest bang-for-buck items are. It requires deeper knowledge of rendering than of compute per se to leverage well.

GPU concurrency models often require some fairly complex approaches. A naïve implementation can hurt more than help, say by negatively impacting rendering.

That said, imperfect can be ok. The gains can be good enough for a naïve implementation to still work. Nvidia for example has a whole suite of software that often performs orders of magnitude better than what engines do. In comparison the engine version is naïve. But of course Nvidia developers are uniquely qualified.

Problems that fit well are generally well known. You don't need to go looking for where to use the GPU. If you have some specific problem you are solving and compute is a good fit, then just basic Google research is going to tell you that.

Thanks for the reply! It's good to hear that compute shaders can blow DOTS out of the water in certain cases, as that's what I would expect since a GPU generally has way more cores than a CPU. I'm getting the impression that compute shaders are generally good when the results are needed on the GPU each frame, or if the job is performed once in a while and it's ok for the CPU to wait a bit for results, and when the computations can be split into roughly similar sized, simple chunks. My current project requires the job to be performed just once in a great while and it's ok for the CPU to wait a while for the results, but it would be really hard to split up the job into simple equal chunks. Each chunk can vary in complexity/time to complete. So it seems best for now to do this on the CPU.
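
For chunks of uneven cost on the CPU, one option is a small batch count. This is only a sketch with made-up names (UnevenWorkJob, UnevenWorkExample, and the iterations array standing in for "complexity per chunk"); as far as I understand the job system, a batch count of 1 hands out chunks one at a time, so a worker that finishes a cheap chunk can move on to the next one instead of waiting on a neighbour stuck with an expensive batch.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

// Hypothetical job whose cost per index varies with iterations[index].
[BurstCompile]
struct UnevenWorkJob : IJobFor
{
    [ReadOnly] public NativeArray<int> iterations;
    [WriteOnly] public NativeArray<float> results;

    public void Execute(int index)
    {
        // Placeholder workload: a partial harmonic sum with a per-chunk length.
        float sum = 0f;
        for (int i = 0; i < iterations[index]; i++)
            sum += 1f / (i + 1);
        results[index] = sum;
    }
}

static class UnevenWorkExample
{
    // Both arrays are assumed to have the same length and to be owned by the caller.
    public static void Run(NativeArray<int> iterations, NativeArray<float> results)
    {
        var job = new UnevenWorkJob { iterations = iterations, results = results };

        // Batch count of 1: suited to expensive, uneven chunks.
        // Cheap uniform work would normally use a larger batch (e.g. 32 or 64).
        job.ScheduleParallel(iterations.Length, 1, default).Complete();
    }
}
```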

I'd say avoid using compute shaders unless you've got no other choice. I've spent a great deal of time messing with them for my grass, and converting it all to Burst + Jobs was the best thing I've done. Sure, there was no Burst + Jobs when I started developing the grass, so it was literally the only way to get all that grass generated within a player's attention span, but it's much easier to interact with it when I don't need to ask the GPU to do things and wait for the answer multiple frames in the future.
There is still one compute shader I use that serializes a two-dimensional array into a single-dimensional array, so I only need to send which 'groups of up to 1024' blades of grass I want rendered, and the compute shader serializes the blocks into a single array for instanced rendering. That way I leave some heavy work for the GPU, and I also drastically reduce the amount of data I send to it every frame. There is a very specific time and place to use compute shaders.

I am really hyped for the day we can send burstable jobs to the GPU without having to write a ton of code, since most of the burst restrictions are already basically the same restrictions you have on the GPU shaders anyways.

I find this a fascinating thread and would like to see more use cases and/or examples that define a GPU vs CPU approach.

Some examples of compute shader usage in Unity are:

Manipulating data already on the GPU (textures, meshes, etc.) in some way and using the result on the GPU is usually a clear win for compute shaders. This avoids costly CPU-GPU copies.
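
A rough sketch of the "keep it on the GPU" case, under several assumptions: the compute shader has a kernel named CSMain writing to a buffer called results (as in the script at the top of the thread), and the material uses a hypothetical instanced shader that declares and reads a StructuredBuffer<float> with the same name. GpuResidentExample and its fields are illustrative. Nothing is ever read back to the CPU:

```csharp
using UnityEngine;

// Hypothetical component: a compute shader fills a buffer every frame and
// a material consumes it directly for instanced rendering.
public class GpuResidentExample : MonoBehaviour
{
    [SerializeField] ComputeShader shader = null; // assumed to have a "CSMain" kernel
    [SerializeField] Material material = null;    // assumed to read a "results" buffer
    [SerializeField] Mesh mesh = null;
    // Assumed to be a multiple of the kernel's thread group size,
    // so no thread writes past the end of the buffer.
    [SerializeField] int count = 1024 * 64;

    ComputeBuffer buffer;
    int kernel;
    int groups;

    void OnEnable()
    {
        buffer = new ComputeBuffer(count, sizeof(float));
        kernel = shader.FindKernel("CSMain");
        shader.GetKernelThreadGroupSizes(kernel, out uint x, out _, out _);
        groups = (count + (int)x - 1) / (int)x; // round up to whole thread groups

        // The same GPU buffer is bound to both the compute kernel and the material.
        shader.SetBuffer(kernel, "results", buffer);
        material.SetBuffer("results", buffer);
    }

    void Update()
    {
        shader.Dispatch(kernel, groups, 1, 1);

        // Large arbitrary bounds so the instances are not culled (assumption).
        var bounds = new Bounds(Vector3.zero, Vector3.one * 1000f);
        Graphics.DrawMeshInstancedProcedural(mesh, 0, material, bounds, count);
    }

    void OnDisable()
    {
        buffer?.Release();
        buffer = null;
    }
}
```

Newer Unity versions offer other instanced-draw entry points, so treat the draw call here as one possible choice rather than the recommended one.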

When the result of an algorithm needs to be accessible on the CPU in some way, there is a good chance that burst compiled jobs will win over compute shaders. It all depends :)

Compute shaders are harder to work with though since you can't inspect what they do as easily.

Thank you for this breakdown. Super informative to read about how you guys use/think of Compute Shaders.

I've added something like "When to use compute shaders vs burst vs jobs" to my list of subjects I'd like to write a blog post about. I can see how it would be interesting for Unity devs. I have no idea when I'll have time to work on that, but it's noted :)

Curious if this ever came to fruition.