2019.2 DOTS image processing performance

Lately I’ve been experimenting with image processing to see how much I can squeeze out of DOTS in Unity 2019.2. Inspired by Peter77’s performance tests, I’ve decided to do some myself and share the results with you.

We’ll see how quickly we can do 100 simple color fills of a 1024x1024 RGBA32 image. We’ll use the old methods of working with texture pixels, then the new way using native arrays and jobs with Burst compilation, and we’ll finish with a CPU DOTS vs GPU compute shader boss fight.

Currently, as far as I know, there are seven methods of modifying a texture. Those would be:

  1. Using GetPixels() to get an array of colors, modify them in a loop, then SetPixels() and Apply() the texture.

  2. Using GetRawTextureData() to get an array of bytes, modify them in a loop, then LoadRawTextureData() and Apply() the texture.

  3. Using the generic GetRawTextureData<>() to get a native array of colors, queue parallel, Burst-compilable jobs and Apply() the texture after they finish.

  4. Running a compute shader in a loop.*

  5. Same as method 3, but tuned to get the best assembly out of the Burst compiler.

  6. Rendering any camera into a RenderTexture.*

  7. Graphics.Blit()*

  * These methods require a RenderTexture.

I wanted to avoid using ReadPixels() to get the data back on the CPU, so I’ve only tested methods 1-5.
All tests were done in a non-development release Windows x64 build. Below are the results (lower is better):

[Image: Test results chart]

[Image: Results without slowpokes]

[Image: Jobs profiler]

There’s no doubt that DOTS is the new killer way of working with images on the CPU, especially on CPUs with a high number of physical cores (I’m looking at you, Ryzen Threadripper). Compute shaders carry a huge overhead from CPU<->GPU communication, and by writing the code with assembly optimization in mind you can, depending on the algorithm, easily get better performance on the CPU, as long as it isn’t busy with other tasks.

Hardware used in the test:
CPU: i7-8750H
GPU: 1060 GTX 6GB
RAM: 16GB 2400MHz DDR4 (1 x 16GB)

I’ve added a new test case where the struct is optimized to yield better assembly, after learning more about Burst-generated assemblies from here. Thanks 5argon!
And holy smokes, it is fast. All it took was using uint4 from the Unity.Mathematics package instead of four float variables.
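
For anyone who wants to try this, here is a minimal sketch of what a uint4-based fill job could look like. It is not the code used in the benchmark above. The names FillJob and TextureFill, and the batch size of 1024, are made up for illustration. It assumes a readable RGBA32 texture whose raw data size is a multiple of 16 bytes, and a little-endian platform, so red ends up in the lowest byte of each uint.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;
using UnityEngine;

// Hypothetical job: writes the same four packed pixels to every uint4 slot.
[BurstCompile]
public struct FillJob : IJobParallelFor
{
    [WriteOnly] public NativeArray<uint4> Pixels;
    public uint4 Value;

    public void Execute(int index)
    {
        // One store covers four RGBA32 pixels (16 bytes).
        Pixels[index] = Value;
    }
}

public static class TextureFill
{
    // Hypothetical helper: fills a readable RGBA32 texture with one color.
    public static void Fill(Texture2D texture, Color32 color)
    {
        // Reinterpret the raw texture memory as uint4, without copying it.
        NativeArray<uint4> data = texture.GetRawTextureData<uint4>();

        // Pack the color into one uint: r in the lowest byte, a in the highest.
        uint packed = (uint)color.r
                    | ((uint)color.g << 8)
                    | ((uint)color.b << 16)
                    | ((uint)color.a << 24);

        var job = new FillJob { Pixels = data, Value = new uint4(packed) };

        // 1024 iterations per batch is an arbitrary starting point, not a tuned value.
        job.Schedule(data.Length, 1024).Complete();

        // Upload the modified pixels to the GPU without rebuilding mipmaps.
        texture.Apply(false);
    }
}
```

Calling TextureFill.Fill(myTexture, new Color32(255, 0, 0, 255)) should then fill the whole texture with opaque red.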


Hey, I am also writing a shaderless hue shifter (plus other methods from a common image processor) using GetRawTextureData<>(). Did you use IJobParallelFor to segment the bytes to be worked on (if the algorithm does not require neighbouring pixels)? If so, what inner loop batch count did you use for 1024x1024?

With GetRawTextureData<>() it is a bit of a hassle to check the texture format, since a texture with no alpha has 3 bytes per pixel instead of 4, but I will see how it goes.

Yes. I’ve used 1024 batches.
Taking into consideration the documentation for IJobParallelFor:

Using fewer than 512 batches was simply slower on average. Raising the batch amount to the square root of the image’s width times height, which in this case is 1024, made the timing more stable. The time spent on overhead rose by a small but visible amount, yet the total was still lower than the average time with 512 batches.

I took a look in the profiler, and the cause might be the finer granularity of the jobs at the higher batch count. Each job execution is less likely to block the next one when it only has a limited number of steps to do. This also reduced the scatter in performance.
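
To make the numbers concrete: the second argument of IJobParallelFor's Schedule call is the inner loop batch count, meaning how many iterations one batch handles. The sketch below only does the arithmetic for how many batches a given array length is split into. BatchMath is a made-up helper, and it assumes one iteration per pixel, which may not match the job layout used in the test.

```csharp
using System;

public static class BatchMath
{
    // Number of batches a parallel-for job is split into (ceiling division).
    public static int BatchCount(int arrayLength, int innerLoopBatchCount)
    {
        return (arrayLength + innerLoopBatchCount - 1) / innerLoopBatchCount;
    }

    public static void Main()
    {
        int pixels = 1024 * 1024; // one iteration per pixel (assumption)

        Console.WriteLine(BatchCount(pixels, 512));  // 2048
        Console.WriteLine(BatchCount(pixels, 1024)); // 1024
    }
}
```

With one iteration per pixel, a batch size of 1024 happens to produce exactly 1024 batches for a 1024x1024 image.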

It certainly needs some extra tests. What would the result be if other systems of a game already eat a lot of performance? What if our game simulates a lot on the CPU, but is not GPU intensive? Would that make compute shaders more attractive to use in place of a bursted job? What is the result on older/newer CPUs? What about a more expensive algorithm than just a color fill?

I might look into testing those in the near future.


Cool, in fact I am using the same batch size.

On a slightly related note, I have debugged and found that a for loop over a NativeArray gets vectorized in sets of 128 iterations, so any loop count that is a multiple of 128 should be advantageous, because leftover iterations are processed one at a time without using the big registers. (Analyzing Burst generated assemblies)


Thank you, you’ve just blown my mind. I’ll have to update the test now because of what I’ve just learned from you. It looks like I haven’t been getting optimised assemblies after all, and using four separate floats vs one float4 in a struct makes a huge difference.

Okay, we have a winner. I have added a new test case which tries to get the best assemblies from the burst compilation:

  • NativeArray + Jobs + Burst gives us an average of 32.8 ms
  • New NativeArray + Jobs + Burst + Optimizations is 9.43 ms

One image is 1024 x 1024 at 32 bits per pixel. That’s 33,554,432 bits, which is 4.19 MB. Times 100, we get 419 MB of data processed in 9.43 ms, and I don’t even have dual-channel RAM, since my laptop has only one RAM stick.
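
The arithmetic above can be double-checked with a few lines of plain C#. FillThroughput is just a throwaway helper name, and the rate it prints is only the figure implied by the numbers in this post (419 MB in 9.43 ms works out to roughly 44.5 GB/s), not a separate measurement.

```csharp
using System;

public static class FillThroughput
{
    // Size of one uncompressed image in bytes.
    public static long ImageBytes(int width, int height, int bitsPerPixel)
    {
        return (long)width * height * bitsPerPixel / 8;
    }

    // Bytes written per second when `fills` images are filled in `milliseconds`.
    public static double BytesPerSecond(long imageBytes, int fills, double milliseconds)
    {
        return imageBytes * fills / (milliseconds / 1000.0);
    }

    public static void Main()
    {
        long bytes = ImageBytes(1024, 1024, 32); // 4,194,304 bytes, about 4.19 MB
        double rate = BytesPerSecond(bytes, 100, 9.43);

        Console.WriteLine($"{bytes} bytes per image");
        Console.WriteLine($"{bytes * 100 / 1e6:F0} MB for 100 fills");
        Console.WriteLine($"{rate / 1e9:F1} GB/s implied write rate");
    }
}
```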

Interesting fact: Using the best method to do the same color fill operation 10000 times on the same image, it takes on average 392.29 ms, while doing the operation only once takes on average 3.78 ms.


You guys rock! I want to learn more about all of that!
Currently I need to stitch several tiles to create an image on a mobile phone. I am struggling with performance. Thought about writing a shader, but maybe burst compiling could be better.
If you guys have any direction to point me towards it would make my day. Thank you guys.

Very nice results indeed! Any possibility to share this test code somewhere?

You can check how to create and run jobs in this tutorial:

https://www.youtube.com/watch?v=C56bbgtPr_w

Burst compiling just requires you to put the [BurstCompile] attribute on your job struct, which causes all the magic to happen.

As for the texture data, you need to get it in the correct format to pass it into a job. You can get it by using the texture.GetRawTextureData<T>() method, where T is the struct that you want the data to be interpreted as.

If it is an RGBA32 texture, you might want to use a struct like this:

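    using System.Runtime.InteropServices;

    // One RGBA32 pixel. The explicit layout lets the four channel bytes
    // share memory with a single 32-bit value.
    [StructLayout(LayoutKind.Explicit)]
    public struct Pixel{

        // All four channels as one 32-bit value.
        [FieldOffset(0)]
        public int rgba;

        [FieldOffset(0)]
        public byte r;

        [FieldOffset(1)]
        public byte g;

        [FieldOffset(2)]
        public byte b;

        [FieldOffset(3)]
        public byte a;
    }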

or like this, for better burst optimization:

    using Unity.Mathematics;

    // Four RGBA32 pixels, one packed pixel per uint component.
    public struct Pixel4{
        public uint4 xyzw;
    }

Keep in mind that in the second example you are editing 4 pixels at once. This format is great if you just need to copy the data between textures.

Then when you call texture.GetRawTextureData<T>() you’ll get a neat NativeArray<T> object, which you can pass into your jobs.

A very basic copy job would require just two fields: one for the source NativeArray and one for the target.
In the Execute method you just assign all pixels of one NativeArray to the other in a loop and voila, you’ve copied the texture at the speed of light! To suit your needs, the job struct will also need to know the width/height of your textures, as well as the x/y position where you want the copy to land. Good luck!
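
Roughly, such a copy job could look like the sketch below. Treat it as an untested starting point, not a finished solution. CopyRegionJob, TileStitcher and the batch size of 16 are made-up names and values. It assumes both textures are readable, RGBA32 and without mipmaps, and that the tile fits entirely inside the target at the given position. There is no bounds checking.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Jobs;
using UnityEngine;

// Hypothetical job: each Execute call copies one row of the source tile.
[BurstCompile]
public struct CopyRegionJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<uint> Source;

    // Rows are written at computed offsets rather than at `index`,
    // so the parallel-for index restriction has to be lifted.
    [NativeDisableParallelForRestriction] public NativeArray<uint> Target;

    public int SourceWidth;
    public int TargetWidth;
    public int TargetX;
    public int TargetY;

    public void Execute(int row)
    {
        int sourceStart = row * SourceWidth;
        int targetStart = (TargetY + row) * TargetWidth + TargetX;

        // One uint is one RGBA32 pixel.
        for (int x = 0; x < SourceWidth; x++)
        {
            Target[targetStart + x] = Source[sourceStart + x];
        }
    }
}

public static class TileStitcher
{
    // Hypothetical helper: copies `tile` into `target` at pixel position (x, y).
    public static void CopyTile(Texture2D tile, Texture2D target, int x, int y)
    {
        var job = new CopyRegionJob
        {
            Source = tile.GetRawTextureData<uint>(),
            Target = target.GetRawTextureData<uint>(),
            SourceWidth = tile.width,
            TargetWidth = target.width,
            TargetX = x,
            TargetY = y
        };

        // One iteration per source row; 16 rows per batch is an arbitrary choice.
        job.Schedule(tile.height, 16).Complete();

        // Upload the result to the GPU without rebuilding mipmaps.
        target.Apply(false);
    }
}
```

When stitching several tiles, it is probably cheaper to run all the copy jobs first and call Apply() once at the end.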

It’s currently a part of a bigger system, but I’ll look into moving the test to a small project just for sharing :).


Just wondering if there are quicker ways to fill a texture e.g. dedicated SIMD/AVX instructions. And if the texture is going to be used on the GPU does it need to move from CPU to GPU for processing and then back?

Thank you so much for the directions Griz!


To be clear here though, a uint4 is the size of four pixels - each uint contains 4 bytes, which is one pixel of the image. It might be weird to have to process an image with the color data packed into a uint

The struct should be called a Pixel4, at least
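
To illustrate what "packed into a uint" means, here is a small plain C# sketch. PixelPacking is a made-up helper, and the byte order assumes a little-endian platform, where red sits in the lowest byte, matching the FieldOffset layout of the Pixel struct earlier in the thread.

```csharp
using System;

public static class PixelPacking
{
    // Packs four 8-bit channels into one uint, r in the lowest byte.
    public static uint Pack(byte r, byte g, byte b, byte a)
    {
        return (uint)r | ((uint)g << 8) | ((uint)b << 16) | ((uint)a << 24);
    }

    // Extracts one channel (0 = r, 1 = g, 2 = b, 3 = a) from a packed pixel.
    public static byte Channel(uint packed, int channel)
    {
        return (byte)((packed >> (channel * 8)) & 0xFF);
    }

    public static void Main()
    {
        uint orange = Pack(255, 128, 0, 255);

        Console.WriteLine($"{orange:X8}");       // FF0080FF
        Console.WriteLine(Channel(orange, 1));   // 128
    }
}
```

A uint4 then simply holds four of these packed values side by side.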

That’s correct, thank you for the heads up! I’ve edited the original post with this in mind.