Lately I’ve been experimenting with image processing to see what I could squeeze out of DOTS in Unity 2019.2. Inspired by Peter77’s performance tests, I decided to run some myself and share the results with you.
We’ll see how quickly we can do 100 simple color fills of a 1024x1024 RGBA32 image. We’ll use the old methods of working with texture pixels, the new way using native arrays, and jobs with Burst compilation, and we’ll finish with a CPU DOTS vs GPU compute shader boss fight.
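The post does not show how the fills were timed, so the following is only a minimal sketch of one way to measure 100 fills, assuming a plain `Stopwatch` loop. The names `FillBenchmark` and `Measure` are hypothetical and not taken from the original tests.

```csharp
using System;
using System.Diagnostics;

public static class FillBenchmark
{
    // Runs `fill` `iterations` times and returns the total elapsed time in milliseconds.
    public static double Measure(Action fill, int iterations = 100)
    {
        // Warm-up run so one-time costs (JIT, first allocation) are not counted.
        // This is an assumption of the sketch, not the author's stated setup.
        fill();

        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            fill();
        }
        stopwatch.Stop();

        return stopwatch.Elapsed.TotalMilliseconds;
    }
}
```

Any fill routine can be passed in as a lambda, for example `FillBenchmark.Measure(() => MyFill(texture))`, where `MyFill` stands for whichever method is under test.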
Currently, as far as I know, there are seven methods of modifying a texture. Those would be:
1. Using GetPixels() to get an array of colors, modifying them in a loop, then calling SetPixels() and Apply() on the texture.
2. Using GetRawTextureData() to get an array of bytes, modifying them in a loop, then calling LoadRawTextureData() and Apply() on the texture.
3. Using the generic GetRawTextureData<T>() to get a native array of colors, scheduling parallel, Burst-compiled jobs, and calling Apply() on the texture once they finish.
4. Running a compute shader in a loop.*
5. Same as the 3rd method, but improved by using optimal solutions to get the best assembly out of the Burst compiler.
6. Rendering any camera into a RenderTexture.*
7. Graphics.Blit()*

* For those methods we need to use a RenderTexture.
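To make the first two methods concrete, here is a minimal sketch of a color fill with each of them. It assumes a readable `Texture2D` in `TextureFormat.RGBA32`. The class and method names are hypothetical, and this is not the author's benchmark code.

```csharp
using UnityEngine;

public static class ManagedFill
{
    // Method 1: round trip through a managed Color[] array.
    public static void FillWithSetPixels(Texture2D texture, Color color)
    {
        Color[] pixels = texture.GetPixels();   // managed copy of mip level 0
        for (int i = 0; i < pixels.Length; i++)
        {
            pixels[i] = color;
        }
        texture.SetPixels(pixels);
        texture.Apply(false);                   // upload to the GPU without recalculating mipmaps
    }

    // Method 2: round trip through a managed byte[] array.
    // Assumes RGBA32, i.e. 4 bytes per pixel in R, G, B, A order.
    public static void FillWithRawBytes(Texture2D texture, Color32 color)
    {
        byte[] data = texture.GetRawTextureData();
        for (int i = 0; i < data.Length; i += 4)
        {
            data[i] = color.r;
            data[i + 1] = color.g;
            data[i + 2] = color.b;
            data[i + 3] = color.a;
        }
        texture.LoadRawTextureData(data);
        texture.Apply(false);
    }
}
```

Both variants copy the whole image into a managed array and back on every fill, which is a plausible reason for them to be the slow end of the comparison.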
I wanted to avoid using ReadPixels() to get access to the data on the CPU, so I’ve only tested methods 1-5.
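The third method is the core of the DOTS approach, so a minimal sketch may help. It assumes a readable `TextureFormat.RGBA32` texture, so that one `Color32` maps to exactly one pixel. `FillJob`, `JobFill` and the batch size of 4096 are illustrative choices, not the author's code.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

[BurstCompile]
public struct FillJob : IJobParallelFor
{
    public NativeArray<Color32> Pixels;
    public Color32 FillColor;

    // Called once per pixel index; worker threads each process batches of indices.
    public void Execute(int index)
    {
        Pixels[index] = FillColor;
    }
}

public static class JobFill
{
    public static void Fill(Texture2D texture, Color32 color)
    {
        // The returned NativeArray points directly at the texture's CPU-side data,
        // so nothing is copied and the array must not be disposed.
        NativeArray<Color32> pixels = texture.GetRawTextureData<Color32>();

        var job = new FillJob { Pixels = pixels, FillColor = color };

        // 4096 is an arbitrary batch size picked for this sketch.
        JobHandle handle = job.Schedule(pixels.Length, 4096);
        handle.Complete();

        texture.Apply(false); // upload the modified data to the GPU
    }
}
```

Unlike the managed-array methods, this version writes straight into the texture's memory, and the work is split across the job system's worker threads.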
All tests were done in a non-development release Windows x64 build. Below are the results (lower is better):
[Image: Test results chart]
[Image: Results without slowpokes]
[Image: Jobs profiler]
There’s no doubt that DOTS is the new killer way of working with images on the CPU, especially on CPUs with a high number of physical cores (I’m looking at you, Ryzen Threadripper). Compute shaders carry a huge CPU<->GPU communication overhead, and by writing code with assembly optimization in mind you can, depending on the algorithm, easily get better performance on the CPU, as long as it isn’t busy with other tasks.
Hardware used in the test:
CPU: i7-8750H
GPU: 1060 GTX 6GB
RAM: 16GB 2400MHz DDR4 (1 x 16GB)
I’ve added a new test case where the struct is optimized to yield better assembly, after learning more about Burst-generated assembly from here. Thanks 5argon!
And holy smokes, it is fast. All it took was using uint4 from the Unity.Mathematics package instead of four float variables.
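The optimized struct itself is not shown in the post, so the following is only a guess at what a `uint4`-based fill could look like, not a reconstruction of the author's code. It assumes a readable `TextureFormat.RGBA32` texture on a little-endian platform, where each `uint4` (16 bytes) covers four consecutive pixels. All names here are hypothetical.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;
using UnityEngine;

[BurstCompile]
public struct FillJobUint4 : IJobParallelFor
{
    // Each element is 16 bytes, i.e. four RGBA32 pixels.
    public NativeArray<uint4> Blocks;
    public uint PackedColor;

    public void Execute(int index)
    {
        // Writes four pixels with a single 128-bit store.
        Blocks[index] = new uint4(PackedColor);
    }
}

public static class JobFillUint4
{
    public static void Fill(Texture2D texture, Color32 color)
    {
        // Pack R, G, B, A into one uint; R ends up in the lowest byte on little-endian CPUs.
        uint packed = (uint)color.r
                    | ((uint)color.g << 8)
                    | ((uint)color.b << 16)
                    | ((uint)color.a << 24);

        // Assumes the texture's byte size is a multiple of 16,
        // which holds for a 1024x1024 RGBA32 image.
        NativeArray<uint4> blocks = texture.GetRawTextureData<uint4>();

        var job = new FillJobUint4 { Blocks = blocks, PackedColor = packed };
        job.Schedule(blocks.Length, 1024).Complete(); // batch size chosen arbitrarily

        texture.Apply(false);
    }
}
```

The idea behind this kind of change is to hand Burst a type that maps directly onto a SIMD register, so each iteration becomes one wide store instead of several narrow ones.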