No measured efficiency gain with ScheduleParallel & BurstCompile

Hey what’s up.

I am generating an ECS program to write some textures to a Cubemap asset. I’ve got it working correctly, except for one weird issue…

Using ScheduleParallel and/or BurstCompile does not make the ECS code complete any faster. I have a manual timer in the function, and removing both in favor of just running the job on the main thread gives the same timings…

I am really not sure what I am missing. Here is the code I’ve got so far.

using Unity.Burst;
using Unity.Collections;
using Unity.Entities;
using Unity.Jobs;
using Unity.Mathematics;
using UnityEngine;

public class MAIN : MonoBehaviour
{
    public Cubemap galaxyBackground;
    public int backgroundResolution;
   
    private void Awake()
    {
        float startingTime = Time.realtimeSinceStartup;

        World world = World.DefaultGameObjectInjectionWorld;
        EntityManager entityManager = world.EntityManager;
        Color[][] colorsArrays = new Color[6][];
        int backgroundPixelCount = backgroundResolution * backgroundResolution;

        EntityArchetype pixelColorArchetype = entityManager.CreateArchetype(ComponentType.ReadWrite<APixelColor>());
        entityManager.CreateEntity(pixelColorArchetype, backgroundPixelCount * 6);
        EntityQuery query = entityManager.CreateEntityQuery(ComponentType.ReadWrite<APixelColor>());
       
        SystemHandle GalaxyBackgroundSystemHandle = world.CreateSystem(typeof(GalaxyBackgroundSystem));
       
        NativeArray<APixelColor> colorsNative = query.ToComponentDataArray<APixelColor>(Allocator.Temp);
       
        // Time.realtimeSinceStartup is measured in seconds, not milliseconds.
        Debug.Log("ECS: " + (Time.realtimeSinceStartup - startingTime).ToString() + " s");

        Color currentColor = new Color();       
        int currentIndex = 0;

        for (int i = 0; i < colorsArrays.Length; i++)
        {
            colorsArrays[i] = new Color[backgroundPixelCount];
            for (int j = 0; j < backgroundPixelCount; j++)
            {
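                // Face i occupies the i-th block of backgroundPixelCount entries in the flat array.
                currentIndex = j + i * backgroundPixelCount;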
                currentColor.r = colorsNative[currentIndex].color.x;
                currentColor.g = colorsNative[currentIndex].color.y;
                currentColor.b = colorsNative[currentIndex].color.z;
                currentColor.a = colorsNative[currentIndex].color.w;
                colorsArrays[i][j] = currentColor;
            }
        }
       
        galaxyBackground.SetPixels(colorsArrays[0], CubemapFace.PositiveY);
        galaxyBackground.SetPixels(colorsArrays[1], CubemapFace.PositiveZ);
        galaxyBackground.SetPixels(colorsArrays[2], CubemapFace.NegativeY);
        galaxyBackground.SetPixels(colorsArrays[3], CubemapFace.NegativeZ);
        galaxyBackground.SetPixels(colorsArrays[4], CubemapFace.PositiveX);
        galaxyBackground.SetPixels(colorsArrays[5], CubemapFace.NegativeX);
        galaxyBackground.Apply();
        colorsNative.Dispose();
        world.DestroySystem(GalaxyBackgroundSystemHandle);

        float endingTime = Time.realtimeSinceStartup;
        Debug.Log("Final: " + (endingTime - startingTime).ToString() + " s");
    }
}

[BurstCompile]
[DisableAutoCreation]
public partial struct GalaxyBackgroundSystem : ISystem
{   
    [BurstCompile]
    public void OnCreate(ref SystemState state)
    {
        //Both of these approaches have the same performance, despite how many chunks
        //have been generated by the archetype        

        //JobHandle handle = new JobHandle();
        //handle = new AssignPixelColors().ScheduleParallel(handle);
        //handle.Complete();
        new AssignPixelColors().Run();
    }
}

[BurstCompile]
public partial struct AssignPixelColors : IJobEntity
{
    [BurstCompile]
    public void Execute(ref APixelColor data)
    {
        data.color = new float4(1f, 0f, 0f, 1f);
    }
}

The resolution of each texture is 2048 × 2048. That should give plenty of chunks to spread across multiple cores, shouldn't it? Instead, simply running this on the main thread without the BurstCompile attributes yields the same performance… There will obviously be more work in the job eventually, for random results, but what am I missing in my understanding of ECS and parallel processing?

Show a snapshot of the profiler with the worker threads; it's the first and easiest place to look when asking about performance. It would also be more meaningful to measure the execution time of the body of OnCreate in your system; everything else is irrelevant to what you're supposed to be measuring. Given that OnCreate needs to be Burst-compiled, that would probably be most easily accomplished by timing only the call to CreateSystem.
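A minimal sketch of that measurement, assuming the GalaxyBackgroundSystem type from the first post and that the pixel entities have already been created. SystemTiming is a hypothetical helper name, not something from the original code. It uses System.Diagnostics.Stopwatch rather than Time.realtimeSinceStartup, which reports seconds rather than milliseconds.

```csharp
using System.Diagnostics;
using Unity.Entities;

// Hypothetical helper (not part of the original post).
public static class SystemTiming
{
    // Returns the wall-clock time, in milliseconds, spent inside CreateSystem.
    // CreateSystem runs OnCreate, so this brackets the job and nothing else.
    public static double TimeCreateSystemMs(World world)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        SystemHandle handle = world.CreateSystem(typeof(GalaxyBackgroundSystem));
        stopwatch.Stop();

        // The component data written by the job outlives the system.
        world.DestroySystem(handle);
        return stopwatch.Elapsed.TotalMilliseconds;
    }
}
```

Calling this from Awake after CreateEntity, and logging the result once with Run() and once with ScheduleParallel(), would separate the job cost from entity creation and the array copy.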

Not sure if these are correct, but here are some screenshots of the Profiler. Thank you for taking a look at this. Let me know if I need to upload it differently.

Run on main thread:

Run with ScheduleParallel:

Just tested the non-ECS code for efficiency. Simply writing directly to the Color array, without any ECS, is actually faster.

I suppose this is a combination of allocating an array to simply assign to an array, as well as allocating an entity to represent each pixel… Upon using the non-ECS approach, the process now runs faster by whole multiples.

I did not see much difference when I added random numbers to the IJobEntity, so it may also be the case that most of the overhead was actually in allocating the entities, the NativeArray, and transferring from the NativeArray.

Here is the code that goes faster:

Color[][] colorsArrays = new Color[6][];
int backgroundPixelCount = backgroundResolution * backgroundResolution;

Color currentColor = new Color();       

for (int i = 0; i < colorsArrays.Length; i++)
{
    colorsArrays[i] = new Color[backgroundPixelCount];
    for (int j = 0; j < backgroundPixelCount; j++)
    {
        if(UnityEngine.Random.Range(0f, 1f) < backgroundStarSaturation)
        {
            currentColor.r = UnityEngine.Random.Range(0f, 1f);
            currentColor.g = UnityEngine.Random.Range(0f, 1f);
            currentColor.b = UnityEngine.Random.Range(0f, 1f);
            currentColor.a = 1f;
            colorsArrays[i][j] = currentColor;
        }
        else
        {
            currentColor.r = 0.01f;
            currentColor.g = 0.01f;
            currentColor.b = 0.01f;
            currentColor.a = 1f;
            colorsArrays[i][j] = currentColor;
        }
    }
}

galaxyBackground.SetPixels(colorsArrays[0], CubemapFace.PositiveY);
galaxyBackground.SetPixels(colorsArrays[1], CubemapFace.PositiveZ);
galaxyBackground.SetPixels(colorsArrays[2], CubemapFace.NegativeY);
galaxyBackground.SetPixels(colorsArrays[3], CubemapFace.NegativeZ);
galaxyBackground.SetPixels(colorsArrays[4], CubemapFace.PositiveX);
galaxyBackground.SetPixels(colorsArrays[5], CubemapFace.NegativeX);
galaxyBackground.Apply();

What we really need to look at is the execution of your script’s Awake, so zooming in on that would be helpful. The big thing is seeing how long the scheduled jobs take vs the time for the whole method. Profiler markers to see individual parts of the method would be very useful as well. Also, it might be good to measure the second or later execution (re-start Play Mode) to avoid JIT being involved. My expectation would be the actual operation of the job doesn’t take that long relative to everything else, since it’s a simple operation Burst would generate very efficient code for, so the difference in execution time would be negligible.
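As a rough illustration of the profiler-marker idea, here is a sketch using Unity.Profiling.ProfilerMarker. The class name and marker names are made up for the example, and the comments only indicate where the steps of the original Awake would go.

```csharp
using Unity.Profiling;
using UnityEngine;

// Hypothetical example class; marker names are arbitrary labels.
public class MarkedGalaxySetup : MonoBehaviour
{
    static readonly ProfilerMarker s_CreateEntities = new ProfilerMarker("Galaxy.CreateEntities");
    static readonly ProfilerMarker s_RunSystem = new ProfilerMarker("Galaxy.RunSystem");
    static readonly ProfilerMarker s_CopyAndUpload = new ProfilerMarker("Galaxy.CopyAndUpload");

    private void Awake()
    {
        using (s_CreateEntities.Auto())
        {
            // CreateArchetype / CreateEntity from the original Awake go here.
        }

        using (s_RunSystem.Auto())
        {
            // CreateSystem (which runs the job) goes here.
        }

        using (s_CopyAndUpload.Auto())
        {
            // ToComponentDataArray, the copy loops, SetPixels and Apply go here.
        }
    }
}
```

Each marker then shows up as its own named bar under Awake in the CPU timeline, which makes it easy to compare the job against everything around it.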

There is definitely overhead in scheduling, allocating entities, and filling an array from components, none of which exists when you write each element directly from the start. I'd suggest trying a simple IJobParallelFor, if you haven't done so before and if it's applicable here, to see whether that improves things further. For Burst, using Unity.Mathematics.Random with CreateFromIndex makes sense. Entities seem like overkill in this example, but I presume it's a contrived example or a starting point for something more complex.

The best approach seems to be scheduling 6 jobs, one per array, and completing them together with JobHandle.CompleteAll. Use the C# fixed statement to get a pointer to each managed array; for safety, the fixed block must enclose the whole execution of the jobs, so have one fixed statement obtain all 6 pointers together. Pass each pointer to its job as a field marked with NativeDisableUnsafePtrRestrictionAttribute, along with the array length. Something like this:

public struct ParallelJob1 : IJobParallelFor
{
  [NativeDisableUnsafePtrRestriction] public unsafe Color* BPtr;
  public int BCount;
  public uint InitRandom;
  public void Execute(int index)
  {
    Unity.Mathematics.Random subState = Unity.Mathematics.Random.CreateFromIndex((uint)index ^ InitRandom);
    // …
  }
}
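To make the scheduling side of that suggestion concrete, here is a rough, untested sketch. FillFaceJob and FaceFiller are hypothetical names, and the job only writes a constant colour so that the pinning and CompleteAll pattern stays in focus. It assumes "Allow unsafe code" is enabled for the assembly and that faces holds exactly six non-empty arrays.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Jobs;
using UnityEngine;

// Hypothetical job: writes one constant colour into a pinned managed array.
[BurstCompile]
public unsafe struct FillFaceJob : IJobParallelFor
{
    [NativeDisableUnsafePtrRestriction] public Color* Pixels;
    public Color Value;

    public void Execute(int index)
    {
        Pixels[index] = Value;
    }
}

public static class FaceFiller
{
    static unsafe JobHandle ScheduleFace(Color* pixels, int length, Color value)
    {
        var job = new FillFaceJob { Pixels = pixels, Value = value };
        return job.Schedule(length, 1024); // 1024 is an arbitrary batch size
    }

    public static unsafe void FillAll(Color[][] faces, Color value)
    {
        // One fixed statement pins all six arrays until every job has completed.
        fixed (Color* p0 = faces[0], p1 = faces[1], p2 = faces[2],
                      p3 = faces[3], p4 = faces[4], p5 = faces[5])
        {
            var handles = new NativeArray<JobHandle>(6, Allocator.Temp);
            handles[0] = ScheduleFace(p0, faces[0].Length, value);
            handles[1] = ScheduleFace(p1, faces[1].Length, value);
            handles[2] = ScheduleFace(p2, faces[2].Length, value);
            handles[3] = ScheduleFace(p3, faces[3].Length, value);
            handles[4] = ScheduleFace(p4, faces[4].Length, value);
            handles[5] = ScheduleFace(p5, faces[5].Length, value);

            // Must finish before leaving the fixed block, or the pointers dangle.
            JobHandle.CompleteAll(handles);
            handles.Dispose();
        }
    }
}
```

Completing inside the fixed block is the important part: once the block exits, the garbage collector is free to move the arrays.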

Hello, what kind of improvement are you expecting, please? And also is this only run in Awake once or do you plan it to be in some update?

Your biggest bottleneck is on the main thread: galaxyBackground.Apply. It stalls the whole frame because the textures need to be uploaded from the CPU to the GPU. If you are on a platform that supports compute shaders, use those.

If you really need to do this on the CPU, please zoom in on the part of the player loop that now takes 28.5 ms. There you will most likely see a very tiny bar for your actual code, taking some microseconds or maybe a few milliseconds out of all that, since we can see that a few jobs ran on your worker threads. It's then debatable whether scheduling jobs is actually worth it, or whether the overhead is too big and it's faster to run on the main thread, or to run a single job that you schedule in Awake and complete later, for example in Start.

I have recreated your project. On my i9-10850K (10 cores/20 threads) CPU, the ECS part runs for 1 second on the main thread and 0.95 seconds with worker threads, so there is a difference. Most of the time is spent creating all 25 million entities, which is done on the main thread; the system you then create runs “fast” compared to the rest, only 32 ms per core for me.

The code is actually so slow that the profiler can't even show it properly: in the editor it shows up inside the editor loop, and in a build it's just not there. The profiler pictures at the top probably don't even show the correct frame.

At this point, the mandatory question is: why are you creating an entity for every pixel? Is this some kind of sky simulation where you just need to get the time down? If yes, then this Awake will be slow only at the beginning, and the update of the texture afterwards is fast enough for 60 fps if we only count the 32 ms that it took for me. You then have to deal with the slow texture upload every time you update it.

I was aiming to generate the texture at start-up, in order to decrease download times. This was also an experiment in applying ECS to something real, which is part of why I did it this way. I need to get better at applying ECS in a project.

Part of the issue here is that I am trying to plug ECS into the existing engine architecture (MonoBehaviour). If this were simply one or the other, the friction wouldn't exist. This is good experience for figuring out some realities of ECS.

I haven’t tested this, and I got Copilot to refine it.

The main points here: working in chunks of 4 (float4) to use SIMD, and using LoadRawTextureData to pass this long array of floats directly into the texture. It may need refining for the different Cubemap faces; I’m not sure how the texture data is stored, but I’m hoping loading it raw will work.

Optimising code is doing more in less code/instructions. Anyway, I’d be interested to know the results!

using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;
using UnityEngine;

public class GalaxyBackgroundGenerator : MonoBehaviour
{
    // Assumed to be an RGBAFloat texture whose pixel count matches pixelData.Length.
    public Texture2D galaxyBackground;
    public int backgroundResolution;
    public float backgroundStarSaturation;

    void Start()
    {
        int backgroundPixelCount = backgroundResolution * backgroundResolution;
        NativeArray<float4> pixelData = new NativeArray<float4>(backgroundPixelCount * 6, Allocator.TempJob);

        GenerateStarsJob jobData = new GenerateStarsJob
        {
            pixelData = pixelData, // already float4, so no Reinterpret is needed
            starSaturation = backgroundStarSaturation
        };

        JobHandle handle = jobData.Schedule(pixelData.Length, 64);
        handle.Complete();

        // Upload the generated pixels straight from the native array.
        galaxyBackground.LoadRawTextureData(pixelData);
        galaxyBackground.Apply();

        pixelData.Dispose();
    }
}

using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;
// Alias avoids any clash with UnityEngine.Random if that namespace is also imported.
using Random = Unity.Mathematics.Random;

[BurstCompile]
public struct GenerateStarsJob : IJobParallelFor
{
    [NativeDisableParallelForRestriction]
    public NativeArray<float4> pixelData;
    public float starSaturation;
    private static readonly float4 defaultColor = new float4(0.01f, 0.01f, 0.01f, 1f); // Default color for non-stars

    public void Execute(int index)
    {
        // CreateFromIndex yields a valid, well-mixed state for every index (a raw seed of 0 is rejected).
        Random rnd = Random.CreateFromIndex((uint)index);
        float4 myColor = rnd.NextFloat4();

        // The w component is alpha; reuse its random value as the star/no-star roll.
        if (myColor.w < starSaturation)
        {
            myColor.w = 1f; // Set alpha to 1 for visible stars
        }
        else
        {
            myColor = defaultColor; // Use default color for non-stars
        }

        pixelData[index] = myColor;
    }
}

I realise now that you are using a Cubemap, so you would use SetPixelData, which loads raw NativeArray data, similar to your SetPixels, with a different offset into your large flat array for each face (textureSize * myIntCubemapOffset).
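A hedged sketch of that per-face upload, untested like the code above. CubemapUpload is a hypothetical helper, the face order is an assumption that has to match however the flat array was filled, and the cubemap is created as RGBAFloat so that one float4 maps to one pixel. It slices the flat array with GetSubArray so each call receives exactly one face worth of data; SetPixelData also has a sourceDataStartIndex parameter that could carry the offset instead.

```csharp
using Unity.Collections;
using Unity.Mathematics;
using UnityEngine;

// Hypothetical helper (not part of the original posts).
public static class CubemapUpload
{
    // Assumed layout: face k occupies elements [k * pixelsPerFace, (k + 1) * pixelsPerFace).
    static readonly CubemapFace[] Faces =
    {
        CubemapFace.PositiveX, CubemapFace.NegativeX,
        CubemapFace.PositiveY, CubemapFace.NegativeY,
        CubemapFace.PositiveZ, CubemapFace.NegativeZ
    };

    public static Cubemap Create(NativeArray<float4> pixelData, int resolution)
    {
        int pixelsPerFace = resolution * resolution;
        var cubemap = new Cubemap(resolution, TextureFormat.RGBAFloat, false);

        for (int face = 0; face < Faces.Length; face++)
        {
            // A view into the flat array; no copy is made on the C# side.
            NativeArray<float4> faceData = pixelData.GetSubArray(face * pixelsPerFace, pixelsPerFace);
            cubemap.SetPixelData(faceData, 0, Faces[face]);
        }

        // A single Apply uploads all six faces to the GPU.
        cubemap.Apply(false);
        return cubemap;
    }
}
```

This skips the managed Color[] arrays and the per-pixel copy loop entirely, although the Apply upload cost discussed earlier in the thread remains.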