Instantiate is slow - what are solutions for optimisation?

Here’s where I am now, and something weird is happening. I’ve run isolated instantiation tests: instantiating my archetype 250k times takes around 4ms, so the test case isn’t that bad. In practice, though, the same instantiation measures around 8-10ms. I suspect it’s because the CPU is already under load from earlier work, but it’s really weird that the timings differ so much for essentially the same thing.

Still, I feel like entity instantiation should be faster. Copying the same data 250k times, 45MB overall, shouldn’t even take over 1ms, yet it does, and I’m stuck on how to tackle this problem.
Any ideas are welcome.

Here’s some code for my current test system:

using Unity.Collections;
using Unity.Entities;
using UnityEngine.Profiling;

public struct Comp1 : IComponentData
{
    public int val1;
    public Entity reference1;
}

public struct Comp2 : IComponentData{ } // just a tag

public struct Comp3 : IComponentData
{
    public Entity reference1;
    public Entity reference2;
}

public struct Comp4 : IComponentData
{
    public float val1;
    public float val2;
    public float val3;

    public int val4;
    public int val5;
    public int val6;

    public float val7;
    public float val8;

    public float val9;
    public float val10;
    public float val11;

    public int val12;

    public int val13;

    public int val14;
}

[AlwaysUpdateSystem]
public class InstantiateTestSystem : SystemBase
{
    EntityArchetype CalculateArchetype;
    EntityArchetype CalculateArchetypeSmall;


    EntityQuery querySmall;
    EntityQuery queryBig;

    protected override void OnCreate()
    {
        CalculateArchetype = EntityManager.CreateArchetype(typeof(Comp1), typeof(Comp2), typeof(Comp3), typeof(Comp4));
        CalculateArchetypeSmall = EntityManager.CreateArchetype(typeof(Comp1), typeof(Comp2), typeof(Comp3));

        queryBig = GetEntityQuery(typeof(Comp1), typeof(Comp2), typeof(Comp3), typeof(Comp4));
        querySmall = GetEntityQuery(typeof(Comp1), typeof(Comp2), typeof(Comp3));
    }

    protected override void OnUpdate()
    {
        TestBig();
        //TestSmall();
    }

    public void TestBig()
    {
        Create(CalculateArchetype, 250000);
        Destroy(queryBig);
    }

    public void TestSmall()
    {
        Create(CalculateArchetypeSmall, 250000);
        Destroy(querySmall);
    }

    public void Create(EntityArchetype archetype, int count)
    {
        // Only the CreateEntity call is profiled; disposing the returned array is excluded.
        Profiler.BeginSample("EntityManager.CreateEntity");
        var entities = EntityManager.CreateEntity(archetype, count, Allocator.Temp);
        Profiler.EndSample();

        entities.Dispose();
    }

    public void Destroy(EntityQuery query)
    {
        EntityManager.DestroyEntity(query);
    }
}

With all checks/safety disabled, instantiation takes 3.73ms for TestSmall and 5.75ms for TestBig. The big component sadly costs around 2ms more; maybe I can find ways to shrink it or not use it at all, but that’s the current state.

Adding to that, I’ve played around with the idea of not destroying the entities at all and just enabling/disabling them, circumventing the whole instantiation problem.

My tests show this takes around 1ms, which fits in my budget. I’m happy that’s one possible solution, but feels bad man: back to pooling.
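A minimal sketch of what such an enable/disable pool could look like, assuming the built-in `Disabled` tag component and the `EntityQuery` overloads of `AddComponent`/`RemoveComponent`. The names `PooledTag`, `PoolToggleSystem` and `SetPoolEnabled` are hypothetical and only there for illustration; this is not necessarily how the test above was written.

```csharp
using Unity.Entities;

// Hypothetical tag marking entities that belong to the pool.
public struct PooledTag : IComponentData { }

public class PoolToggleSystem : SystemBase
{
    EntityQuery enabledPool;   // pooled entities that are currently enabled
    EntityQuery disabledPool;  // pooled entities that currently carry Disabled

    protected override void OnCreate()
    {
        // Queries skip entities with Disabled unless Disabled is requested explicitly.
        enabledPool = GetEntityQuery(ComponentType.ReadOnly<PooledTag>());
        disabledPool = GetEntityQuery(
            ComponentType.ReadOnly<PooledTag>(),
            ComponentType.ReadOnly<Disabled>());
    }

    // Toggles the whole pool in one query-based structural change.
    public void SetPoolEnabled(bool enabled)
    {
        if (enabled)
            EntityManager.RemoveComponent<Disabled>(disabledPool);
        else
            EntityManager.AddComponent<Disabled>(enabledPool);
    }

    protected override void OnUpdate() { }
}
```

Because the change is applied to a whole query, it toggles every matching entity at once rather than a chosen subset.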

Out of curiosity, how long does a single-thread process take to memcpy 45 MB on your system?
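As a back-of-the-envelope check on the numbers quoted so far (45 MB overall, 250k entities, a 1 ms target, and roughly 4 ms measured in isolation), taking 1 MB as 10^6 bytes:

```latex
\frac{45\ \text{MB}}{250{,}000} = 180\ \text{bytes per entity}, \qquad
\frac{45\ \text{MB}}{1\ \text{ms}} = 45\ \text{GB/s}, \qquad
\frac{45\ \text{MB}}{4\ \text{ms}} \approx 11\ \text{GB/s}
```

So the 1 ms target asks for roughly four times the throughput the isolated test achieves. Whether a single thread can sustain 45 GB/s depends on the machine, which is what the question is getting at.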

Hm, yeah, excellent question. Not sure why I didn’t test that.

Allocating the array costs the most. I’ve tested with 180 bytes and a count of 250k. By comparison, the EntityManager allocation is much faster than I expected. Now I’m not sure why the malloc is that slow.

Init:

MemCpyTemplate = new NativeArray<byte>(180, Allocator.Persistent);
for (int i = 0; i < MemCpyTemplate.Length; i++)
{
    MemCpyTemplate[i] = (byte) Random.Range(0, 256);
}

actual test:

// Needs "using Unity.Collections.LowLevel.Unsafe;" for GetUnsafePtr and UnsafeUtility.
public void NativeArrayMemcpyTest()
{
    const int count = 250000;

    // Sample 1: allocating the 45 MB destination buffer.
    // Note: NativeArray clears its memory by default (NativeArrayOptions.ClearMemory),
    // so this sample also includes zeroing the buffer.
    Profiler.BeginSample("Memcpy malloc");
    var memTestArray = new NativeArray<byte>(180 * count, Allocator.Temp);
    Profiler.EndSample();

    // Sample 2: replicating the 180-byte template into the buffer (also covers Dispose).
    Profiler.BeginSample("Memcpy test");
    unsafe
    {
        var memTestPtr = (byte*) memTestArray.GetUnsafePtr();
        var templatePtr = MemCpyTemplate.GetUnsafeReadOnlyPtr();

        UnsafeUtility.MemCpyReplicate(memTestPtr, templatePtr, 180, count);

        // Equivalent manual loop:
        // for (int i = 0; i < count; i++)
        // {
        //     UnsafeUtility.MemCpy(memTestPtr, templatePtr, 180);
        //     memTestPtr += 180;
        // }
    }
    memTestArray.Dispose();

    Profiler.EndSample();
}

Well, the principle of “disabling” the entities works. But structural changes to entities are only fast when you can operate on an EntityQuery. Using a NativeArray it’s just as slow.
But I can’t make this logic work with only an EntityQuery. It’s too crude: too many entities would be enabled when only a fraction is needed in a frame. Say you have a pool of 100 and only 11 entities are needed in one frame. The EntityQuery has no count parameter. That would be really nice to have, but looking at the code I don’t think it’s really feasible.

So, as there’s no good way to have a specific set of active entities within chunks, I’ll fall back to a ChunkComponent. I think that’s the magic key to making this as fast as possible. I can write back an activeEntities count and iterate only up to this count, ignoring the rest of the chunk. That way I don’t even need a tag.
Let’s just hope setting the ChunkComponent data is fast enough overall.
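The chunk-component idea could be sketched roughly as below, assuming an Entities API from the same era as the `SystemBase` code earlier in the thread. `ActiveCount`, `PooledValue` and `ActiveCountSystem` are hypothetical names, and the sketch assumes the active entities sit at the front of each chunk; the actual implementation may differ.

```csharp
using System;
using Unity.Collections;
using Unity.Entities;

// Hypothetical chunk component: number of "active" entities at the front of a chunk.
public struct ActiveCount : IComponentData { public int Value; }

// Hypothetical per-entity payload.
public struct PooledValue : IComponentData { public int Value; }

public class ActiveCountSystem : SystemBase
{
    EntityQuery poolQuery;

    protected override void OnCreate()
    {
        poolQuery = GetEntityQuery(
            ComponentType.ReadWrite<PooledValue>(),
            ComponentType.ChunkComponent<ActiveCount>());
    }

    protected override void OnUpdate()
    {
        var valueHandle = GetComponentTypeHandle<PooledValue>();
        var activeHandle = GetComponentTypeHandle<ActiveCount>(true);

        var chunks = poolQuery.CreateArchetypeChunkArray(Allocator.TempJob);
        for (int c = 0; c < chunks.Length; c++)
        {
            var chunk = chunks[c];
            // Clamp so a stale count can never read past the end of the chunk.
            int active = Math.Min(chunk.GetChunkComponentData(activeHandle).Value, chunk.Count);
            var values = chunk.GetNativeArray(valueHandle);

            // Only the first `active` entities are touched; the rest are ignored.
            for (int i = 0; i < active; i++)
            {
                var v = values[i];
                v.Value += 1;
                values[i] = v;
            }
        }
        chunks.Dispose();
    }
}
```

Changing how many entities are active is then a plain data write to the chunk component (for example via `EntityManager.SetChunkComponentData`) rather than a structural change.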

edit: It is really fast! I was now able to bring it down to <2ms for 250k entities. Writing the chunk data and setting the data is really not a problem. Now I just need to replace the NativeQueue with something like a NativeStream; I’ve heard it’s a little faster to write/read.

This seems like the magic ingredient for a really fast event system, or really any logic that requires high-frequency adding/removing of tags. It’s certainly not as cumbersome as pooling with MonoBehaviours/GameObjects.
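For the NativeQueue-to-NativeStream swap mentioned in the edit above, the basic write/read pattern might look like the sketch below. It uses a single buffer on the main thread purely to show the API shape; `NativeStreamSketch` and `WriteThenSum` are hypothetical names, and whether it is actually faster than a NativeQueue in this use case would need to be profiled.

```csharp
using Unity.Collections;

public static class NativeStreamSketch
{
    public static int WriteThenSum()
    {
        // One buffer here; a parallel job would typically use one per chunk or batch.
        var stream = new NativeStream(1, Allocator.TempJob);

        // Write a few ints into buffer 0.
        var writer = stream.AsWriter();
        writer.BeginForEachIndex(0);
        for (int i = 0; i < 4; i++)
            writer.Write(i);
        writer.EndForEachIndex();

        // Read them back; BeginForEachIndex returns the number of items in the buffer.
        var reader = stream.AsReader();
        int itemCount = reader.BeginForEachIndex(0);
        int sum = 0;
        for (int i = 0; i < itemCount; i++)
            sum += reader.Read<int>();
        reader.EndForEachIndex();

        stream.Dispose();
        return sum;
    }
}
```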