NativeHashMap.ParallelWriter TryAdd is NOT Thread Safe (with repro)

Follow up to Infinite loop in NativeHashMap TryAdd
Sorry for making another post, but this is a fairly serious issue that has been plaguing our project for 6 months (I’ve had concerns about it for a while due to some weird behavior randomly appearing elsewhere), and I finally have a simple repro.

Tested in Unity 2019.3.9f1

Add these to your package manifest:
"com.unity.burst": "1.3.0-preview.12",
"com.unity.entities": "0.9.1-preview.15",
This brings in Collections 0.7.1-preview.3.

(We’re not using Entities 0.10 due to another bug, but this issue still exists in the latest version.)

Add this script:

-edit- I have a more consistent repro here:

using Unity.Burst;
using Unity.Collections;
using Unity.Entities;
using UnityEngine;

public class HashMapSystem : SystemBase
{
    private EntityQuery query;

    protected override void OnCreate()
    {
        query = GetEntityQuery(ComponentType.ReadOnly<CommonComponent>());

        var archetype0 = EntityManager.CreateArchetype(typeof(CommonComponent));
        EntityManager.CreateEntity(archetype0, 4000, Allocator.Temp);

        var archetype1 = EntityManager.CreateArchetype(typeof(CommonComponent), typeof(Component1));
        EntityManager.CreateEntity(archetype1, 4000, Allocator.Temp);

        var archetype2 = EntityManager.CreateArchetype(typeof(CommonComponent), typeof(Component1), typeof(Component2));
        EntityManager.CreateEntity(archetype2, 4000, Allocator.Temp);

        var archetype3 = EntityManager.CreateArchetype(typeof(CommonComponent), typeof(Component1), typeof(Component2), typeof(Component3));
        EntityManager.CreateEntity(archetype3, 4000, Allocator.Temp);
    }

    protected override void OnUpdate()
    {
        var hashMap = new NativeHashMap<EntityArchetype, ushort>(query.CalculateChunkCount(), Allocator.TempJob);

        Dependency = new WriteJob
            {
                HashMap = hashMap.AsParallelWriter(),
            }
            .Schedule(query, Dependency);

        Dependency = new ReadJob
            {
                HashMap = hashMap,
            }
            .Schedule(query, Dependency);

        hashMap.Dispose(Dependency);
    }

    [BurstCompile]
    private struct WriteJob : IJobChunk
    {
        public NativeHashMap<EntityArchetype, ushort>.ParallelWriter HashMap;

        public void Execute(ArchetypeChunk chunk, int chunkIndex, int firstEntityIndex)
        {
            DoJob(chunk.Archetype);
        }

        private void DoJob(EntityArchetype archetype)
        {
            HashMap.TryAdd(archetype, 1);
        }
    }

    [BurstCompile]
    private struct ReadJob : IJobChunk
    {
        [ReadOnly] public NativeHashMap<EntityArchetype, ushort> HashMap;

        public void Execute(ArchetypeChunk chunk, int chunkIndex, int firstEntityIndex)
        {
            if (!HashMap.ContainsKey(chunk.Archetype))
            {
                Debug.LogError("FAILED STATE");
            }
        }
    }

    private struct CommonComponent : IComponentData{}

    private struct Component1 : IComponentData
    {
        public int Value1;
    }

    private struct Component2 : IComponentData
    {
        public int Value1;
        public int Value2;
    }

    private struct Component3 : IComponentData
    {
        public int Value1;
        public int Value2;
        public int Value3;
    }
}

Hit play and let it run; it’ll fail eventually.
On my machine with a 3900X (lots of cores) it only takes a few seconds to fail, but it might take longer on a machine with fewer cores.

[screenshot attachment]

I’m going to log a bug report shortly.

Case 1245648


wow

I tried that and did not get any problems. I have a 4-core CPU (i5-4670K). It ran for almost an hour.
Tested using Unity 2020.0a10 with both Entities 0.9.1-preview.15 and Entities 0.10.0-preview.6.

Interesting. I was originally going to write this test with a bunch of jobs that changed the archetypes every frame to simulate our project better, but it failed very quickly for me without changing anything.

I just had one test that wouldn’t fail within 5 minutes, so I added a job to constantly mix up the chunks, and now it consistently fails within a few seconds.

19.3.9f1

[screenshot attachment]

[screenshot attachment]

19.3.13f1

[screenshot attachment]

20.2.0a10

[screenshot attachment]

New repro if you want to try it:

using Unity.Burst;
using Unity.Collections;
using Unity.Entities;
using UnityEngine;
using Random = Unity.Mathematics.Random;

public class HashMapSystem : SystemBase
{
    private EntityQuery query;
    private EndSimulationEntityCommandBufferSystem bufferSystem;

    protected override void OnCreate()
    {
        Debug.Log("System Start");

        query = GetEntityQuery(ComponentType.ReadOnly<CommonComponent>());

        var archetype0 = EntityManager.CreateArchetype(typeof(CommonComponent));
        EntityManager.CreateEntity(archetype0, 4000, Allocator.Temp);

        var archetype1 = EntityManager.CreateArchetype(typeof(CommonComponent), typeof(Component1));
        EntityManager.CreateEntity(archetype1, 4000, Allocator.Temp);

        var archetype2 = EntityManager.CreateArchetype(typeof(CommonComponent), typeof(Component1), typeof(Component2));
        EntityManager.CreateEntity(archetype2, 4000, Allocator.Temp);

        var archetype3 = EntityManager.CreateArchetype(typeof(CommonComponent), typeof(Component1), typeof(Component2), typeof(Component3));
        EntityManager.CreateEntity(archetype3, 4000, Allocator.Temp);

        bufferSystem = World.GetOrCreateSystem<EndSimulationEntityCommandBufferSystem>();
    }

    protected override void OnUpdate()
    {
        var hashMap = new NativeHashMap<EntityArchetype, ushort>(query.CalculateChunkCount(), Allocator.TempJob);

        Dependency = new WriteJob
            {
                HashMap = hashMap.AsParallelWriter(),
            }
            .Schedule(query, Dependency);

        Dependency = new ReadJob
            {
                HashMap = hashMap,
            }
            .Schedule(query, Dependency);

        hashMap.Dispose(Dependency);

        var random = new Random((uint)UnityEngine.Random.Range(1, int.MaxValue));

        var commandBuffer = bufferSystem.CreateCommandBuffer().ToConcurrent();

        Entities.WithAll<CommonComponent>().ForEach((Entity entity, int entityInQueryIndex) =>
        {
            random.InitState((uint)(random.state + entityInQueryIndex));

            switch (random.NextInt(0, 6))
            {
                case 0:
                    commandBuffer.AddComponent<Component1>(entityInQueryIndex, entity);
                    return;
                case 1:
                    commandBuffer.RemoveComponent<Component1>(entityInQueryIndex, entity);
                    return;
                case 2:
                    commandBuffer.AddComponent<Component2>(entityInQueryIndex, entity);
                    return;
                case 3:
                    commandBuffer.RemoveComponent<Component2>(entityInQueryIndex, entity);
                    return;
                case 4:
                    commandBuffer.AddComponent<Component3>(entityInQueryIndex, entity);
                    return;
                case 5:
                    commandBuffer.RemoveComponent<Component3>(entityInQueryIndex, entity);
                    return;
            }

        }).ScheduleParallel();

        bufferSystem.AddJobHandleForProducer(Dependency);
    }

    [BurstCompile]
    private struct WriteJob : IJobChunk
    {
        public NativeHashMap<EntityArchetype, ushort>.ParallelWriter HashMap;

        public void Execute(ArchetypeChunk chunk, int chunkIndex, int firstEntityIndex)
        {
            DoJob(chunk.Archetype);
        }

        private void DoJob(EntityArchetype archetype)
        {
            HashMap.TryAdd(archetype, 1);
        }
    }

    [BurstCompile]
    private struct ReadJob : IJobChunk
    {
        [ReadOnly] public NativeHashMap<EntityArchetype, ushort> HashMap;

        public void Execute(ArchetypeChunk chunk, int chunkIndex, int firstEntityIndex)
        {
            if (!HashMap.ContainsKey(chunk.Archetype))
            {
                Debug.LogError("FAILED STATE");
            }
        }
    }

    private struct CommonComponent : IComponentData{}

    private struct Component1 : IComponentData
    {
        public int Value1;
    }

    private struct Component2 : IComponentData
    {
        public int Value1;
        public int Value2;
    }

    private struct Component3 : IComponentData
    {
        public int Value1;
        public int Value2;
        public int Value3;
    }
}

I should add, even setting

JobsUtility.JobWorkerCount = 4;

it still triggers pretty fast.

[screenshot attachment]

Well, I’m on a 3900X too. WriteJob and ReadJob without the code below (mixing components) work without any problems, but with component mixing I see the problem, and a couple of seconds after that error my editor hangs as if it dropped into an infinite loop. I guess the problem here might be chunk fragmentation itself, which could somehow cause Archetype pointer corruption…
EDIT: Ah, without chunk mixing it also dropped into the failed state after a couple of minutes.

I guess it may somehow be related to the fact that EntityArchetype contains an Archetype pointer and its IEquatable implementation uses that pointer for comparison. But I can’t see how, as chunks should be stable during job execution…

That is the original issue I reported here: https://discussions.unity.com/t/789733

The TryAdd corrupts the hash map so that the nextptr points to itself, and lookups just loop infinitely.

That’s the other thing. I have a bunch of evidence of actual chunks being corrupted at the moment, but I don’t have a repro, so I haven’t posted it.

I’ve had to add this to our project in ChunkDataUtility:

public static byte* GetComponentDataWithTypeRW(Chunk* chunk, int index, int typeIndex, uint globalSystemVersion)
{
    if (chunk == null)
    {
        TypeManager.TypeInfo t = TypeManager.GetTypeInfo(typeIndex);
        Debug.LogError($"Whoops you just crashed but I saved you. In return can you please write TypeInfo {t.TypeIndex} in the console and pass it to Tim and mention null.");
        return null;
    }

    var indexInTypeArray = GetIndexInTypeArray(chunk->Archetype, typeIndex);

    if (indexInTypeArray < 0)
    {
        TypeManager.TypeInfo t = TypeManager.GetTypeInfo(typeIndex);
        Debug.LogError($"Whoops you just crashed but I saved you. In return can you please write TypeInfo {t.TypeIndex} in the console and pass it to Tim and mention -1.");
        return null;
    }

    var offset = chunk->Archetype->Offsets[indexInTypeArray];
    var sizeOf = chunk->Archetype->SizeOfs[indexInTypeArray];

    // Write Component to Chunk. ChangeVersion:Yes OrderVersion:No
    chunk->SetChangeVersion(indexInTypeArray, globalSystemVersion);

    return chunk->Buffer + (offset + sizeOf * index);
}

because about once every 40 minutes on average, EntityCommandBuffer.AddComponent somehow fails during the SetComponent part because the typeIndex is not found on the chunk, even though the component should have just been added… how is that possible?

I had been working on the assumption that something in our project was somehow breaking it, but after spending the last week debugging it and discovering the particular issue in this thread, I’m starting to think otherwise.

-edit-

I should add, this seems to happen with or without Burst.


Yep, I see. I pointed Branimir to this thread; I hope he will look at it after the weekend, it would be very kind of him.


The TryAdd corrupts the hash map so that the nextptr points to itself, and lookups just loop infinitely.

Does TryAdd ever return false?

In code:
var hashMap = new NativeHashMap<EntityArchetype, ushort>(query.CalculateChunkCount(), Allocator.TempJob);

  1. Does it happen if you double the capacity?
  2. Does it happen if you increase capacity by 1?

The infinite loop is a somewhat rare case for me; I’ve only seen it a couple of times, mostly in a build, which is when I hooked up a profiler, stepped through, and first noticed it.

From this repo, TryAdd returns false about 3/4 of the time.

Capacity +1: still enters the failed state.
Capacity ×2: still enters the failed state.

Note: the infinite loop happens 100% of the time on my machine with the code above.

This implies it might be a race condition somewhere. At first I thought it was an issue with tight capacity.

I dropped your code into one of our Hello Cube examples, but I don’t get the issue at all. 2020.1.0b8 + latest packages (but nothing has really changed in the *HashMap.TryAdd implementation for a while).

Ok, I reproduced it; it takes a while in the example I’m using…


Are you running the original or the revised repro? The revised repro here:

seems to reproduce it much faster in general.

-edit-

As for nothing changing in a while: this has been an issue for months, but we’ve been working around it. This is a comment from another developer in our code, relating to a workaround written ~6 months ago.

// We are in an error state! I'm not sure why this sometimes happens, seems to go back to an internal unity error in NativeHashMap.GetKeyArray

I’m not claiming GetKeyArray is the issue; that was just his guess at the time.

I’m just in the process of trying to get a stable build running for QA/publishers as we approach content lock, so I’m actually investigating problems in detail now, which is why I spent a bunch of time trying to get an easy repro for you guys.

I was running the original code.

Can you try going to UnsafeHashMap.cs and replacing TryAddAtomic with this:

        internal static unsafe bool TryAddAtomic(UnsafeHashMapData* data, TKey key, TValue item, int threadIndex)
        {
            TValue tempItem;
            NativeMultiHashMapIterator<TKey> tempIt;
            if (TryGetFirstValueAtomic(data, key, out tempItem, out tempIt))
            {
                return false;
            }

            // Allocate an entry from the free list
            int idx = AllocEntry(data, threadIndex);

            int bucket = key.GetHashCode() & data->bucketCapacityMask;
            // Add the index to the hash-map
            int* buckets = (int*)data->buckets;

            if (Interlocked.CompareExchange(ref buckets[bucket], idx, -1) != -1)
            {
                int* nextPtrs = (int*)data->next;

                do
                {
                    nextPtrs[idx] = buckets[bucket];

                    if (TryGetFirstValueAtomic(data, key, out tempItem, out tempIt))
                    {
                        // Put back the entry in the free list if someone else added it while trying to add
                        do
                        {
                            nextPtrs[idx] = data->firstFreeTLS[threadIndex * UnsafeHashMapData.IntsPerCacheLine];
                        }
                        while (Interlocked.CompareExchange(
                            ref data->firstFreeTLS[threadIndex * UnsafeHashMapData.IntsPerCacheLine]
                            , idx
                            , nextPtrs[idx]
                               ) != nextPtrs[idx]
                        );

                        return false;
                    }
                }
                while (Interlocked.CompareExchange(ref buckets[bucket], idx, nextPtrs[idx]) != nextPtrs[idx]);
            }

            // Write the new value to the entry
            UnsafeUtility.WriteArrayElement(data->keys, idx, key);
            UnsafeUtility.WriteArrayElement(data->values, idx, item);

            return true;
        }

This is not a solution, just a test…

I’ll try the revised example now.

Running either seems to give a similar failure rate. But once I apply this change to TryAddAtomic, it doesn’t happen anymore.

It still failed for me using the TryAddAtomic you posted.

[screenshot attachment]

What I’ve realized is that it fails a lot faster for some reason with Entities 0.9.1 and Collections 0.7.1 than with Entities 0.10 and Collections 0.8 (though it does still fail eventually with 0.10/0.8).

Entities 0.10 / Collections 0.8 took 90 seconds to 4 minutes for me over 3 tests,
vs.
Entities 0.9.1 / Collections 0.7, which took 5-20 seconds to fail over a bunch of tests.

This is a pretty consistent difference.

Update: I’ve figured out the repro and the exact issue, but I still don’t have a proper fix for it.


This is fixed, and the fix will be available in the next release. Thanks for the concise repro case!
