[Sources] Optimization for LocalToParentSystem.UpdateHierarchy

Hi everyone!

I’ve been doing some work on my new animation system called “Kinemation” this weekend. I wanted to experiment with a technique that involves keeping all bones in the hierarchy since that can be a convenient representation for game logic. However, it was nearly impossible to profile the GPU uploads because LocalToParentSystem.UpdateHierarchy was too slow. I have 29,200 skeletons (the max I can stuff into a compute buffer) and the hierarchy update was taking a full 16 ms on my machine.

Oddly though, I got the same performance regardless of whether my skeletons were moving or not. So I checked the change version on LocalToWorld, and it wasn’t being updated when the entities weren’t moving. That meant the system was skipping writes like it was supposed to.

It turns out that the system touches the LocalToWorld of every entity in the hierarchy. For entities that don’t need updates, it only reads from it to propagate it to the children. For entities that do update, it reads from LocalToParent and writes to LocalToWorld. Interestingly enough, my out-of-order processor was loading both LocalToParent and LocalToWorld into cache at the same time when LocalToParent was required, and hence I saw the same random access time penalty for both cases.

My solution was to defer fetching the parent’s LocalToWorld until the child subroutine actually needs it, passing it through ref args so that only the first child that needs it has to load it. Keep in mind that change versions are not deterministic with this system and I didn’t fix that with this optimization, so the behavior may be slightly different due to the different timings. If you don’t rely on a deterministic change version, the actual hierarchy updates should still be correct.

Anyways, try it out and let me know what kind of speeds you get. I see a speedup between 25-50% when no entities are moving, and about the same if not a slight speedup when all the roots are moving.

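using Unity.Burst;
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Entities;
using Unity.Jobs;
using Unity.Mathematics;
using Unity.Transforms;

[UpdateInGroup(typeof(TransformSystemGroup))]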
[UpdateAfter(typeof(EndFrameTRSToLocalToParentSystem))]
[UpdateBefore(typeof(EndFrameWorldToLocalSystem))]
//[UpdateBefore(typeof(EndFrameLocalToParentSystem))]
public class LocalToParentSystem2 : JobComponentSystem
{
    private EntityQuery     m_RootsQuery;
    private EntityQueryMask m_LocalToWorldWriteGroupMask;
    private EntityQuery     m_ChildrenQuery;

    // LocalToWorld = Parent.LocalToWorld * LocalToParent
    [BurstCompile]
    struct UpdateHierarchy : IJobEntityBatch
    {
        [ReadOnly] public ComponentTypeHandle<LocalToWorld>      LocalToWorldTypeHandle;
        [ReadOnly] public BufferTypeHandle<Child>                ChildTypeHandle;
        [ReadOnly] public BufferFromEntity<Child>                ChildFromEntity;
        [ReadOnly] public ComponentDataFromEntity<LocalToParent> LocalToParentFromEntity;
        [ReadOnly] public EntityQueryMask                        LocalToWorldWriteGroupMask;
        public uint                                              LastSystemVersion;

        [NativeDisableContainerSafetyRestriction]
        public ComponentDataFromEntity<LocalToWorld> LocalToWorldFromEntity;

        void ChildLocalToWorld(ref float4x4 parentLocalToWorld, Entity entity, bool updateChildrenTransform, Entity parent, ref bool parentLtwValid)
        {
            updateChildrenTransform = updateChildrenTransform || LocalToParentFromEntity.DidChange(entity, LastSystemVersion);

            float4x4 localToWorldMatrix = default;
            bool     ltwIsValid         = false;

            if (updateChildrenTransform && LocalToWorldWriteGroupMask.Matches(entity))
            {
                if (!parentLtwValid)
                {
                    parentLocalToWorld = LocalToWorldFromEntity[parent].Value;
                    parentLtwValid     = true;
                }
                var localToParent              = LocalToParentFromEntity[entity];
                localToWorldMatrix             = math.mul(parentLocalToWorld, localToParent.Value);
                ltwIsValid                     = true;
                LocalToWorldFromEntity[entity] = new LocalToWorld { Value = localToWorldMatrix };
            }
            else  //This entity has a component with the WriteGroup(LocalToWorld)
            {
                updateChildrenTransform = updateChildrenTransform || LocalToWorldFromEntity.DidChange(entity, LastSystemVersion);
            }
            if (ChildFromEntity.HasComponent(entity))
            {
                var children = ChildFromEntity[entity];
                for (int i = 0; i < children.Length; i++)
                {
                    ChildLocalToWorld(ref localToWorldMatrix, children[i].Value, updateChildrenTransform, entity, ref ltwIsValid);
                }
            }
        }

        public void Execute(ArchetypeChunk batchInChunk, int batchIndex)
        {
            bool updateChildrenTransform =
                batchInChunk.DidChange<LocalToWorld>(LocalToWorldTypeHandle, LastSystemVersion) ||
                batchInChunk.DidChange<Child>(ChildTypeHandle, LastSystemVersion);

            var  chunkLocalToWorld = batchInChunk.GetNativeArray(LocalToWorldTypeHandle);
            var  chunkChildren     = batchInChunk.GetBufferAccessor(ChildTypeHandle);
            bool ltwIsValid        = true;
            for (int i = 0; i < batchInChunk.Count; i++)
            {
                var localToWorldMatrix = chunkLocalToWorld[i].Value;
                var children           = chunkChildren[i];
                for (int j = 0; j < children.Length; j++)
                {
                    ChildLocalToWorld(ref localToWorldMatrix, children[j].Value, updateChildrenTransform, Entity.Null, ref ltwIsValid);
                }
            }
        }
    }

    protected override void OnCreate()
    {
        m_RootsQuery = GetEntityQuery(new EntityQueryDesc
        {
            All = new ComponentType[]
            {
                ComponentType.ReadOnly<LocalToWorld>(),
                ComponentType.ReadOnly<Child>()
            },
            None = new ComponentType[]
            {
                typeof(Parent)
            },
            Options = EntityQueryOptions.FilterWriteGroup
        });

        m_ChildrenQuery = GetEntityQuery(new EntityQueryDesc
        {
            All = new ComponentType[]
            {
                typeof(LocalToWorld),
                ComponentType.ReadOnly<LocalToParent>(),
                ComponentType.ReadOnly<Parent>()
            },
            Options = EntityQueryOptions.FilterWriteGroup
        });
        m_LocalToWorldWriteGroupMask = EntityManager.GetEntityQueryMask(m_ChildrenQuery);
    }

    protected override JobHandle OnUpdate(JobHandle inputDeps)
    {
        var localToWorldType        = GetComponentTypeHandle<LocalToWorld>(true);
        var childType               = GetBufferTypeHandle<Child>(true);
        var childFromEntity         = GetBufferFromEntity<Child>(true);
        var localToParentFromEntity = GetComponentDataFromEntity<LocalToParent>(true);
        var localToWorldFromEntity  = GetComponentDataFromEntity<LocalToWorld>();

        var updateHierarchyJob = new UpdateHierarchy
        {
            LocalToWorldTypeHandle     = localToWorldType,
            ChildTypeHandle            = childType,
            ChildFromEntity            = childFromEntity,
            LocalToParentFromEntity    = localToParentFromEntity,
            LocalToWorldFromEntity     = localToWorldFromEntity,
            LocalToWorldWriteGroupMask = m_LocalToWorldWriteGroupMask,
            LastSystemVersion          = LastSystemVersion
        };
        inputDeps = updateHierarchyJob.ScheduleParallel(m_RootsQuery, 1, inputDeps);
        return inputDeps;
    }
}
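
To show the deferred fetch in isolation, here is a stripped-down sketch in plain C# with no Unity dependencies. It is only an illustration of the ref-arg trick used in the system above, not a replacement for it. Everything in it is made up for the example: the LazyHierarchy class, the use of a single float in place of a float4x4, the Dirty flags standing in for change-version checks, and the ParentFetches counter that tallies how often a parent's LocalToWorld has to be looked up by random access.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the ECS data: plain arrays indexed by node id,
// with a single float playing the role of a 4x4 matrix.
public sealed class LazyHierarchy
{
    public readonly float[] LocalToParent;
    public readonly float[] LocalToWorld;
    public readonly bool[] Dirty;          // stand-in for LocalToParent change detection
    public readonly List<int>[] Children;
    public int ParentFetches;              // random-access reads of a parent's LocalToWorld

    public LazyHierarchy(int nodeCount)
    {
        LocalToParent = new float[nodeCount];
        LocalToWorld = new float[nodeCount];
        Dirty = new bool[nodeCount];
        Children = new List<int>[nodeCount];
        for (int i = 0; i < nodeCount; i++)
            Children[i] = new List<int>();
    }

    public void UpdateFromRoot(int root)
    {
        // The root's value is read directly here, so it starts out valid.
        float rootLtw = LocalToWorld[root];
        bool rootLtwValid = true;
        foreach (int child in Children[root])
            UpdateChild(ref rootLtw, child, false, root, ref rootLtwValid);
    }

    void UpdateChild(ref float parentLtw, int node, bool update, int parent, ref bool parentLtwValid)
    {
        update = update || Dirty[node];

        float ltw = 0f;
        bool ltwValid = false;

        if (update)
        {
            // Only the first child that actually needs the parent's value pays for the fetch.
            // Siblings reuse it through the ref arguments.
            if (!parentLtwValid)
            {
                parentLtw = LocalToWorld[parent];
                parentLtwValid = true;
                ParentFetches++;
            }
            ltw = parentLtw * LocalToParent[node];
            ltwValid = true;
            LocalToWorld[node] = ltw;
        }

        // If nothing changed, this node's LocalToWorld is never loaded here.
        // A descendant that needs it will fetch it on demand.
        foreach (int child in Children[node])
            UpdateChild(ref ltw, child, update, node, ref ltwValid);
    }
}
```

With a chain 0 -> 1 -> 2 -> 3 where nothing is dirty, a walk from the root performs no parent fetches at all. If only node 3 is then marked dirty, the walk performs exactly one fetch (node 2's value), because nodes 1 and 2 are traversed without ever touching their LocalToWorld.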

I have a couple ideas for how to further improve this, but I need some time to test and experiment.

Alright. So I did some more investigation, and I realized that the change version number bug on LocalToWorld may have just been a small implementation bug rather than a design oversight. A child that does not need an update does not need to check if LocalToWorld changed. That is only for entities which don’t match the query and won’t have their LocalToWorld written. By making them no longer share logic (which is more obvious now that they don’t need to fetch LocalToWorld), the race condition disappears and entities being updated no longer dirty their chunk neighbors into flipping updateChildrenTransform true.
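
For anyone wondering why one entity's write can make its chunk neighbors look changed, here is a small model in plain C#. It rests on an assumption about Entities that is worth double-checking against your package version: change versions are stored per chunk per component type, so a DidChange query on an entity is really a query on its chunk. VersionedChunk and its members are hypothetical names for this sketch, and the comparison in DidChange is my recollection of how the package compares versions, not a copy of its source.

```csharp
using System;

// Hypothetical model of chunk-granular change versions; not Unity's actual types.
public sealed class VersionedChunk
{
    public readonly float[] Values;
    public uint ChangeVersion;   // one version for the whole chunk, not one per entity

    public VersionedChunk(int capacity, uint initialVersion)
    {
        Values = new float[capacity];
        ChangeVersion = initialVersion;
    }

    // Writing any single element stamps the whole chunk with the current global version.
    public void Write(int index, float value, uint globalSystemVersion)
    {
        Values[index] = value;
        ChangeVersion = globalSystemVersion;
    }

    // Wrap-safe "is newer than" test. A lastSystemVersion of 0 means the system
    // has never run, so everything counts as changed.
    public bool DidChange(uint lastSystemVersion)
    {
        if (lastSystemVersion == 0)
            return true;
        return unchecked((int)(ChangeVersion - lastSystemVersion)) > 0;
    }
}
```

In this model, if one thread writes element 0 while another thread asks whether element 1 changed, the answer depends purely on which thread got there first. That is the kind of timing dependence described above: a child that was only meant to be read ends up reporting a change because a neighbor in the same chunk was just written.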

The end result is a deterministic and slightly faster version:

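using Unity.Burst;
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Entities;
using Unity.Jobs;
using Unity.Mathematics;
using Unity.Transforms;

[UpdateInGroup(typeof(TransformSystemGroup))]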
[UpdateAfter(typeof(EndFrameTRSToLocalToParentSystem))]
[UpdateBefore(typeof(EndFrameWorldToLocalSystem))]
//[UpdateBefore(typeof(EndFrameLocalToParentSystem))]
public class LocalToParentSystem2 : JobComponentSystem
{
    private EntityQuery     m_RootsQuery;
    private EntityQueryMask m_LocalToWorldWriteGroupMask;
    private EntityQuery     m_ChildrenQuery;

    // LocalToWorld = Parent.LocalToWorld * LocalToParent
    [BurstCompile]
    struct UpdateHierarchy : IJobEntityBatch
    {
        [ReadOnly] public ComponentTypeHandle<LocalToWorld>      LocalToWorldTypeHandle;
        [ReadOnly] public BufferTypeHandle<Child>                ChildTypeHandle;
        [ReadOnly] public BufferFromEntity<Child>                ChildFromEntity;
        [ReadOnly] public ComponentDataFromEntity<LocalToParent> LocalToParentFromEntity;
        [ReadOnly] public ComponentDataFromEntity<Parent>        ParentFromEntity;
        [ReadOnly] public EntityQueryMask                        LocalToWorldWriteGroupMask;
        public uint                                              LastSystemVersion;

        [NativeDisableContainerSafetyRestriction]
        public ComponentDataFromEntity<LocalToWorld> LocalToWorldFromEntity;

        void ChildLocalToWorld(ref float4x4 parentLocalToWorld, Entity entity, bool updateChildrenTransform, Entity parent, ref bool parentLtwValid)
        {
            updateChildrenTransform = updateChildrenTransform ||
                                      LocalToParentFromEntity.DidChange(entity, LastSystemVersion) ||
                                      ParentFromEntity.DidChange(entity, LastSystemVersion);

            float4x4 localToWorldMatrix = default;
            bool     ltwIsValid         = false;

            bool isDependent = LocalToWorldWriteGroupMask.Matches(entity);
            if (updateChildrenTransform && isDependent)
            {
                if (!parentLtwValid)
                {
                    parentLocalToWorld = LocalToWorldFromEntity[parent].Value;
                    parentLtwValid     = true;
                }
                var localToParent              = LocalToParentFromEntity[entity];
                localToWorldMatrix             = math.mul(parentLocalToWorld, localToParent.Value);
                ltwIsValid                     = true;
                LocalToWorldFromEntity[entity] = new LocalToWorld { Value = localToWorldMatrix };
            }
            else if (!isDependent)  //This entity has a component with the WriteGroup(LocalToWorld)
            {
                updateChildrenTransform = updateChildrenTransform || LocalToWorldFromEntity.DidChange(entity, LastSystemVersion);
            }
            if (ChildFromEntity.HasComponent(entity))
            {
                var children = ChildFromEntity[entity];
                for (int i = 0; i < children.Length; i++)
                {
                    ChildLocalToWorld(ref localToWorldMatrix, children[i].Value, updateChildrenTransform, entity, ref ltwIsValid);
                }
            }
        }

        public void Execute(ArchetypeChunk batchInChunk, int batchIndex)
        {
            bool updateChildrenTransform =
                batchInChunk.DidChange<LocalToWorld>(LocalToWorldTypeHandle, LastSystemVersion) ||
                batchInChunk.DidChange<Child>(ChildTypeHandle, LastSystemVersion);

            var  chunkLocalToWorld = batchInChunk.GetNativeArray(LocalToWorldTypeHandle);
            var  chunkChildren     = batchInChunk.GetBufferAccessor(ChildTypeHandle);
            bool ltwIsValid        = true;
            for (int i = 0; i < batchInChunk.Count; i++)
            {
                var localToWorldMatrix = chunkLocalToWorld[i].Value;
                var children           = chunkChildren[i];
                for (int j = 0; j < children.Length; j++)
                {
                    ChildLocalToWorld(ref localToWorldMatrix, children[j].Value, updateChildrenTransform, Entity.Null, ref ltwIsValid);
                }
            }
        }
    }

    protected override void OnCreate()
    {
        m_RootsQuery = GetEntityQuery(new EntityQueryDesc
        {
            All = new ComponentType[]
            {
                ComponentType.ReadOnly<LocalToWorld>(),
                ComponentType.ReadOnly<Child>()
            },
            None = new ComponentType[]
            {
                typeof(Parent)
            },
            Options = EntityQueryOptions.FilterWriteGroup
        });

        m_ChildrenQuery = GetEntityQuery(new EntityQueryDesc
        {
            All = new ComponentType[]
            {
                typeof(LocalToWorld),
                ComponentType.ReadOnly<LocalToParent>(),
                ComponentType.ReadOnly<Parent>()
            },
            Options = EntityQueryOptions.FilterWriteGroup
        });
        m_LocalToWorldWriteGroupMask = EntityManager.GetEntityQueryMask(m_ChildrenQuery);
    }

    protected override JobHandle OnUpdate(JobHandle inputDeps)
    {
        var localToWorldType        = GetComponentTypeHandle<LocalToWorld>(true);
        var childType               = GetBufferTypeHandle<Child>(true);
        var childFromEntity         = GetBufferFromEntity<Child>(true);
        var localToParentFromEntity = GetComponentDataFromEntity<LocalToParent>(true);
        var parentFromEntity        = GetComponentDataFromEntity<Parent>(true);
        var localToWorldFromEntity  = GetComponentDataFromEntity<LocalToWorld>();

        var updateHierarchyJob = new UpdateHierarchy
        {
            LocalToWorldTypeHandle     = localToWorldType,
            ChildTypeHandle            = childType,
            ChildFromEntity            = childFromEntity,
            LocalToParentFromEntity    = localToParentFromEntity,
            ParentFromEntity           = parentFromEntity,
            LocalToWorldFromEntity     = localToWorldFromEntity,
            LocalToWorldWriteGroupMask = m_LocalToWorldWriteGroupMask,
            LastSystemVersion          = LastSystemVersion
        };
        inputDeps = updateHierarchyJob.ScheduleParallel(m_RootsQuery, 1, inputDeps);
        return inputDeps;
    }
}

I think that’s the limit in terms of performance without going into assembly-level optimizations or radically changing the algorithm. I tried comparing the previous LocalToWorld to the new one on every write to catch DidChange false positives on the entity level. Unfortunately that seemed to do more harm than good.

I have an idea using a phased algorithm which might still be able to yield more performance. Stay tuned!

Edit: Code updated with a fix for detecting changed parents.

Hey everyone.

So this will probably be my final update in this space in the near term. I tried a lot of stuff this weekend to make this faster, and most of it failed.

I made the observation that when instantiating many hierarchies, it is likely a chunk will hold entities at the same level in the hierarchy. For this reason, it should be possible to iterate by chunks and get hardware prefetching bonuses for LocalToWorld and LocalToParent instead of those being random access (at the cost of the parent’s LocalToWorld becoming the random access instead).

I tried a whole bunch of strategies of walking down and up the hierarchy, dumping metadata into all sorts of containers, and sorting using several different algorithms to batch up data. I tried storing parent LocalToWorld pointers, hashing chunks, and doing all the change filters and write group checks per chunk instead of per instance. But despite all that, nothing I tried beat plain ComponentDataFromEntity lookups. ComponentDataFromEntity is simply fast.

I made all of these attempts trying to stick to the original system constraints as much as possible. However, it just wasn’t going to work. So I finally decided to allow myself to make some archetype changes, and I got something!

The idea is that by caching the depth levels at both entity and chunk granularity, I could completely bypass the hierarchy walking step for most chunks. This solution performs twice as fast as Unity’s under heavy workloads. However, it doesn’t scale down nearly as well since it schedules 52 jobs.
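
As a rough illustration of the chunk-level depth cache (not the system itself, which uses chunk components and parallel jobs), here is a plain C# sketch of building a per-chunk depth bitmask and scattering chunk indices into per-depth buckets. DepthBuckets, BuildMask and Scatter are hypothetical names. Clamping anything deeper than the last level into the final bucket is a simplification made for this sketch and may not match how the real system handles deep children.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, Unity-free sketch of bucketing chunks by the depths they contain.
public static class DepthBuckets
{
    public const int MaxDepthIterations = 16;

    // Bit d of the result is set when at least one entity in the chunk sits at depth d.
    // Depths beyond the last level are lumped into the final bit (sketch simplification).
    public static uint BuildMask(byte[] depthsInChunk)
    {
        uint mask = 0;
        foreach (byte depth in depthsInChunk)
        {
            int clamped = Math.Min(MaxDepthIterations - 1, depth);
            mask |= 1u << clamped;
        }
        return mask;
    }

    // For each depth level, collects the indices of the chunks that contain that depth.
    // A chunk holding entities at several depths lands in several buckets.
    public static List<int>[] Scatter(uint[] chunkMasks)
    {
        var buckets = new List<int>[MaxDepthIterations];
        for (int d = 0; d < MaxDepthIterations; d++)
            buckets[d] = new List<int>();

        for (int chunk = 0; chunk < chunkMasks.Length; chunk++)
        {
            for (int d = 0; d < MaxDepthIterations; d++)
            {
                if ((chunkMasks[chunk] & (1u << d)) != 0)
                    buckets[d].Add(chunk);
            }
        }
        return buckets;
    }
}
```

Once the buckets exist, each depth level can be processed as a flat list of chunks in order from shallowest to deepest, so every parent is guaranteed to be up to date before its children are touched, and no hierarchy walk is needed for the chunks whose masks did not change.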

I wrote the system below using a couple of utilities from my framework. You can replace the EntityQuery code with the traditional syntax. And UnsafeParallelBlockList is just a slightly different flavor of UnsafeStream. Anyways, I’m curious to know if you also see a speedup or if the large job count makes it not worthwhile.

[UpdateInGroup(typeof(TransformSystemGroup))]
[UpdateAfter(typeof(EndFrameLocalToParentSystem))]
[UpdateBefore(typeof(EndFrameWorldToLocalSystem))]
public unsafe class LatiosLocalToParentSystem3 : SubSystem
{
    EntityQuery     m_childWithParentDependencyQuery;
    EntityQueryMask m_childWithParentDependencyMask;

    EntityQuery m_childQuery;
    EntityQuery m_childMissingDepthQuery;

    const int kMaxDepthIterations = 16;

    protected override void OnCreate()
    {
        m_childWithParentDependencyQuery = Fluent.WithAll<LocalToWorld>(false).WithAll<LocalToParent>(true).WithAll<Parent>(true).UseWriteGroups().Build();
        m_childWithParentDependencyMask  = m_childWithParentDependencyQuery.GetEntityQueryMask();
        m_childQuery                     = Fluent.WithAll<Parent>(true).WithAll<Depth>(true).Build();
        m_childMissingDepthQuery         = Fluent.WithAll<Parent>(true).Without<Depth>().Build();
    }

    protected override void OnUpdate()
    {
        if (!m_childMissingDepthQuery.IsEmptyIgnoreFilter)
        {
            var depthTypes = new ComponentTypes(typeof(Depth), ComponentType.ChunkComponent<ChunkDepthMask>());
            EntityManager.AddComponent(m_childMissingDepthQuery, depthTypes);
        }

        var parentHandle              = GetComponentTypeHandle<Parent>(true);
        var parentCdfe                = GetComponentDataFromEntity<Parent>(true);
        var childHandle               = GetBufferTypeHandle<Child>(true);
        var childBfe                  = GetBufferFromEntity<Child>(true);
        var depthWriteHandle          = GetComponentTypeHandle<Depth>(false);
        var depthReadHandle           = GetComponentTypeHandle<Depth>(true);
        var depthCdfe                 = GetComponentDataFromEntity<Depth>(false);
        var chunkDepthMaskWriteHandle = GetComponentTypeHandle<ChunkDepthMask>(false);
        var chunkDepthMaskReadHandle  = GetComponentTypeHandle<ChunkDepthMask>(true);
        var ltwWriteHandle            = GetComponentTypeHandle<LocalToWorld>(false);
        var ltwReadHandle             = GetComponentTypeHandle<LocalToWorld>(true);
        var ltpHandle                 = GetComponentTypeHandle<LocalToParent>(true);
        var entityHandle              = GetEntityTypeHandle();
        var ltwWriteCdfe              = GetComponentDataFromEntity<LocalToWorld>(false);
        var ltwReadCdfe               = GetComponentDataFromEntity<LocalToWorld>(true);
        var ltpCdfe                   = GetComponentDataFromEntity<LocalToParent>(true);

        uint lastSystemVersion = LastSystemVersion;

        var blockLists = new NativeArray<UnsafeParallelBlockList>(kMaxDepthIterations, Allocator.TempJob);
        for (int i = 0; i < kMaxDepthIterations; i++)
        {
            blockLists[i] = new UnsafeParallelBlockList(sizeof(ArchetypeChunk), 64, Allocator.TempJob);
        }
        var chunkList       = new NativeList<ArchetypeChunk>(Allocator.TempJob);
        var needsUpdateList = new NativeList<bool>(Allocator.TempJob);

        Dependency = new PatchDepthsJob
        {
            parentHandle      = parentHandle,
            parentCdfe        = parentCdfe,
            childHandle       = childHandle,
            childBfe          = childBfe,
            depthCdfe         = depthCdfe,
            depthHandle       = depthWriteHandle,
            lastSystemVersion = lastSystemVersion
        }.ScheduleParallel(m_childQuery, 1, Dependency);

        Dependency = new PatchChunkDepthMasksJob
        {
            depthHandle          = depthReadHandle,
            chunkDepthMaskHandle = chunkDepthMaskWriteHandle,
            lastSystemVersion    = lastSystemVersion
        }.ScheduleParallel(m_childQuery, 1, Dependency);

        Dependency = new ScatterChunksToDepthsJob
        {
            chunkDepthMaskHandle = chunkDepthMaskReadHandle,
            blockLists           = blockLists
        }.ScheduleParallel(m_childQuery, 1, Dependency);

        for (int i = 0; i < kMaxDepthIterations; i++)
        {
            Dependency = new ConvertBlockListToArrayJob
            {
                chunkList       = chunkList,
                needsUpdateList = needsUpdateList,
                blockLists      = blockLists,
                depthLevel      = i
            }.Schedule(Dependency);

            Dependency = new CheckIfMatricesShouldUpdateForSingleDepthLevelJob
            {
                chunkList         = chunkList.AsDeferredJobArray(),
                needsUpdateList   = needsUpdateList.AsDeferredJobArray(),
                depth             = i,
                depthHandle       = depthReadHandle,
                entityHandle      = entityHandle,
                lastSystemVersion = lastSystemVersion,
                ltpHandle         = ltpHandle,
                ltwCdfe           = ltwReadCdfe,
                parentHandle      = parentHandle,
                shouldUpdateMask  = m_childWithParentDependencyMask
            }.Schedule(chunkList, 1, Dependency);

            Dependency = new UpdateMatricesOfSingleDepthLevelJob
            {
                chunkList       = chunkList.AsDeferredJobArray(),
                needsUpdateList = needsUpdateList.AsDeferredJobArray(),
                depth           = i,
                depthHandle     = depthReadHandle,
                ltpHandle       = ltpHandle,
                ltwCdfe         = ltwReadCdfe,
                ltwHandle       = ltwWriteHandle,
                parentHandle    = parentHandle,
            }.Schedule(chunkList, 1, Dependency);
        }

        Dependency = new UpdateMatricesOfDeepChildrenJob
        {
            chunkList         = chunkList.AsDeferredJobArray(),
            childBfe          = childBfe,
            childHandle       = childHandle,
            depthHandle       = depthReadHandle,
            depthLevel        = kMaxDepthIterations - 1,
            lastSystemVersion = lastSystemVersion,
            ltwCdfe           = ltwWriteCdfe,
            ltpCdfe           = ltpCdfe,
            ltwHandle         = ltwReadHandle,
            ltwWriteGroupMask = m_childWithParentDependencyMask,
            parentCdfe        = parentCdfe
        }.Schedule(chunkList, 1, Dependency);

        Dependency = blockLists.Dispose(Dependency);
        Dependency = chunkList.Dispose(Dependency);
        Dependency = needsUpdateList.Dispose(Dependency);
    }

    struct Depth : IComponentData
    {
        public byte depth;
    }

    struct ChunkDepthMask : IComponentData
    {
        public BitField32 chunkDepthMask;
    }

    [BurstCompile]
    struct PatchDepthsJob : IJobEntityBatch
    {
        [ReadOnly] public ComponentTypeHandle<Parent>                                   parentHandle;
        [ReadOnly] public ComponentDataFromEntity<Parent>                               parentCdfe;
        [ReadOnly] public BufferTypeHandle<Child>                                       childHandle;
        [ReadOnly] public BufferFromEntity<Child>                                       childBfe;
        [NativeDisableContainerSafetyRestriction] public ComponentDataFromEntity<Depth> depthCdfe;
        public ComponentTypeHandle<Depth>                                               depthHandle;

        public uint lastSystemVersion;

        public void Execute(ArchetypeChunk batchInChunk, int batchIndex)
        {
            if (!batchInChunk.DidChange(parentHandle, lastSystemVersion))
                return;

            var parents = batchInChunk.GetNativeArray(parentHandle);

            BufferAccessor<Child> childAccess         = default;
            bool                  hasChildrenToUpdate = batchInChunk.Has(childHandle);
            if (hasChildrenToUpdate)
                childAccess           = batchInChunk.GetBufferAccessor(childHandle);
            NativeArray<Depth> depths = default;

            for (int i = 0; i < batchInChunk.Count; i++)
            {
                if (IsDepthChangeRoot(parents[i].Value, out var depth))
                {
                    if (!depths.IsCreated)
                        depths = batchInChunk.GetNativeArray(depthHandle);

                    var startDepth = new Depth { depth = depth };
                    depths[i]                          = startDepth;
                    startDepth.depth++;

                    if (hasChildrenToUpdate)
                    {
                        foreach (var child in childAccess[i])
                        {
                            WriteDepthAndRecurse(child.Value, startDepth);
                        }
                    }
                }
            }
        }

        bool IsDepthChangeRoot(Entity parent, out byte depth)
        {
            var current = parent;
            depth       = 0;
            while (parentCdfe.HasComponent(current))
            {
                if (parentCdfe.DidChange(current, lastSystemVersion))
                {
                    return false;
                }
                depth++;
                current = parentCdfe[current].Value;
            }
            return true;
        }

        void WriteDepthAndRecurse(Entity child, Depth depth)
        {
            depthCdfe[child] = depth;
            depth.depth++;
            if (childBfe.HasComponent(child))
            {
                foreach (var c in childBfe[child])
                {
                    WriteDepthAndRecurse(c.Value, depth);
                }
            }
        }
    }

    [BurstCompile]
    struct PatchChunkDepthMasksJob : IJobEntityBatch
    {
        [ReadOnly] public ComponentTypeHandle<Depth> depthHandle;
        public ComponentTypeHandle<ChunkDepthMask>   chunkDepthMaskHandle;
        public uint                                  lastSystemVersion;

        public void Execute(ArchetypeChunk batchInChunk, int batchIndex)
        {
            if (batchInChunk.DidChange(depthHandle, lastSystemVersion) || batchInChunk.DidOrderChange(lastSystemVersion))
            {
                BitField32 depthMask = default;
                var        depths    = batchInChunk.GetNativeArray(depthHandle);
                for (int i = 0; i < batchInChunk.Count; i++)
                {
                    var clampDepth = math.min(kMaxDepthIterations, depths[i].depth);
                    depthMask.SetBits(clampDepth, true);
                }

                batchInChunk.SetChunkComponentData(chunkDepthMaskHandle, new ChunkDepthMask { chunkDepthMask = depthMask });
            }
        }
    }

    [BurstCompile]
    struct ScatterChunksToDepthsJob : IJobEntityBatch
    {
        [ReadOnly] public ComponentTypeHandle<ChunkDepthMask> chunkDepthMaskHandle;

        [NativeDisableParallelForRestriction, NativeDisableUnsafePtrRestriction] public NativeArray<UnsafeParallelBlockList> blockLists;

        [NativeSetThreadIndex]
        public int m_NativeThreadIndex;

        public void Execute(ArchetypeChunk batchInChunk, int batchIndex)
        {
            var mask = batchInChunk.GetChunkComponentData(chunkDepthMaskHandle).chunkDepthMask;

            for (int i = 0; i < kMaxDepthIterations; i++)
            {
                if (mask.IsSet(i))
                {
                    blockLists[i].Write(batchInChunk, m_NativeThreadIndex);
                }
            }
        }
    }

    // Todo: Make wide for each depth level?
    [BurstCompile]
    struct ConvertBlockListToArrayJob : IJob
    {
        public NativeList<ArchetypeChunk>                                               chunkList;
        public NativeList<bool>                                                         needsUpdateList;
        [NativeDisableUnsafePtrRestriction] public NativeArray<UnsafeParallelBlockList> blockLists;
        public int                                                                      depthLevel;

        public void Execute()
        {
            chunkList.Clear();
            needsUpdateList.Clear();
            int count = blockLists[depthLevel].Count();
            chunkList.ResizeUninitialized(count);
            needsUpdateList.ResizeUninitialized(count);
            blockLists[depthLevel].GetElementValues(chunkList.AsArray());
            blockLists[depthLevel].Dispose();
        }
    }

    [BurstCompile]
    struct CheckIfMatricesShouldUpdateForSingleDepthLevelJob : IJobParallelForDefer
    {
        [ReadOnly] public NativeArray<ArchetypeChunk> chunkList;

        [ReadOnly] public ComponentTypeHandle<LocalToParent>    ltpHandle;
        [ReadOnly] public ComponentTypeHandle<Parent>           parentHandle;
        [ReadOnly] public ComponentTypeHandle<Depth>            depthHandle;
        [ReadOnly] public EntityTypeHandle                      entityHandle;
        [ReadOnly] public ComponentDataFromEntity<LocalToWorld> ltwCdfe;

        public NativeArray<bool> needsUpdateList;

        public EntityQueryMask shouldUpdateMask;
        public int             depth;
        public uint            lastSystemVersion;

        public void Execute(int index)
        {
            var chunk = chunkList[index];

            if (!shouldUpdateMask.Matches(chunk.GetNativeArray(entityHandle)[0]))
            {
                needsUpdateList[index] = false;
                return;
            }

            var parents = chunk.GetNativeArray(parentHandle);
            var depths  = chunk.GetNativeArray(depthHandle);

            if (chunk.DidChange(parentHandle, lastSystemVersion) || chunk.DidChange(ltpHandle, lastSystemVersion))
            {
                // Fast path. No need to check for changes on parent.
                needsUpdateList[index] = true;
            }
            else
            {
                for (int i = 0; i < chunk.Count; i++)
                {
                    if (depth == depths[i].depth)
                    {
                        var parent = parents[i].Value;
                        if (ltwCdfe.DidChange(parent, lastSystemVersion))
                        {
                            needsUpdateList[index] = true;
                            return;
                        }
                    }
                }
                needsUpdateList[index] = false;
            }
        }
    }

    [BurstCompile]
    struct UpdateMatricesOfSingleDepthLevelJob : IJobParallelForDefer
    {
        [ReadOnly] public NativeArray<ArchetypeChunk>                                      chunkList;
        [NativeDisableContainerSafetyRestriction] public ComponentTypeHandle<LocalToWorld> ltwHandle;
        [ReadOnly] public NativeArray<bool>                                                needsUpdateList;

        [ReadOnly] public ComponentTypeHandle<LocalToParent>    ltpHandle;
        [ReadOnly] public ComponentTypeHandle<Parent>           parentHandle;
        [ReadOnly] public ComponentTypeHandle<Depth>            depthHandle;
        [ReadOnly] public ComponentDataFromEntity<LocalToWorld> ltwCdfe;

        public int depth;

        public void Execute(int index)
        {
            if (!needsUpdateList[index])
                return;
            var chunk   = chunkList[index];
            var parents = chunk.GetNativeArray(parentHandle);
            var depths  = chunk.GetNativeArray(depthHandle);
            var ltps    = chunk.GetNativeArray(ltpHandle);

            // Fast path. No need to check for changes on parent since it already happened.
            NativeArray<LocalToWorld> ltws = chunk.GetNativeArray(ltwHandle);

            for (int i = 0; i < chunk.Count; i++)
            {
                if (depth == depths[i].depth)
                {
                    ltws[i] = new LocalToWorld { Value = math.mul(ltwCdfe[parents[i].Value].Value, ltps[i].Value) };
                }
            }
        }
    }

    [BurstCompile]
    struct UpdateMatricesOfDeepChildrenJob : IJobParallelForDefer
    {
        [ReadOnly] public NativeArray<ArchetypeChunk>            chunkList;
        [ReadOnly] public ComponentTypeHandle<LocalToWorld>      ltwHandle;
        [ReadOnly] public ComponentTypeHandle<Depth>             depthHandle;
        [ReadOnly] public BufferTypeHandle<Child>                childHandle;
        [ReadOnly] public BufferFromEntity<Child>                childBfe;
        [ReadOnly] public ComponentDataFromEntity<LocalToParent> ltpCdfe;
        [ReadOnly] public ComponentDataFromEntity<Parent>        parentCdfe;
        [ReadOnly] public EntityQueryMask                        ltwWriteGroupMask;
        public uint                                              lastSystemVersion;
        public int                                               depthLevel;

        [NativeDisableContainerSafetyRestriction]
        public ComponentDataFromEntity<LocalToWorld> ltwCdfe;

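        // Recursively updates one child. parentLocalToWorld is passed by ref and is only fetched
        // from ltwCdfe once some child actually needs it (parentLtwValid tracks whether it was loaded),
        // which avoids a random LocalToWorld access for subtrees that did not change.
        void ChildLocalToWorld(ref float4x4 parentLocalToWorld, Entity entity, bool updateChildrenTransform, Entity parent, ref bool parentLtwValid)
        {
            updateChildrenTransform = updateChildrenTransform ||
                                      ltpCdfe.DidChange(entity, lastSystemVersion) ||
                                      parentCdfe.DidChange(entity, lastSystemVersion);

            float4x4 localToWorldMatrix = default;
            bool     ltwIsValid         = false;

            bool isDependent = ltwWriteGroupMask.Matches(entity);
            if (updateChildrenTransform && isDependent)
            {
                if (!parentLtwValid)
                {
                    // First child that needs the parent's matrix loads it; siblings reuse it through the ref.
                    parentLocalToWorld = ltwCdfe[parent].Value;
                    parentLtwValid     = true;
                }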
                var localToParent  = ltpCdfe[entity];
                localToWorldMatrix = math.mul(parentLocalToWorld, localToParent.Value);
                ltwIsValid         = true;
                ltwCdfe[entity]    = new LocalToWorld { Value = localToWorldMatrix };
            }
            else if (!isDependent)  //This entity has a component with the WriteGroup(LocalToWorld)
            {
                updateChildrenTransform = updateChildrenTransform || ltwCdfe.DidChange(entity, lastSystemVersion);
            }
            if (childBfe.HasComponent(entity))
            {
                var children = childBfe[entity];
                for (int i = 0; i < children.Length; i++)
                {
                    ChildLocalToWorld(ref localToWorldMatrix, children[i].Value, updateChildrenTransform, entity, ref ltwIsValid);
                }
            }
        }

        public void Execute(int index)
        {
            var  batchInChunk            = chunkList[index];
            bool updateChildrenTransform =
                batchInChunk.DidChange<LocalToWorld>(ltwHandle, lastSystemVersion) ||
                batchInChunk.DidChange<Child>(childHandle, lastSystemVersion);

            var  chunkLocalToWorld = batchInChunk.GetNativeArray(ltwHandle);
            var  depths            = batchInChunk.GetNativeArray(depthHandle);
            var  chunkChildren     = batchInChunk.GetBufferAccessor(childHandle);
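            // Root matrices are read directly from the chunk below, so they are always valid
            // and the Entity.Null parent passed to ChildLocalToWorld is never dereferenced.
            bool ltwIsValid        = true;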
            for (int i = 0; i < batchInChunk.Count; i++)
            {
                if (depths[i].depth == depthLevel)
                {
                    var localToWorldMatrix = chunkLocalToWorld[i].Value;
                    var children           = chunkChildren[i];
                    for (int j = 0; j < children.Length; j++)
                    {
                        ChildLocalToWorld(ref localToWorldMatrix, children[j].Value, updateChildrenTransform, Entity.Null, ref ltwIsValid);
                    }
                }
            }
        }
    }
}

Any update on this?
I also get strangely bad performance from that system/job. For me, LocalToParentSystem.UpdateHierarchy takes ~2.5 ms each frame.

I have made a couple of updates in some other threads. You can find the latest versions here; they will be an official part of my next major framework release in a couple of weeks: Kinemation-Skinning-Prototype/Packages/com.latios.latios-framework/Core/Core/Systems/Transforms at 063648a8598ee69ae396baa9c2f81eee967d3ae9 · Dreaming381/Kinemation-Skinning-Prototype · GitHub

“Improved” ranges from 4% to 50% faster depending on how well you keep change filters clean. “Extreme” ranges from 2X faster to 2X slower depending on how many structural and hierarchical changes you have, and has a high fixed overhead which makes it really only viable for high entity counts. I haven’t yet tested if IJobParallelForDefer can be scheduled from a Burst system. But if so, that’s a further area of improvement.

The only thing I can’t quite figure out is why only one job in the Profiler actually does any work. The implementation calls ScheduleParallel(), yet there is still a single job, and that job is the only thing running across all threads (including main). Interesting…

Because the version check works per chunk, I am thinking of moving some components into a shared component to split my entities into chunks better. That might improve performance a bit (at least I hope).
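
To illustrate why chunk layout matters for change filtering, here is a minimal plain C# sketch. It does not use the Entities package: `ChunkModel` and `ChangeFilterDemo` are hypothetical names, and the wrap-safe version comparison is an assumption that mirrors how change versions are commonly compared, not a copy of Unity's code.

```csharp
// Simplified stand-in for a chunk: one change version per component type,
// shared by every entity stored in the chunk. Not Unity's actual type.
public sealed class ChunkModel
{
    public uint LocalToParentVersion;
    public readonly int EntityCount;

    public ChunkModel(int entityCount, uint initialVersion)
    {
        EntityCount = entityCount;
        LocalToParentVersion = initialVersion;
    }

    // Write access stamps the whole chunk, no matter how many
    // entities in it were actually modified.
    public void MarkWritten(uint globalSystemVersion) => LocalToParentVersion = globalSystemVersion;

    // Assumed wrap-safe "changed since the system last ran" comparison.
    public bool DidChange(uint lastSystemVersion) =>
        LocalToParentVersion == 0 ||
        unchecked((int)(LocalToParentVersion - lastSystemVersion)) > 0;
}

public static class ChangeFilterDemo
{
    // Counts the entities a chunk-filtered job would have to visit.
    public static int EntitiesVisited(ChunkModel[] chunks, uint lastSystemVersion)
    {
        int visited = 0;
        foreach (var chunk in chunks)
        {
            if (chunk.DidChange(lastSystemVersion))
                visited += chunk.EntityCount;
        }
        return visited;
    }
}
```

In this model, one moving entity stored in a chunk with 99 static ones makes the job visit all 100 entities, while keeping the moving entity in its own chunk reduces that to 1. Whether the split pays off in a real project also depends on the cost of the extra, emptier chunks.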

Is it possible that all of your objects are grouped under the same root parent in the hierarchy? This isn’t an optimal arrangement, because jobs are scheduled to process each root hierarchy in parallel (similar to how it works for good old GameObjects).

https://discussions.unity.com/t/791535/2

@apkdev , well, that is true!

Most of the stuff is under the World entity, and there are a lot of children inside Chunk and ParticlesHolder (especially the latter: ~3k root entities per game element, i.e. player, agent, or any booster).

I will continue digging that up thanks to you!

Strange. It didn’t help.
Actually, it became worse (2.49 ms).


Would be nice to fix that tail and gain a few free milliseconds :confused:

Do you have skinnedMesh models with many bones? These can quickly produce a very deep hierarchy, resulting in a large single-threaded cost.

Thank you, @Arnold_2013 . I will check that.

But as a blind guess, I think the problem is somewhere else. In my screenshots the profiler shows only 2 scheduled jobs, even though I would expect at least 6 (one for each root hierarchy).

Nope. Nothing like that. I moved my entities with MeshRenderer into separate hierarchies, and still 1 single job.
I think we need help from the Unity guys here. I am sure this is a common problem for every DOTS project.

Edit: Oh, actually now it is 3 jobs, but still 1 is about 2.3ms (another one is ~0.3ms on main thread)

First off, Unity’s system divides work at the granularity of chunks of roots (that is, entities that don’t have a Parent component).

Second, due to a bug in change filters, Unity recurses through all entities in the hierarchy and randomly accesses their LocalToWorld values. This is what my Improved Transforms fixes. However, even that still has to recurse the hierarchy because at any depth something could have changed.
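
The difference between loading every visited entity's LocalToWorld and loading it only on demand can be sketched with a toy model. This is plain C# with 1D offsets standing in for matrices; `Node`, `LazyPropagation`, and the read counter are hypothetical and only loosely mirror the structure of the job posted above.

```csharp
using System.Collections.Generic;

public sealed class Node
{
    public float Local;   // offset relative to the parent (stand-in for LocalToParent)
    public float World;   // stand-in for LocalToWorld
    public bool  Dirty;   // stand-in for "changed since lastSystemVersion"
    public readonly List<Node> Children = new List<Node>();
}

public static class LazyPropagation
{
    public static int WorldReads;   // counts simulated random accesses

    static float ReadWorld(Node n) { WorldReads++; return n.World; }

    // Eager: every unchanged node still loads its own world value for its children.
    static void Eager(Node node, float parentWorld, bool update)
    {
        update |= node.Dirty;
        float world;
        if (update) { world = parentWorld + node.Local; node.World = world; }
        else world = ReadWorld(node);
        foreach (var child in node.Children) Eager(child, world, update);
    }

    // Lazy: the parent's world value is loaded only when a child needs it.
    static void Lazy(Node node, Node parent, ref float parentWorld, ref bool parentValid, bool update)
    {
        update |= node.Dirty;
        float world = 0f;
        bool  worldValid = false;
        if (update)
        {
            if (!parentValid) { parentWorld = ReadWorld(parent); parentValid = true; }
            world = parentWorld + node.Local;
            node.World = world;
            worldValid = true;
        }
        foreach (var child in node.Children) Lazy(child, node, ref world, ref worldValid, update);
    }

    public static int RunEager(Node root)
    {
        WorldReads = 0;
        foreach (var child in root.Children) Eager(child, root.World, false);
        return WorldReads;
    }

    public static int RunLazy(Node root)
    {
        WorldReads = 0;
        float rootWorld = root.World;   // the root's value is treated as already available
        bool  valid = true;
        foreach (var child in root.Children) Lazy(child, root, ref rootWorld, ref valid, false);
        return WorldReads;
    }
}
```

For a static chain root → a → b → c, the eager traversal in this model performs three world reads and the lazy one performs none. Both still visit every node, which matches the point that the recursion itself cannot be skipped.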

I’m not sure if you have enough entities to justify it, but my Extreme Transforms isn’t bound by these rules. Instead, it divides entities into chunks at each depth level, processing level by level for the first 16 depths before switching over to the recursive approach (Improved version). You could probably modify the code to reduce the number of depths to 2 or 3 and make it more performant for smaller entity counts while still bypassing your world root.
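
As a rough illustration of the level-by-level idea, here is a small plain C# sketch. It uses 1D offsets instead of matrices and flat arrays instead of chunks, `DepthLevelUpdate` is a hypothetical name, and the switch to recursion past a fixed depth is left out.

```csharp
using System.Collections.Generic;

public static class DepthLevelUpdate
{
    // parent[i] is the index of node i's parent, or -1 for a root.
    // local[i] is node i's offset relative to its parent.
    // Returns the world offsets, computed one depth level at a time.
    public static float[] ComputeWorld(int[] parent, float[] local)
    {
        int n = parent.Length;

        // Bucket node indices by their depth in the hierarchy.
        var levels = new List<List<int>>();
        for (int i = 0; i < n; i++)
        {
            int depth = 0;
            for (int p = parent[i]; p >= 0; p = parent[p]) depth++;
            while (levels.Count <= depth) levels.Add(new List<int>());
            levels[depth].Add(i);
        }

        // Process shallow levels first, so every parent is final before its children run.
        // Nodes inside one level do not depend on each other, so each level
        // could be handed to a parallel job.
        var world = new float[n];
        foreach (var level in levels)
        {
            foreach (int i in level)
                world[i] = parent[i] < 0 ? local[i] : world[parent[i]] + local[i];
        }
        return world;
    }
}
```

Because the parallelism comes from the number of entities at each depth rather than the number of roots, a single shared root such as a World entity does not serialize the work in this scheme.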

Latest versions can be found in my framework. Latios-Framework/Core/Core/Systems/Transforms at v0.5.6 · Dreaming381/Latios-Framework · GitHub
Installers: Latios-Framework/Core/Core/Framework/CoreBootstrap.cs at v0.5.6 · Dreaming381/Latios-Framework · GitHub
