Please help me understand why the second job is slower

I have decided to implement Conway's Game of Life using the job system and optimise it as best I can.
I have created the first job, GameOfLifeJob (source code below). It uses pre-calculated codes for neighbours, so there are no ifs checking for the map border. It works quite well; I then use select from math to set the alive or dead state. Another job then copies this state back to the tiles.
Then I decided to try to get rid of this select. So I came up with the idea of also precalculating the dead/alive state for all combinations of neighbours and the current cell's alive state, and storing it in a blob asset. Then I calculate the code from the neighbours and the current cell and read the result from this array.
I do not understand why this is much slower than the previous approach. About 20% slower.
Is math.select so fast that it is not worth optimising away?

[BurstCompile]
    public struct GameOfLifeJob:IJobParallelFor
    {

        [ReadOnly] public NativeArray<MapTile> Tiles;
        [ReadOnly] public BlobAssetReference<MapDataBlob> MapBlob;

        [WriteOnly] public NativeArray<byte> State;
        public void Execute(int index)
        {
            var tile = Tiles[index];
            var directionsStartEnd = MapBlob.Value.CodeStartEndIndex[tile.availableDirectionsCode];
            byte aliveNeighbours = 0;
            for (int i = directionsStartEnd.startIndex; i < directionsStartEnd.endIndex; i++)
            {
                var direction = MapBlob.Value.DirectionData[i];
                var newTile = Tiles[index + direction.IndexOffset];
                aliveNeighbours += newTile.isOccupied;
            }

            State[index] = (byte) math.select(0, 1,
                (tile.isOccupied == 1 && aliveNeighbours == 2) || aliveNeighbours == 3);


        }

    }
    [BurstCompile]
    public struct GameOfLifePrecalculatedJob:IJobParallelFor
    {

        [ReadOnly] public NativeArray<MapTile> Tiles;
        [ReadOnly] public BlobAssetReference<MapDataBlob> MapBlob;
        [WriteOnly] public NativeArray<byte> State;

        public void Execute(int index)
        {
            var tile = Tiles[index];
            var directionsStartEnd = MapBlob.Value.CodeStartEndIndex[tile.availableDirectionsCode];
            int code = tile.isOccupied << 8;
            for (int i = directionsStartEnd.startIndex,bit = 0; i < directionsStartEnd.endIndex; i++,bit++)
            {
                var direction = MapBlob.Value.DirectionData[i];
                var newTile = Tiles[index + direction.IndexOffset];
                code += newTile.isOccupied << bit;
            }

            State[index] = MapBlob.Value.LifeState[code];

        }

    }

[quote=“Micz84”, post:1, topic: 813348]
I do not understand why this is much slower than the previous approach. About 20% slower.
Is math.select so fast that it is not worth optimising away?
[/quote]

I got somewhat the same result performance-wise: https://discussions.unity.com/t/812189
Accessing blob memory a second time for some pre-calculated data makes it slower.
Re-calculating may be the better choice, as long as your calculation is only a few instructions. The difference is not easy to measure; it depends on your CPU's cache state and the size of the blob data. Contiguous memory access is always better.

And yes, math.select is fast. Much better than c?a:b in most cases, as math.select involves absolutely no branching. When a or b is an expression instead of a simple variable, c?a:b is very likely to generate a branch, because the compiler does not know whether there is aliasing, or whether evaluating expression a could change the value of b.
With select, both a and b are evaluated before being passed to the function, so there is no branch for sure.

Also, select can be auto-vectorized by Burst to operate on 4 elements at a time (SIMD).
Your first job looks perfectly fine for auto-vectorization.


math.select is typically faster than a memory read, especially if the memory read misses L1 (which I suspect given the performance impact despite the operation happening in the outer loop).

Thank you for your detailed reply. So right now the slowest part is assigning the result to the array. This job takes about 1.5 ms when I replace:

State[index] = (byte) math.select(0, 1, (tile.isOccupied == 1 && aliveNeighbours == 2) || aliveNeighbours == 3);

with

tile.isOccupied = (byte) math.select(0, 1, (tile.isOccupied == 1 && aliveNeighbours == 2) || aliveNeighbours == 3);

When assigning to the array, this job takes about 25 ms. Is assigning to an array really so slow? Is it due to false sharing? Or is the Burst compiler clever enough to detect that the second version does nothing meaningful and simply not run it at all?


Probably all of the above. You can check the Burst Inspector to see if it is stripping that logic. I can't comment much on the other possibilities, as I don't know your data layouts.

Yes! The Burst compiler is clever enough to detect that the code does nothing meaningful and simply does not run it at all.
Same as my test with curves: I once thought my curve was 300 times faster than Unity's curve, but the job turned out to run in constant time no matter how large the input data was.


Also, if performance is so critical, you can try stack-allocating an array of bytes, storing the calculation results in it, processing the data in batches, and using UnsafeUtility.MemCpy to pass the data back.
Something like:

        public struct MapTile
        {
            public const int ByteSize = 5;
            public int availableDirectionsCode;
            public byte isOccupied;
        }
        public struct MapDataBlob
        {
            public BlobArray<StartEnd> CodeStartEndIndex;
            public BlobArray<Direction> DirectionData;
        }
        public struct StartEnd
        {
            public int startIndex;
            public int endIndex;
        }
        public struct Direction
        {
            public int IndexOffset;
        }
        [BurstCompile]
        public struct GameOfLifeJob : IJobParallelForBatch
        {
            [ReadOnly] public NativeArray<MapTile> Tiles;
            [ReadOnly] public BlobAssetReference<MapDataBlob> MapBlob;
            [WriteOnly] public NativeArray<byte> State;
            public int MapTileByteSize;

            unsafe public void Execute(int startIndex, int count)
            {
                MapTile* tileCache = stackalloc MapTile[count];
                UnsafeUtility.MemCpy(tileCache, Tiles.GetUnsafePtr<MapTile>(), count * MapTile.ByteSize);
                byte* stateCache = stackalloc byte[count];

                for (int j = 0; j < count; j++)
                {
                    ref var tile = ref tileCache[j];
                    var directionsStartEnd = MapBlob.Value.CodeStartEndIndex[tile.availableDirectionsCode];
                    byte aliveNeighbours = 0;
                    for (int i = directionsStartEnd.startIndex; i < directionsStartEnd.endIndex; i++)
                    {
                        var direction = MapBlob.Value.DirectionData[i];
                        //var newTile = tileCache[j + direction.IndexOffset];//Not working
                        var newTile = Tiles[startIndex + j + direction.IndexOffset];
                        aliveNeighbours += newTile.isOccupied;
                    }
                    stateCache[j] = (byte)math.select(0, 1, (tile.isOccupied == 1 && aliveNeighbours == 2) || aliveNeighbours == 3);
                }
                UnsafeUtility.MemCpy(State.GetUnsafePtr<byte>(), stateCache, count);
            }
        }

I am just guessing your data types; MapTile.ByteSize can be some const value.

So with a larger batch, you should reduce memory accesses dramatically.

I'm not sure if this is faster overall, but the inner loop will always hit cache, so it's blazing fast.
Let me know when you have a test result. ;)
You may also try not storing the data in a blob and calculating it on the fly, so you don't need to access remote memory in the inner loop at all. As I am not sure what is in the blob, I am keeping it as it is.

Edit:
As I looked into your logic, var newTile = tileCache[j + direction.IndexOffset] is not going to work, as you need to access adjacent rows/columns.
I changed it back to a native array access.
Maybe you can figure out a locally coherent container type (that's a bit tricky),
or copy just a block of tiles during batch initialization (2 extra columns and 2 extra rows, before and after the batch range).

I am also thinking about an extended GOL game, by the way. :)

May Mr. Conway rest in peace.

Yes, you are right, MapTile has a constant size. I will implement it today and let you know how it affects performance. The blob is calculated at start-up and stores mainly two things: an array of neighbour data (such as the index offset), and an array indexed by code that stores the start and end index into the neighbour data array for a particular code. So every tile has a code that corresponds to its valid neighbours. I originally implemented this for my flowfield. I should probably strip this array down to only the codes valid for a Game of Life map.

I have implemented it, but I had to make some modifications to make it work. And it runs comparably to the previous version.

public struct GameOfLifeBatchJob:IJobParallelForBatch
    {

        [NativeDisableParallelForRestriction]
        public NativeArray<MapTile> Tiles;
        [ReadOnly] public BlobAssetReference<MapDataBlob> MapBlob;

        [NativeDisableParallelForRestriction]
        [WriteOnly] public NativeArray<byte> State;
        unsafe public void Execute(int startIndex, int count)
        {
            MapTile* tileCache = stackalloc MapTile[count];
            NativeSlice<MapTile> mapTiles = new NativeSlice<MapTile>(Tiles, startIndex, count);
            NativeSlice<byte> states = new NativeSlice<byte>(State, startIndex, count);
            UnsafeUtility.MemCpy(tileCache, mapTiles.GetUnsafePtr<MapTile>(), count * MapTile.ByteSize);
            byte* stateCache = stackalloc byte[count];

            for (int j = 0; j < count; j++)
            {
                ref var tile = ref tileCache[j];
                var directionsStartEnd = MapBlob.Value.CodeStartEndIndex[tile.availableDirectionsCode];
                byte aliveNeighbours = 0;
                for (int i = directionsStartEnd.startIndex; i < directionsStartEnd.endIndex; i++)
                {
                    var direction = MapBlob.Value.DirectionData[i];
                    //var newTile = tileCache[j + direction.IndexOffset];//Not working
                    var newTile = Tiles[startIndex + j + direction.IndexOffset];
                    aliveNeighbours += newTile.isOccupied;
                }
                stateCache[j] = (byte)math.select(0, 1, (tile.isOccupied == 1 && aliveNeighbours == 2) || aliveNeighbours == 3);
            }
            UnsafeUtility.MemCpy(states.GetUnsafePtr<byte>(), stateCache, count);


        }

    }

I was getting an error at line 45 of your code. And I have added NativeSlice, because I think the MemCpy on Tiles and State does not use startIndex, so it always gets the first elements of the array, or am I missing something?


Oh, I did not offset the MemCpy. It should be:
UnsafeUtility.MemCpy(states.GetUnsafePtr<byte>()+startIndex, stateCache, count);
and the job:

public struct GameOfLifeBatchJob:IJobParallelForBatch
    {
        [NativeDisableParallelForRestriction]
        public NativeArray<MapTile> Tiles;
        [ReadOnly] public BlobAssetReference<MapDataBlob> MapBlob;
        [NativeDisableParallelForRestriction]
        [WriteOnly] public NativeArray<byte> State;
        unsafe public void Execute(int startIndex, int count)
        {
            MapTile* tileCache = stackalloc MapTile[count];
            NativeSlice<MapTile> mapTiles = new NativeSlice<MapTile>(Tiles, startIndex, count);
            NativeSlice<byte> states = new NativeSlice<byte>(State, startIndex, count);
            UnsafeUtility.MemCpy(tileCache, mapTiles.GetUnsafePtr<MapTile>(), count * MapTile.ByteSize);
            byte* stateCache = stackalloc byte[count];
            for (int j = 0; j < count; j++)
            {
                ref var tile = ref tileCache[j];
                var directionsStartEnd = MapBlob.Value.CodeStartEndIndex[tile.availableDirectionsCode];
                byte aliveNeighbours = 0;
                for (int i = directionsStartEnd.startIndex; i < directionsStartEnd.endIndex; i++)
                {
                    var direction = MapBlob.Value.DirectionData[i];
                    //var newTile = tileCache[j + direction.IndexOffset];//Not working
                    var newTile = Tiles[startIndex + j + direction.IndexOffset];
                    aliveNeighbours += newTile.isOccupied;
                }
                stateCache[j] = (byte)math.select(0, 1, (tile.isOccupied == 1 && aliveNeighbours == 2) || aliveNeighbours == 3);
            }
            UnsafeUtility.MemCpy(states.GetUnsafePtr<byte>()+startIndex, stateCache, count);
        }
    }

Try running the job with a larger batch size, like 512/1024.

I can't offset the pointer this way. There is an error saying that there is no + operator for void* and int.
Edit:
OK, I have cast it to byte* and it is working, but when I turn on Burst it crashes.


sorry

UnsafeUtility.MemCpy(((byte*)states.GetUnsafePtr<byte>())+startIndex, stateCache, count);


I didn't refresh the page. As I wrote in the edit, it crashes when Burst-compiled. Performance when not Bursted is comparable.
OK, I made a mistake in my MemCpy; now it is working, and it is about 10% faster than my implementation.


Okay, there are 3 more remote memory reads in your job:

  • MapBlob.Value.DirectionData[i]
  • var directionsStartEnd = MapBlob.Value.CodeStartEndIndex[tile.availableDirectionsCode];
  • var newTile = Tiles[startIndex + j + direction.IndexOffset];

The iteration count of reads 2 and 3 depends on directionsStartEnd. By removing one remote memory access you got about a 10% performance gain. If they were all calculated locally, or pre-cached into some local data chunk, it could be much faster. That is basically what ECS and ArchetypeChunk do.

I have found a way to get rid of the use of BlobArrays in the loop, but strangely it is slower now: the previous version takes about 24-26 ms and the new one 36-40 ms.
  • [BurstCompile]
    public struct GameOfLifeJob:IJobParallelForBatch
    {

    [ReadOnly] public NativeArray<MapTile> Tiles;
    [ReadOnly] public BlobAssetReference<MapDataBlob> MapBlob;
    [WriteOnly] public NativeArray<byte> State;
    unsafe public void Execute(int startIndex, int count)
    {
        byte* stateCache = stackalloc byte[count];
        int* directions = stackalloc int[8];
        UnsafeUtility.MemCpy(directions, ((int*)MapBlob.Value.MoveDirectionIndexOffset.GetUnsafePtr()), 32);
        for (int j = 0; j < count; j++)
        {
            var tile = Tiles[startIndex + j];
            var code = tile.availableDirectionsCode;
            byte aliveNeighbours = 0;
            for (int i = 0; i < 8; i++)
            {
                var valid = (byte) ((code >> i ) & 1);
                var direction = valid * directions[i];
                var newTile = Tiles[startIndex + j + direction];
                aliveNeighbours += (byte)(newTile.isOccupied * valid);
            }
    
            stateCache[j] = (byte)math.select(0, 1, (tile.isOccupied == 1 && aliveNeighbours == 2) || aliveNeighbours == 3);
        }
        UnsafeUtility.MemCpy(((byte*)State.GetUnsafePtr<byte>()) + startIndex, stateCache, count);
    
    }
    

    }

EDIT:

Strangely, without Burst the new solution is much faster than the old one, but with Burst the old one is faster.


    Okay, so your blob is only 32 bytes in size. It will fit in cache anyway. Burst will magically make it fast, because a blob has no aliasing for sure. The MemCpy is not necessary and will make it slower, and the new version does not cache the tiles, so the tile memory reads make it slower too. The 32-byte blob can be held in a struct and passed to the job directly by value. And processing one row per batch will allow you to cache tiles pretty well:
    you just need to cache three rows of tiles on the stack, rows [n-1, n, n+1], when processing row n.
    Then you will be able to work only on local data in the inner loop.

    Yes, I have tried to keep the tiles in a local cache, but it is still slower than the older version.

    [BurstCompile]
        public struct GameOfLifeJob:IJobParallelForBatch
        {
    
            [ReadOnly] public NativeArray<MapTile> Tiles;
            [ReadOnly] public NativeArray<int> MoveDirectionIndexOffset;
            [WriteOnly] public NativeArray<byte> State;
            public unsafe void Execute(int startIndex, int count)
            {
                const byte one = 1;
                int shiftLeft = startIndex == 0 ? 0 : count;
                int shiftRight = startIndex == Tiles.Length - count - 1 ? 0 : count;
    
                var tilesCacheSize = count + shiftLeft + shiftRight;
                MapTile* tiles = stackalloc MapTile[tilesCacheSize];
                UnsafeUtility.MemCpy(tiles , ((MapTile*) Tiles.GetUnsafeReadOnlyPtr()) - shiftLeft + startIndex, MapTile.ByteSize*tilesCacheSize);
                byte* stateCache = stackalloc byte[count];
    
                for (var j = 0; j < count; j++)
                {
                    var index = shiftLeft + j;
                    ref var tile = ref tiles[index];
                    //var index = startIndex + j;
                    //var tile = Tiles[index];
                    int code = (tile.availableDirectionsCode);
    
                    byte aliveNeighbours = 0;
                    for (var i = 0; i < 8; i++)
                    {
                        var move = code >> i;
                        var valid = (byte) (move & one);
                        var direction = valid * MoveDirectionIndexOffset[i];
                        var newTile = tiles[index + direction];
                        //var newTile = Tiles[index + direction];
                        aliveNeighbours += (byte)(newTile.isOccupied * valid);
                    }
                    stateCache[j] = (byte)math.select(0, 1, (tile.isOccupied == 1 && aliveNeighbours == 2) || aliveNeighbours == 3);
                }
                UnsafeUtility.MemCpy(((byte*)State.GetUnsafePtr()) + startIndex, stateCache, count);
    
    
            }
    
        }
    

    I am scheduling my job like this:

    Handle = lifeJob.ScheduleBatch(_Map.tiles.Length, mapWith);

    My question is: does this guarantee that every batch will have a count equal to mapWidth? That is necessary, because the job has to process the map row by row. When not Bursted, on a single thread the new job is about 3 times faster: 68 ms versus 214 ms on my machine.


    So maybe Burst is already doing the job of keeping the tile memory in cache, and the MemCpy for reading only makes it slower?
    Only the write cache is necessary.
    If that's the case, you have already hit the upper limit somehow.


    OK, thank you for your help and for explaining everything. I have learned a lot from you. :)