How do I figure out why my job takes so long?

So I’ve made a basic pathfinding job; it’s single-threaded and Bursted currently, as I want to get it as fast as possible that way first. It takes quite a while in general, but there’s also a sudden jump in how long it takes once a certain grid size is reached.

For example, it takes 101 ms on average in a grid of 18 cells, but then it jumps to 610 ms at 20 cells. Below 18 the values are reasonable and increase linearly, but I don’t understand the jump from 18 to 20 and why it grows so sharply.

I have a general question and a specific one. In general, how do I figure out why and how this is happening? Do I have to read the Burst assembly (please don’t say I have to read assembly)? I can use markers and such to drill down to exactly where the time is being taken, but that still doesn’t tell me why, or what to do about it. I guess I don’t know enough about profiling and optimizing in general, but jobs seem even harder, and the information out there is pretty scant and vague currently. Anyway, any advice is appreciated so I can nail this down. I’ve posted the job below in case anyone has detailed insights into why it’s taking so long.

Entities       
            .ForEach((in AllySelectedDat Allydudes) =>
            {
                CellData neigborCelldata = new CellData();
     
                for (int e = 0; e < PotentialMoveCells.Length; e++)
                {
                    indicesToCheck.Clear();

                    for(int p = 0; p < PotentialMoveCells.Length; p++)
                    {
                        var tempwhy = PotentialMoveCells[p];
                        tempwhy.bestCost = byte.MaxValue;

                        if (tempwhy.cost != 255)
                        {
                            tempwhy.cost = 1;
                        }

                        PotentialMoveCells[p] = tempwhy;
                    }
                    var Tempdesination = PotentialMoveCells[e];
                    Tempdesination.bestCost = 0;
                    Tempdesination.cost = 0;
                    PotentialMoveCells[e] = Tempdesination;
                    indicesToCheck.Enqueue(Tempdesination);
             
                    while (indicesToCheck.Count > 0)                        
                    {                    
                        CellData curcellDat = indicesToCheck.Dequeue();
                        var deneigbors = GridCellHelperStuff.GetNeighbours(curcellDat.gridIndex, jobflowfieldat.gridSize.y);
                        foreach(int2 neigh in deneigbors)
                        {
                            neigborCelldata = new CellData();
                            var cachepotent = 0;
                            for(int i = 0; i < PotentialMoveCells.Length; i++)
                            {
                                if(neigh.Equals(PotentialMoveCells[i].gridIndex))
                                {
                                    neigborCelldata = PotentialMoveCells[i];
                                    cachepotent = i;
                                }
                            }                      
                            if (neigborCelldata.cost + curcellDat.bestCost < neigborCelldata.bestCost)
                            {
                                neigborCelldata.bestCost = (ushort)(neigborCelldata.cost + curcellDat.bestCost);       
                                indicesToCheck.Enqueue(neigborCelldata);
                                PotentialMoveCells[cachepotent] = neigborCelldata;
                                if(neigborCelldata.gridIndex.Equals(Allycurrpos.int2Val) && neigborCelldata.bestCost <= Allymovedat.intVal)
                                {                               
                                    ConfirmMovePoints.Add(Tempdesination.gridIndex);                               
                                    goto Whilend;
                                }
                            }
                        }
                    }     
                Whilend:;
                }
            }).Schedule();

There are two reasons I know of where a sudden spike can happen after increasing the count slightly. One is exceeding a cache level’s capacity with your data. The other is hitting fallback allocations in a particular allocator. I would check the latter, given those Enqueue() calls.
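For example, here’s a rough sketch of sidestepping that: pre-size a NativeList and use it as the work container instead of the NativeQueue, so nothing allocates inside the hot loop. The capacity, names, and LIFO ordering here are illustrative assumptions, not your exact code:

```csharp
using Unity.Collections;

// Pre-allocate once, outside the hot loop, with room for the worst case.
var openCells = new NativeList<CellData>(gridWidth * gridHeight, Allocator.TempJob);

// Inside the job: push/pop without ever growing the container,
// as long as the count stays within the pre-allocated capacity.
openCells.Add(startCell);                              // "enqueue"
while (openCells.Length > 0)
{
    CellData cur = openCells[openCells.Length - 1];
    openCells.RemoveAtSwapBack(openCells.Length - 1);  // "dequeue" (note: LIFO order)
    // ... relax neighbours, Add() any whose bestCost improved ...
}
```

Note the LIFO order visits cells in a different order than a queue; the flood fill still converges because cells are re-added whenever their cost improves, but a FIFO order usually touches each cell fewer times.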


My best guess is that eventually the algorithm reaches a point where the necessary data won’t fit into the CPU’s L1/L2 cache anymore, and the slowdown is due to slow main-memory (or L3 cache) accesses.

If you were to parallelize this job and it didn’t take such a performance hit with the same dataset, that might be an indication. This is also why I wouldn’t optimize for a single job/thread first: you’ll likely shoot yourself in the foot if you don’t design and test it with the actual goal in mind (parallel processing).
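As a sketch of what that could look like (the per-entity scratch copy is the key point; the names and the read-only source array are assumptions about your setup):

```csharp
Entities
    .ForEach((in AllySelectedDat allyDudes) =>
    {
        // Each entity gets its own scratch copy, because multiple threads
        // cannot safely write to one shared PotentialMoveCells at once.
        var cells = new NativeArray<CellData>(potentialMoveCellsRO.Length, Allocator.Temp);
        cells.CopyFrom(potentialMoveCellsRO);

        // ... run the flood fill on the local copy ...
        // Any shared result container would need a ParallelWriter.
    }).ScheduleParallel();
```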


You said you can use markers, so I assume you know about Unity - Scripting API: ProfilerMarker. It is compatible with Burst.

Often you want to optimize nested loops. My initial thought is that the loop on line 36 needs improvement. Put a ProfilerMarker around it to time it relative to the entire process.
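Something along these lines (the marker names are just illustrative; ProfilerMarker.Begin()/End() work inside Burst-compiled code):

```csharp
using Unity.Profiling;

static readonly ProfilerMarker s_FloodFill     = new ProfilerMarker("Pathfind.FloodFill");
static readonly ProfilerMarker s_NeighbourScan = new ProfilerMarker("Pathfind.NeighbourScan");

// inside the job:
s_FloodFill.Begin();
while (indicesToCheck.Count > 0)
{
    // ...
    s_NeighbourScan.Begin();
    // the inner linear search over PotentialMoveCells
    s_NeighbourScan.End();
}
s_FloodFill.End();
```

Comparing the two marker timings in the Profiler window tells you what fraction of the fill is spent in the inner scan.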

If I understand correctly, can’t you add a break after line 41? Even better, can you use a sorted list or a different data structure for PotentialMoveCells to find the matching gridIndex much faster?

for(int i = 0; i < PotentialMoveCells.Length; i++)
{
    if(neigh.Equals(PotentialMoveCells[i].gridIndex))
    {
        neigborCelldata = PotentialMoveCells[i];
        cachepotent = i;
        break;
    }
}
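And if PotentialMoveCells happens to be stored in row-major grid order, you could skip the search entirely and compute the index directly. The layout formula below is an assumption; it has to match how the array is actually built:

```csharp
// O(1) lookup instead of an O(n) scan, assuming row-major layout
// where cell (x, y) lives at index x * gridSize.y + y.
int CellIndex(int2 gridIndex, int gridSizeY) => gridIndex.x * gridSizeY + gridIndex.y;

// usage inside the neighbour loop:
int i = CellIndex(neigh, jobflowfieldat.gridSize.y);
neigborCelldata = PotentialMoveCells[i];
cachepotent = i;
```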

Thanks for the replies.

Yeah, that was one of my first thoughts, that the cache was going above some limit, but I guess there’s no real way of telling that exactly. I did check the Burst Inspector, and it looks like there’s lots of yellow Allocator.Manager text; I’m guessing that might not be good. I’ll look at replacing the NativeQueue with a NativeList or something and see if that makes a difference.

The reason I didn’t want to parallelise it too early is that I didn’t think it would make much difference, and this job can easily be changed between the two.

Yeah, I am using the markers to pinpoint things, and thanks, I forgot about the break. But that for loop only seems to account for about 0.002 ms of each iteration; the if sequences below it are what seem to take up 90 percent of the time.


I’d say that in general, understanding performance once one is already Bursted and looking at assembly is a matter of understanding the hardware one is running on. Once you get really nitty-gritty, a hardware-specific profiler for that particular hardware can often be more useful.

There are many avenues for understanding modern CPU performance, but here are two of my favorites:


Really interesting links, thanks a lot for those; they are exactly the sort of thing I’m looking for. Part of my question was also how you judge the feasibility of your approach, like how do I know whether it’s possible to get my job fast enough to be used in a game? It seems like those might help me figure that out.
