# Random access and L2/L3 cache help?

Hello. I was hoping someone could help me understand my problem. I have an A* pathfinding algorithm running on a custom NavMesh, and I have been trying to figure out the best way to lay out my data so that I avoid cache misses. So far I have thought of three scenarios. A quick run-through of the pathfinding for context:

```
if start and end are in the same area:
    walk there
else if start and end are in a hierarchical area custom-marked as "sparse":
    just walk there
else if the distance to the end goal is within some short threshold:
    raycast; if there is LOS, walk there, else A*
else:
    A*
```
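That decision flow could be sketched as a small dispatch function (C++ for illustration; the predicate and threshold parameters are placeholders, not the actual code):

```cpp
#include <cassert>

enum class Strategy { WalkDirect, RaycastThenAStar, AStar };

// Mirrors the branch order above: same area or a "sparse" hierarchical
// area means walking straight there; a short distance means trying a
// line-of-sight raycast before falling back to A*; otherwise full A*.
Strategy chooseStrategy(bool sameArea, bool sparseArea,
                        float distance, float shortRange) {
    if (sameArea || sparseArea) return Strategy::WalkDirect;
    if (distance <= shortRange) return Strategy::RaycastThenAStar;
    return Strategy::AStar;
}
```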

Also, when doing A*, some nodes are marked as special, meaning they are entrances into an area of dense nodes with precalculated paths. These nodes have two sets of neighbours: the regular ones, and the other entrance nodes they are connected to. So if the destination isn't within that cluster, the search skips checking all the nodes inside it.
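A hierarchical node with those two neighbour sets might look something like this (an illustrative sketch with made-up names, not the actual structs; it is the same idea as HPA*-style entrance nodes):

```cpp
#include <cstdint>
#include <vector>

// A node that is an entrance into a dense cluster: it carries its
// regular adjacency plus precomputed links to the cluster's other
// entrances, so a search can hop across the cluster without expanding
// the nodes inside it.
struct HierNode {
    uint16_t clusterId;
    std::vector<uint16_t> neighbours;     // regular adjacency
    std::vector<uint16_t> entranceLinks;  // other entrances, precomputed
};

// During expansion: if the goal is in a different cluster, follow only
// the entrance links; otherwise descend into the regular neighbours.
const std::vector<uint16_t>& expansionSet(const HierNode& n,
                                          uint16_t goalCluster) {
    return (n.clusterId == goalCluster) ? n.neighbours : n.entranceLinks;
}
```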

So, as for cache misses, the options I see are:

1. Store all nodes in one array. The first element is the actual node, and the next 4-5 elements are its neighbour nodes. These are exact copies of the actual node structs, with root indexes pointing back to their real positions in the array. This means all neighbour checking stays in L1 cache, but it takes more memory.

2. Store an array of node structs with full values, plus a separate int array of neighbour indexes. This takes the least memory, but every neighbour check is a random access.

3. Store an array of full nodes and an array of slimmer "neighbour nodes", where each full node points to a location in the second array and checks all its neighbours there. This gives a linear neighbour check at the cost of one extra random access from the node array to the neighbour array, and uses less memory than #1.
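For concreteness, option #3 could be sketched like this (C++ with illustrative field names; the actual structs appear further down the thread):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Node      { uint16_t x, y; uint8_t nNeighbours; };
struct Neighbour { uint16_t x, y, rootIndex; };

struct NavData {
    std::vector<Node>      nodes;
    std::vector<Neighbour> neighbours; // node i's block starts at i * 4

    // Visiting node i's neighbours: one jump into `neighbours`, then a
    // short sequential read. That jump is the "one extra random access"
    // this layout trades for the smaller footprint.
    template <typename F>
    void forEachNeighbour(std::size_t i, F&& f) const {
        const std::size_t base = i * 4;
        for (std::size_t k = 0; k < nodes[i].nNeighbours; ++k)
            f(neighbours[base + k]);
    }
};
```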

My question is this: is it worth sacrificing a fully linear L1 node search for less memory, so that all the data fits in L2 and L3 cache?

For example, #1 keeps the entire neighbour check in L1, but fetching the data in the first place might require a trip to L3. With #2 everything fits in L2, but every check is an L1 cache miss. With #3 there is one potential L1/L2 miss, but the neighbour check itself is still all in L1.

So is keeping everything in CPU cache worth sacrificing the linear search of nodes? Is lots of random access to L2 faster than occasional, but far rarer, L3 or RAM accesses?

Here is my memory usage with scenario #3, worst case:

NavPointRoot - 30,000 kb
NavPointNeighbour - 96,000 kb
NavPointHiNeighbour - 20,00 kb
NavArea - 42,000 kb
NavIndexArray - 6,000 kb
NavEdge - 30,000 kb

All of this fits in most L2 caches, but I won't get linear L1 reads. If I go with option #1, I will have to go to L3 more often.

And is it even worth it if all my values have to be converted to ints every time in order to fit in cache?

```csharp
[StructLayout( LayoutKind.Sequential , Pack = 1 )]
public struct NavArea
{
public AreaType type;
public int nodesIndex;
public ushort p1;
public ushort p2;
public ushort p3;
public ushort p4;
public ushort p5;
}

[StructLayout( LayoutKind.Sequential , Pack = 1 )]
public struct NavEdge
{
public bool walkable;
public ushort px1;
public ushort py1;
public ushort px2;
public ushort py2;
}

[StructLayout( LayoutKind.Sequential , Pack = 1 )]
public struct NavPoint
{
public ushort type;
public ushort px;
public ushort py;
public ushort rootIndex;
public byte nNeighbours;
public byte nEntrances;
}

[StructLayout( LayoutKind.Sequential , Pack = 1 )]
public struct NavNeighbour
{
public ushort type;
public ushort px;
public ushort py;
public ushort rootIndex;
}
```
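As a sanity check on the packing, here are C++ equivalents of three of those structs with compile-time size checks (NavArea is omitted since AreaType's size isn't shown; this also assumes a 1-byte bool, which holds on mainstream compilers):

```cpp
#include <cstdint>

// C++ mirrors of the packed C# structs, to verify the byte counts.
// Pack = 1 in C# corresponds to #pragma pack(1) here.
#pragma pack(push, 1)
struct NavEdge {
    bool walkable;                    // 1 byte
    uint16_t px1, py1, px2, py2;      // 8 bytes
};
struct NavPoint {
    uint16_t type, px, py, rootIndex; // 8 bytes
    uint8_t nNeighbours, nEntrances;  // 2 bytes
};
struct NavNeighbour {
    uint16_t type, px, py, rootIndex; // 8 bytes
};
#pragma pack(pop)

static_assert(sizeof(NavEdge)      == 9,  "packed NavEdge is 9 bytes");
static_assert(sizeof(NavPoint)     == 10, "packed NavPoint is 10 bytes");
static_assert(sizeof(NavNeighbour) == 8,  "packed NavNeighbour is 8 bytes");
```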

This is actually slightly more memory, unless I am misunderstanding something: just because you split the neighbour nodes into a separate array doesn't mean you are consuming less memory. Ultimately your access patterns are what matter. Definitely profile to see which part takes much longer than you would expect and looks like a bandwidth issue.

With this specific scenario, there would be an array of NavPoint, and an array of NavNeighbour four times that size. In #1, there would be a single array of NavPoint structs, five times the number of NavPoints. That one uses more memory.

Anyway, what I am trying to get help understanding is the relative speeds of the L1, L2, and L3 caches. Is the difference in speed between L1 and L2, or between L2 and L3, an order of magnitude? Less? More? How about L3 vs RAM? Profiling all these scenarios would take a lot of time, because maps can vary greatly in size and number of nodes, and different CPUs have different hardware.

So is there a general ratio of speed between the different cache levels? I am trying to base my access patterns on this.

4 + 1 = 5. Plus, in the dual-array case you have to store the offset into the neighbour array.

L2 ~ 10x L1
RAM ~ 10x L2
L3 is weird and has more to do with context switching and keeping things together than anything to target performance-wise. Most people I know ignore it when discussing cache performance.
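To put rough numbers on that, here is a toy cost comparison using commonly cited latencies (these constants are assumptions for illustration, not measurements; real values vary a lot by CPU):

```cpp
// Rough, commonly cited load latencies in cycles. Assumptions only.
constexpr double kL1  = 4;
constexpr double kL2  = 14;   // roughly 10x-ish L1, as above
constexpr double kRam = 200;  // "several hundred cycles"

// Expanding one node with 4 neighbours:
// all-random but L2-resident: 4 loads that miss L1 and hit L2.
constexpr double randomL2Cost = 4 * kL2;         // ~56 cycles
// sequential, but the larger working set spilled: one RAM miss,
// the remaining reads ride the same cache line(s) in L1.
constexpr double spillCost    = kRam + 3 * kL1;  // ~212 cycles

// Under these assumptions, several L2 hits still beat one RAM miss.
static_assert(randomL2Cost < spillCost, "L2 random access wins here");
```

The point of the toy model is only the ordering: a handful of L2 hits is cheaper than a single trip to RAM, which is why keeping the working set L2-resident tends to pay off.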


Ok, maybe I explained it wrong. Struct NavPoint is 10 bytes in size; struct NavNeighbour is 8 bytes.

#1: One array of NavPoint. Searching means accessing an index in the array from the priority queue, then checking the neighbours, which are consecutive in the array. The array is laid out as a root NavPoint followed by its 4 neighbours, stored as NavPoint structs.

So: nNodes * (1 root + 4 neighbours) * 10 bytes.

#3: An array of NavPoint, plus an array of NavNeighbour (4 per NavPoint). The NavNeighbour structs are 8 bytes; neighbours are stored as NavNeighbour.

nNodes * 1 * 10 bytes + nNodes * 4 * 8 bytes
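Those formulas work out to 50 vs 42 bytes per node, so #3 does save memory, roughly 16% per node. As a quick check:

```cpp
// Struct sizes from the thread: NavPoint = 10 bytes, NavNeighbour = 8.
constexpr int kNavPointBytes     = 10;
constexpr int kNavNeighbourBytes = 8;

// Layout #1: 1 root + 4 neighbour copies, all stored as NavPoint.
constexpr int kLayout1PerNode = (1 + 4) * kNavPointBytes;                // 50
// Layout #3: 1 NavPoint + 4 slimmer NavNeighbour records.
constexpr int kLayout3PerNode = kNavPointBytes + 4 * kNavNeighbourBytes; // 42
```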

Thank you! I didn't realize these things; this definitely helps a lot! It seems like I should aim to keep everything in at least L2 then, so the lighter memory layout?

Also, I know this may be a bad question, but roughly how many cycles would a RAM access cost compared to something like a float multiplication? Very, very broadly?

RAM access takes several hundred cycles. ALU operations go through roughly a 20-stage pipeline on x86, but throughput is multiple instructions per cycle, so quite a few instructions can be in flight at once.


Awesome, thanks.

If I may… I understand, somewhat, that L3 is also used when cores work in parallel. Does it have any relevance to a single core?

Based on what you have said, I have arranged my data so it can all fit in L2 at the same time (assuming an L2 of 256 KB). Now, what would be your thinking on the following?

I only need to access the areas twice per path search: once to get the unit's current area and all the nodes in it, and once to see if the final destination is in the same area.

I only ever check edges when doing a LOS test, and that only happens when the target destination is within a short distance, so at worst I access around 20-30 edges and do line-intersection tests.

If I store these in a MultiHashMap, they will fit in L2 with everything else, which means for each path search I really only go to RAM once!

But hashing is (can be) slow?

Otherwise, I could store a grid of the world, with lots of wasted space, which definitely will not fit into cache. But access is fast.

So I will profile both, but I'm curious about your thoughts: is a hash, plus maybe a hash collision, more or fewer cycles than a RAM access followed by linear accesses for the edges?

Would this change if, instead of 20-30 edges, the max were around 5-6?

You are going to have to profile. These things vary wildly across different machines.

Is this the reason optimization can be so hard?


One of many.


When laying out structs, should they be sized in multiples of 4 bytes? I have read in some places that there can be issues when the CPU expects data to be aligned in a certain way.

Intervals of 4 what?

4 bytes. As in, a struct can be 4 or 8 bytes, but shouldn't be 2 or 6?

There can be problems, but I wouldn't pad structs just to avoid the issue; bandwidth is typically the bigger problem. As always, profile.
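To make the alignment trade-off concrete, here is a C++ sketch (C#'s Pack = 1 on the structs above behaves analogously):

```cpp
#include <cstdint>

// With natural alignment the compiler inserts padding so `b` lands on
// a 4-byte boundary, and pads the tail so arrays stay aligned: the
// fields total 8 bytes but sizeof is 12.
struct Aligned { uint16_t a; uint32_t b; uint16_t c; };

// Packing removes the padding, so sizeof is the raw 8 bytes, but loads
// of `b` may then be unaligned. That is legal (if sometimes slower) on
// x86, and can fault on some other architectures.
#pragma pack(push, 1)
struct Packed  { uint16_t a; uint32_t b; uint16_t c; };
#pragma pack(pop)

static_assert(sizeof(Aligned) == 12, "padded for natural alignment");
static_assert(sizeof(Packed)  == 8,  "no padding when packed");
```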

Ok, thanks. I'm sorry if my questions are really naive or broad; I only have a few years of experience, mostly in scripting, and very little schooling, so everything I learn is online and through practice.