Cache efficiency of IJobForEach/Entities.ForEach across chunks?

IJobForEach and Entities.ForEach appear to give you one thing to work on at a time, which means there must be a Burstable, invisible loop behind them.

The single component you requested could span multiple chunks if you use components/tag components liberally. (Liberally = to keep EntityQuery flexible, so there is always some way to select the chunks the game needs, while keeping data duplication to a minimum.)

Two things matter for memory performance here: the cache line size and prefetching. I am wondering whether I should limit my use of tag components so that the invisible part stays more “in the same linear piece”. Because these two are going to be the default access methods, I have to start caring about this.

Assumptions

  • Requesting one type of component.
  • 64 bytes of cache line size.
  • The memory cost of a fixed 16 KB chunk per archetype, plus one more each time an archetype goes over chunk capacity, is not a problem; I only care about performance.

Here’s what I guess is happening:

  • There is a routine that links up (but does not copy, since that would be expensive on each schedule?) multiple linear arrays of a component from different chunks.
  • So occasionally, when it jumps from chunk to chunk, I may waste up to 64 bytes of cache, because the chunks are physically far apart (fixed 16 KB chunk size) and most of the time there is a lot of empty space to cross.
  • If each component is one integer (4 bytes), then about 16 of them can be modified faster, because they come along for free in the same cache line per request.
  • Therefore, if I have about 30~50 things to iterate (not millions), having them all in the same chunk could make a big difference (this is a very hot code path), as opposed to having them fragmented into 4 chunks of 10~15 things each because of my tag usage.
  • Prefetching could also be a problem. Is there anything added so the compiler knows how to cross the chunk boundary and fetch the next data ahead of time, as if it were just one linear array? Or am I helping the prefetching too by being careful about my archetypes?
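To put rough numbers on the guesses above, here is a back-of-the-envelope sketch. It assumes each chunk's component array starts on a cache line boundary and that only this one 4-byte component is read; the function name and the 40-element example are made up for illustration.

```python
import math

CACHE_LINE_BYTES = 64  # assumption from the list above
COMPONENT_BYTES = 4    # one int per component

def cache_lines_touched(element_count, component_bytes=COMPONENT_BYTES):
    """Cache lines needed to read a contiguous, line-aligned run of components."""
    return math.ceil(element_count * component_bytes / CACHE_LINE_BYTES)

# How many 4-byte components share one cache line.
elements_per_line = CACHE_LINE_BYTES // COMPONENT_BYTES

# 40 components packed in one chunk vs. split evenly across four chunks.
one_chunk = cache_lines_touched(40)
four_chunks = sum(cache_lines_touched(n) for n in (10, 10, 10, 10))

print(elements_per_line, one_chunk, four_chunks)
```

Under these assumptions the fragmented layout touches one extra cache line (4 instead of 3), so at this scale the difference in lines fetched is small; the per-chunk switching overhead is a separate cost that this sketch does not model.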

If NOT using tag components liberally, in order to keep things in the same chunk, really pays off, I have to complement it with occasional data duplication, which reduces maintainability. (Now you have to remember to modify two copies of the same thing instead of one, etc.) That is, keeping a second version of the same data, arranged differently just for fast access.

For example, I may plan all cars with exactly the same archetype so they are all in the same chunk; they all run each frame and need calculation. (E.g. some cars don’t have a booster and don’t need booster stats, but I attach the component anyway for the sake of sharing an archetype and simply ignore it.)

But an equally hot code path requires only the red cars to check for collisions, so I would also like all the red cars separated (maybe via an ISharedComponentData holding an enum color Red). If I do that, the previous “all cars” routine has to cross from the non-red cars chunk to the red cars chunk. This is just 1 crossing, but in a real game it could get out of hand if we use tags/ISharedComponentData in a way that benefits EntityQuery “too much”.

But if this chunk crossing is negligible, then I can keep using tags the way I have been.

JobComponentSystem’s Entities.ForEach codegens into an IJobChunk. Both IJobChunk and IJobForEach iterate over an array of chunk pointers assembled by a prepareFilteredChunksList job.
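As a mental model only (this is not Unity's generated code), the hidden loop can be pictured as an outer loop over the list of matching chunks and an inner loop over each chunk's contiguous component array. The names `for_each` and `body` are hypothetical, and plain Python lists stand in for chunks.

```python
def for_each(chunks, body):
    """Model of the hidden loop: outer loop over chunks, inner loop over each array."""
    chunk_switches = 0
    for chunk in chunks:               # one step per chunk pointer in the list
        chunk_switches += 1
        for i in range(len(chunk)):    # linear walk inside a single chunk
            chunk[i] = body(chunk[i])
    return chunk_switches

# Three "chunks" of one component type, e.g. split apart by tags.
chunks = [[1, 2, 3], [4, 5], [6]]
switches = for_each(chunks, lambda value: value * 2)

print(chunks, switches)
```

The per-element work is the same however the entities are split; only the number of outer-loop steps changes with fragmentation.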

The cost of switching chunks is about the cost of a ComponentDataFromEntity access. It certainly isn’t negligible, but it is pretty cheap, especially compared to anything in non-DOTS land.

Whether Tags and ISharedComponentData are helpful or harmful really depends on the algorithm. For example, if I had a cloth sim job, having 16 entities in 4 chunks would be vastly superior to having 16 entities in one chunk because of multithreading. However, if I were forward-stepping a simulation n times in a loop, scheduling several IJobForEach.ScheduleSingle calls every iteration (ScheduleSingle because I am doing reads and writes on a NativeContainer), having everything in one chunk would be ideal.

My suspicion is that you are dealing with an O(n log n) or O(n^2) algorithm here. In that case, I would suggest copying the data out of chunks and into NativeArrays before the job and copying the results back after the job rather than force yourself to duplicate logic or add more work of manually filtering entities that could be pre-filtered with the archetype system.
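A minimal sketch of that copy-out/copy-back pattern, with Python lists standing in for chunks and NativeArrays. The helper names (`gather`, `scatter`, `count_neighbours`) and the neighbour-counting step are hypothetical stand-ins for whatever O(n^2) work the real job does.

```python
def gather(chunks):
    """Copy component values from every chunk into one flat array."""
    flat = []
    for chunk in chunks:
        flat.extend(chunk)
    return flat

def scatter(flat, chunks):
    """Copy results back into the chunks, in the same order they were gathered."""
    offset = 0
    for chunk in chunks:
        count = len(chunk)
        chunk[:] = flat[offset:offset + count]
        offset += count

def count_neighbours(positions, radius):
    """O(n^2) pass over the flat array: neighbours within radius of each element."""
    return [
        sum(1 for j, q in enumerate(positions) if j != i and abs(p - q) <= radius)
        for i, p in enumerate(positions)
    ]

positions = [[0, 1], [5], [6, 20]]   # one component type spread over three chunks
results = [[0, 0], [0], [0, 0]]      # a second component with the same layout

flat = gather(positions)
scatter(count_neighbours(flat, radius=1), results)

print(flat, results)
```

The quadratic part only ever sees one contiguous array, so the chunk layout can stay as fragmented as the queries need.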

Also keep in mind that at low entity counts, all your data will probably fit in cache, even if they are spread across multiple chunks. So there’s no real need to worry about prefetching.

And lastly, if your hot path is really hot, rather than worrying about if you have 1 chunk vs 4, you may want to dive into the Burst inspector and see if you can get your code better vectorized and doing less work.

Thanks, I think I was too focused on a single cache line and forgot that the cache as a whole has a much bigger capacity. I am still trying to use smaller data types such as ushort or byte to increase the number of elements read per cache line, though.
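For a rough sense of scale, here is the arithmetic for the element sizes mentioned, assuming the 64-byte cache line from the first post and a 32 KB L1 data cache (the L1 size is an assumption that varies by CPU).

```python
CACHE_LINE_BYTES = 64
L1_DATA_CACHE_BYTES = 32 * 1024  # assumed size, differs between CPUs

ELEMENT_BYTES = {"byte": 1, "ushort": 2, "int": 4}
ENTITY_COUNT = 50  # upper end of the 30~50 range from the first post

# Elements that arrive together in one cache line, per data type.
per_line = {name: CACHE_LINE_BYTES // size for name, size in ELEMENT_BYTES.items()}

# Total bytes of component data for all entities, per data type.
working_set = {name: ENTITY_COUNT * size for name, size in ELEMENT_BYTES.items()}

fits_in_l1 = all(total < L1_DATA_CACHE_BYTES for total in working_set.values())

print(per_line, working_set, fits_in_l1)
```

Even with 4-byte ints, 50 components take 200 bytes, a tiny fraction of the assumed L1 size, which matches the point above about low entity counts fitting in cache.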

Also, it has been a while since I touched ECS, and I completely forgot that .ForEach (when scheduled) / IJobForEach works in parallel with the chunk as the unit. Having multiple chunks (naturally/unintentionally, from using tags/SCD liberally) seems to fit well with this. Combined with the maintainability benefit, sparse chunks sound like the better idea. Maybe in the future, if there is an option to select a smaller chunk size in the build settings or something, it would be perfect.
