Could moving the code to the data sometimes be faster?

OK DOTS optimises processing performance by ensuring great cache utilisation.

So data can quickly be moved into caches for processing, but we know that moving large blocks of data anywhere is inherently slower than moving small blocks.

If the code is smaller than the data, would it be faster to move the code to the data? And can DOTS, the CPU, or the OS move the code whenever that is faster than moving the data?

What?


What?

Edit: the closest thing that comes to mind is the hot/cold splitting optimization pass in LLVM, which is probably inherited by Burst.

You can try using the Hint class to help Burst determine which branches are likely to be hot or cold. This makes it more likely that your code will fit in the L1 instruction cache (32KB per core on most x86 CPUs, I think).
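
As a rough sketch of how that can look (assuming a Burst version where Unity.Burst.CompilerServices.Hint is available; the job itself is made up for illustration):

```csharp
using Unity.Burst;
using Unity.Burst.CompilerServices;
using Unity.Collections;
using Unity.Jobs;

[BurstCompile]
struct ClampJob : IJob
{
    public NativeArray<float> Values;

    public void Execute()
    {
        for (int i = 0; i < Values.Length; i++)
        {
            // Tell Burst this branch is rarely taken, so the cold path
            // can be moved out of the hot loop's instruction stream.
            if (Hint.Unlikely(Values[i] < 0f))
                Values[i] = 0f;
        }
    }
}
```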

Hard to tell exactly what you want to say, but:

  • to the processor, code is data too: each core has one L1 cache for code and another L1 cache for data, while L2 and L3 (where they exist) hold both code and data
  • to process anything, the processor needs two data sets: one for the data to process and one for the code instructions, i.e. the “how to process”

So “great cache utilisation” actually means having cache misses neither for data nor for code!
DOTS packs all the code into a job: Burst compiles all the code for the job into one big function, one big set of instructions (ideally without branches), to be as efficient as possible for cache prefetch. ECS packs all the data into contiguous chunks for better cache prefetch inside each chunk, and promises to find a way to prefetch the next chunk while working on the current one.
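
To illustrate (a deliberately minimal, hypothetical IJob, not the actual ECS internals): Burst compiles Execute into one contiguous block of machine code, and the NativeArray gives it one contiguous block of data, so both the instruction cache and the data cache can prefetch straight through.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

[BurstCompile]
struct ScaleJob : IJob
{
    public NativeArray<float> Data; // one contiguous chunk of data
    public float Factor;

    public void Execute()
    {
        // A single tight loop: the code fits in the instruction cache,
        // and the linear reads let the hardware prefetcher stream the data.
        for (int i = 0; i < Data.Length; i++)
            Data[i] *= Factor;
    }
}
```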

Any random access kills processor performance:
You can access data on another entity while processing the current one - that is random access for data → often a cache miss.
You can call a virtual function (in OOP, or a FunctionPointer in ECS) - that is random access for code → often a cache miss.
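
As a rough sketch of the data side (the job and names below are hypothetical, just to illustrate the access patterns): the first loop walks memory linearly, the second jumps through it via an index table, which is the kind of random access that tends to miss the cache.

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

[BurstCompile]
struct SumJob : IJob
{
    [ReadOnly] public NativeArray<float> Values;
    [ReadOnly] public NativeArray<int> ShuffledIndices; // e.g. "data on some other entity"
    public NativeArray<float> Result; // [0] = linear sum, [1] = random-order sum

    public void Execute()
    {
        float linear = 0f, random = 0f;

        // Linear walk: prefetcher-friendly, few cache misses.
        for (int i = 0; i < Values.Length; i++)
            linear += Values[i];

        // Indirect walk: each lookup can land on a different cache line.
        for (int i = 0; i < ShuffledIndices.Length; i++)
            random += Values[ShuffledIndices[i]];

        Result[0] = linear;
        Result[1] = random;
    }
}
```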

Have a nice day :slight_smile:

I think OP is being funny, but in case they’re not:

For the purposes of this comment, let’s say data is memory you can read and write, but not execute, and code is memory you can read and execute, but not write.

The CPU does not run faster when the executable code is located in memory near the data. (There are some minimal caveats, like if the code is data, such as with some JIT languages, but we can safely ignore those around here.)

In fact, on modern CPUs, appreciable effort has gone into making sure code and data don’t occur in the same memory regions for reasons of security and stability. Any unit of memory that matters for performance (page, cache line, …) is generally not allowed to contain both data and executable code.

The strategies for optimizing the memory layout of data differ greatly from comparable strategies for code. With data, you generally want things to neatly align with page boundaries, cache lines and register sizes. At each level, the specific reasons are different, but generally have to do with enabling the CPU to efficiently load things into types of memory where they are faster to access, or can be operated on (cache, registers…). The layout of data is pretty easy to optimize well, which is what allows ECS to make intelligent decisions about where to put stuff in memory.

(I’d go as far as saying ECS is mostly a fancy malloc, but I might get yelled at for saying that.)
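
To make the data side concrete, here is a toy sketch (hypothetical types, not how Entities actually lays out its chunks) of what optimising data layout means in practice: the structure-of-arrays version keeps each field contiguous, so a loop that only touches positions streams through cache lines full of nothing but positions.

```csharp
using Unity.Mathematics;

// Array-of-structures: iterating only positions also drags Velocity and
// Health through the cache, wasting most of each 64-byte cache line.
struct ParticleAoS
{
    public float3 Position;
    public float3 Velocity;
    public float  Health;
}

// Structure-of-arrays: each field is stored contiguously, so a loop over
// Positions reads nothing but positions. This is the kind of layout that
// ECS chunks aim for per component type.
struct ParticlesSoA
{
    public float3[] Positions;
    public float3[] Velocities;
    public float[]  Healths;
}
```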

Memory that contains code to be executed gets fetched into different caches and different registers. These have different properties, which are a lot harder to intuit. (Mostly they have to do with enabling branch prediction and speculative execution, which boils down to the CPU never running out of stuff to do.)

This principle is used in large distributed systems: sometimes it is a lot faster to move the ‘code’ to the ‘server’ / [data] instead of moving the data to the code (using the server over the network). It doesn’t work the way you describe, but the principle holds in systems where the cost of moving either is very high, which is often the case in transactional distributed systems. I know a few systems that work by passing functionality through data servers, keeping the data always local to the machines maintaining that part; it can be orders of magnitude faster in certain situations. That said, those have nothing to do with normal games and are quite specialized in purpose, like estimating power use for all homes in a country. (Built that.)

I do want to stress that this principle is very close to becoming real, with Samsung experimenting with small processing capabilities on DDR memory. I doubt, however, that we as programmers will get much control over that.

JIT languages are a different beast. It makes all the difference whether you thrash your cache constantly or not, but you lose performance in the translation step from the ‘JIT’ data to native code, and any speed benefit from locality you will lose again in that compilation step.

It really is quite simple: if the data is larger than the code and that data is already in a core’s cache, then moving the code to that core should be faster than moving the data up and over the cache hierarchy to another core.

That’s actually pretty smart, but it probably makes a lot more sense in mobile phones, TVs and the like. Even if the data can be processed faster locally in memory, at some point the data likely has to move into the CPU cache anyway.
Still a very interesting prospect. Let’s see how this goes, or whether it’s just a neat idea that will fail in practice.

All small operations running in a pattern over blocks of memory would probably benefit; the cache would also be used less for those blocks and could be used for other things. It would also save loads of transfers on the memory bus. I guess it will find its way everywhere, though probably not initially in phones, because it will use more power than normal memory and it won’t replace all tasks in the CPU, so it’s not a direct replacement for the power consumed in the CPU. I’m probably wrong, however… seeing Samsung making these…

But then again, let’s see what the future holds. It’s exciting.