hi, anyone any experience with a lightweight scripting engine that can be used from burst code with only minimal semantics? i'm currently developing something that compiles a c-like script into a struct with a fixed byte[512] for the code segment (8 bit words) and a fixed float[16] for the data stack (enough for what i want for now). with a simple interpreter i'm able to use it in burst jobs, and while i haven't tested performance properly it feels very fast… currently i have support for scripts with simple arithmetic and i'm adding support for the functions from Unity.Mathematics, but i guess after that i'll be adding the next thing…
it feels like reinventing the wheel. what do you guys use in burstable code? everything i seem to find looks way too heavy and either impossible or not performant enough to use from burst.
The data structures required for script execution are usually full of pointers, ASTs, string manipulation, parsing and … you will probably not benefit much from coding the interpreter itself in burst, however it can create a set of assembly-like instructions to be executed by a burst runtime.
i compile the script into bytecode, much like assembly, and the burst interpreter for that bytecode is very simple: no pointers, no strings and no variable-length structures on the native side of things:
public unsafe struct BurstScriptExecutable
{
    /// <summary>
    /// unique script id
    /// </summary>
    public int Id;

    /// <summary>
    /// byte code compiled from script
    /// </summary>
    public fixed byte code[512];

    /// <summary>
    /// size of code in bytes
    /// </summary>
    public int code_size;

    /// <summary>
    /// index into bytecode, next point of execution; if == code_size then the end of the script is reached
    /// </summary>
    public int code_pointer;

    /// <summary>
    /// index into the data registers, the stack starts at data[start_stack_index]
    /// </summary>
    public int stack_pointer;

    /// <summary>
    /// input, output and scratch data fields
    /// </summary>
    public fixed float data[32];

    /// <summary>
    /// nr of data registers
    /// </summary>
    public int data_size;
}
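for reference, the interpreter loop is nothing fancy; a minimal sketch of its shape, with made-up opcodes (the real instruction set and operand encoding differ):

// sketch only: illustrative opcodes, not the real instruction set
enum Op : byte { nop = 0, push = 1, add = 2, mul = 3 }

static unsafe class Interpreter
{
    public static void Run(ref BurstScriptExecutable s)
    {
        while (s.code_pointer < s.code_size)
        {
            switch ((Op)s.code[s.code_pointer++])
            {
                case Op.push:
                    // the next byte indexes the data segment, its value goes onto the stack
                    s.data[s.stack_pointer++] = s.data[s.code[s.code_pointer++]];
                    break;
                case Op.add:
                    s.data[s.stack_pointer - 2] += s.data[s.stack_pointer - 1];
                    s.stack_pointer--;
                    break;
                case Op.mul:
                    s.data[s.stack_pointer - 2] *= s.data[s.stack_pointer - 1];
                    s.stack_pointer--;
                    break;
                default:
                    break; // nop / unknown
            }
        }
    }
}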
performance is already where i want it in burst jobs; it's only that it feels like i'm making stuff that's already been made… if it hasn't, i'll probably put it on github or something when it's ready… but i'd rather use a proven solution.
Using burst as a jit, the way c# can compile into il at runtime, does not seem possible for now; i'd really like to be able to do that. i could output IL instead of bytecode, but could i have the burst compiler compile it at runtime and give me back a function pointer?? no idea
It is not likely that people have built a scripting runtime in burst, or at least built one and shared it online. Probably fewer than 100 real companies are using burst directly at this moment, and most probably only for calculation-heavy stuff.
got it working and it's fast: running the script 10 million times is so fast i had to triple-check my timings:
my results for the following test:
#bscript test
a = 1 * 2; # a = 2
b = -1 * 2; # b = -2
c = a / b; # c = -1
d = c * 100; # d = -100
e = c + 1 * 100; # e = 0 (no multiplication precedence rules yet - should be a simple node reorder)
f = c + (1 * 100); # f = 99
This compiles into 44 bytes of bytecode and a stack of 9 floats, put into the struct from the previous post. some preliminary results in the editor (i9900k, gtx1080):
(note that on the 2nd run compilation is largely skipped)
i'm surprised that even with a clumsy interpreter i get better performance than i wanted; i guess my cpu likes that everything fits in a single cache line.
next i'll be adding some control flow, vectors (float3) and functions, and it should be usable.
anybody an idea how to have burst compile a jump table from a switch statement on a byte enum? i guess there's some code style to adhere to for ncpp to convert it into a jump table… it won't for now… it's the biggest slowdown in the interpreter. the alternative is arrays of function pointers, but i don't know if the optimizer will benefit from that.
10.000.000 executions of this in less than 3 ms in the editor; i wasn't expecting that, even for such a simple script.
edit: results at runtime are even more dramatic, 0.1ms in burst over 16 cores, for 10M executions of the above script:
i think i will have it support something like this; it is based on a tokenizer but there is no grammar (intentionally):
#
# burst script test 6
#
# - no declaring variables, this is not needed
# - NO SCOPE
# - no strings
# - no allocations
a = 1 * 2; # a = 2
b = -1 * 2; # b = -2
c = a / b; # c = -1
d = c * 100; # d = -100
e = c + 1 * 100; # e = 0 (todo - reorder nodes in parsetree)
f = c + (1 * 100); # f = 99
#
# functions
#
g = abs(c) * f + (max(a, b, c, d) * 4)
#
# conditions
#
w = (a > b) & (c < d) ^ (e > f); # note single char operators, executed left to right
w = !w;
#
# control flow
#
if(b > 0)
f = 16;
else
f = 4;
if(f = 4) {
h = 50 - min(1, 2, a, b, c, d, e, f) * 2;
i = 1;
}
else
{
h = 1
i = 1
}
while( a > b ) # with support for break and continue
{
b = b * 2;
a = b;
}
#
# jumps
#
@label1:
goto label1; # i like goto
jump_z(a, label1); # jump to label1 if a = 0
jump_gz(a, label1); # jump to label1 if a > 0
jump_nz(a, label1); # jump to label1 if a is not zero
jump_lz(a, label1); # jump to label1 if a < 0
#
# vectors
#
j = [1, 2, 3];
k = [3, 2, 1];
m = j * k;
n = [1, 2, 3] * 4;
#
# vector ops
#
o = normalize(n);
p = min(n); # should return smallest element
q = min(n, m); # should return smallest vector n or m
r = min_arg(n, m); # should return smallest element of each vector as a [,,]
#
# functions
#
function add1(a) { return a + 2; }
function add2(a) { a = a + add1(2); a = a + add1(1); return a; }
s = add2(1);
#
# signals
#
signal_a = 80808080;
set(signal_a, 1);
t = wait_for_signal(signal_a); # will yield until signal > 0
#
# yield
#
yield; # return execution until the next frame
yield a; # yield for n frames
return; # return indefinitely
#
# stack stuff (more than 32 (configurable) pushes will result in allocations by the interpreter)
#
pop a
push a
peek a
#
# constants
#
u = deltatime;
v = elapsedtime;
external functions can be added as compiled fragments of bytecode for the interpreter based on the code the compiler finds, or externally from some library of scripts, while some will be built in (math functions etc). float3 and quaternions will be compatible with vectors of size 3 resp. 4.
what do other people think i'm really missing? scoping inside a fragment like this is probably something i don't want/need/see much use for, as functions will be executed in a different context, but hey, maybe other people really want it… there are some performance issues in implementing that while it adds little for simple scripts.
Depends on the users of your scripts, but having jumps and goto and jz and … is probably not a good idea if those who use this are designers/high-level programmers.
if you are compiling a higher level language to this bytecode then my point is moot of course.
during startup it compiles, in managed code, something like:
f = 1 + 2 * 2 + 5;
into:
001 128 017 130 129 002 131 000
readable:
assign <f> fma #2 #2 #1 + #5 nop
it correctly re-ordered for multiplication while analysing the parse tree, then the optimizer used pattern recognition to replace 'a + b * c' (in any order) with 'fma b c a' :p. this results in 8 bytes of bytecode and only 3 switches, because it can 'prefetch' the parameters of an operation like fma (from 5 switches down to 1 for that operation). this is getting fun; i could also elide the whole call because the parameters are constant, but i don't want to make that optimization yet, i'll save it for last.
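the rewrite itself is just a pattern match on the parse tree; a minimal sketch of the idea (the Node/OpCode types here are made-up stand-ins, not the real compiler types):

// sketch only: minimal stand-ins for the parse tree
enum OpCode { Add, Mul, Fma, Const, Var }

class Node
{
    public OpCode Op;
    public Node Left, Right;
    public Node[] Args;
    public static Node Call(OpCode op, params Node[] args) => new Node { Op = op, Args = args };
}

// add(x, mul(b, c))  or  add(mul(b, c), x)   ->   fma(b, c, x)
static Node TryFuseMulAdd(Node n)
{
    if (n.Op != OpCode.Add || n.Left == null || n.Right == null) return n;

    Node mul = n.Left.Op == OpCode.Mul ? n.Left
             : n.Right.Op == OpCode.Mul ? n.Right
             : null;
    if (mul == null) return n;

    Node other = ReferenceEquals(mul, n.Left) ? n.Right : n.Left;
    return Node.Call(OpCode.Fma, mul.Left, mul.Right, other); // b, c, a
}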
guess i'll add if/then, for and while loops, as nobody i asked knew what jz meant…
burst blasts through this bytecode, and attaching the resulting scripts to entities with IBufferElementData gives me a fast and clean solution for having millions of entities with unique behaviour (if only i could render that on my 1080).
i have a hard time believing my own benchmarks; note that this is for 10M executions of the script in burst.
burst really likes scripting like this, the cache would probably glow red hot running it 10M times… now i need to find a nice method to get the data from IComponentData… i could point the stack at it and map fields without allocations or copies, but that seems nasty…
functions and control flow i don't really need right now, but they will follow.
Reading scriptfile: ai/scripts/test5
CODE:
a = 1 * -2; # -2
b = -1 * 2; # -2
c = a / b; # 1
d = c * 100; # 100
e = 1 + 2 * 2; # 5
f = 1 + 2 * 2 + 5; # 10
g = 1 + 2 / 6; # 1.3333
h = 1 + -2 / 6; # 0.6666
i = 1 - 2 * 2 + 5; # 2
j = 1 + 2 * 2 - 5; # 0
k = 1 +-2 * 2 + 5; # 2
COMPILE: 13,0137 ms
CODE SEGMENT: 83 bytes
001 128 129 005 003 130 000 001 131 003 129 005 130 000 001 132 128
004 131 000 001 133 132 005 134 000 001 135 017 130 130 129 000
001 136 017 130 130 129 002 137 000 001 138 017 130 144 129 000
001 140 018 130 144 129 000 001 141 018 130 130 129 002 137 000
001 142 017 130 130 129 003 137 000 001 143 018 130 130 129 002
137 000
ASSEMBLY:
set <a> #1 * - #2 nop set <b> - #1 * #2 nop set <c> <a> / <b> nop set <d> <c> * #100 nop set <e> fma #2 #2 #1 nop set <f> fma #2 #2 #1 + #5 nop set <g> fma #2 #0,166666672 #1 nop set <h> fsm #2 #0,166666672 #1 nop set <i> fsm #2 #2 #1 + #5 nop set <j> fma #2 #2 #1 - #5 nop set <k> fsm #2 #2 #1 + #5 nop
DATA SEGMENT: 17 floats
var 1 name = a value = -2 refcount = 2
var 2 constant value = 1 refcount = 18
var 3 constant value = 2 refcount = 28
var 4 name = b value = -2 refcount = 2
var 5 name = c value = 1 refcount = 2
var 6 name = d value = 100 refcount = 1
var 7 constant value = 100 refcount = 2
var 8 name = e value = 5 refcount = 1
var 9 name = f value = 10 refcount = 1
var 10 constant value = 5 refcount = 8
var 11 name = g value = 1,33333337 refcount = 1
var 12 constant value = 6 refcount = 1
var 13 name = h value = 0,6666666 refcount = 1
var 14 name = i value = 2 refcount = 1
var 15 name = j value = 0 refcount = 1
var 16 name = k value = 2 refcount = 1
var 17 constant value = 0,166666672 refcount = 4
BENCHMARK: running 10.000.000x
execution benchmark (single thread c#): 27.149 ms
execution benchmark (parallel burst): 0.1333 ms
the reference counts for constants are bugged… they are mainly used to remove divisions and replace the initial value with its inverse
through several steps the input code is parsed into a token tree, then an analyser checks and optimizes nodes according to some rules, after which the compiler outputs bytecode. The bytecode is then further optimized using pattern recognition. In this snippet the following things happen after parsing the token tree:
the analyser ensures order of operations: a = a + (a * (1 / 2) * 2)
the division is converted into a multiplication by its inverse, as the division was determined to be by a constant: a = a + (a * (1 * 0.5) * 2). note that the analyser uses reference counting to either replace the original value on the stack or to create a new constant value on the interpreter stack.
the analyser removes unneeded parentheses: a = a + (a * 1 * 0.5 * 2)
the optimizer kicks in, sees multiple multiplications and an addition, and replaces them: a = a + (mula(a, 1, 0.5, 2))
finally, the parentheses are no longer needed, as the multiplications are now a single function and operator precedence is not a problem anymore: a = a + mula(a, 1, 0.5, 2). the interpreter can now very efficiently produce an execution path at runtime in burst from the 9 bytes of code the script resulted in: 128 128 002 031 004 131 131 133 132. the assignment statement 'set' and all 'nop's are not actually needed to correctly execute any code the interpreter outputs.
This all works better than expected, and there's a lot more than this… i keep finding ways to improve the execution of the script by the interpreter.
The input/output declarations at the start of the script define the mapping for io vars; currently it maps directly with pointers,
but it should become optional to copy or reference directly. bytesize is not used to determine the parameter types, as those are inferred from their usage; it is used to determine whether all parameters can be set on starting the script using a single native memset, as they would be located in one block.
input <id> [offset] [bytesize]; // map identifier to memory offset from the data used to run the script on
output <id> [offset] [bytesize] // map identifier back
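to make that concrete, a rough sketch of the native side of such a mapping (the component, its fields and the matching input declarations are hypothetical, and i use a MemCpy here where the real thing may copy or just point):

// using Unity.Entities; using Unity.Collections.LowLevel.Unsafe;
// hypothetical component matching:  input speed 0 4;  input health 4 4;
public struct AgentData : IComponentData
{
    public float speed;   // offset 0, 4 bytes
    public float health;  // offset 4, 4 bytes
}

// both inputs form one contiguous 8 byte block, so they can be moved
// into the start of the script's data segment in a single copy
static unsafe void CopyInputs(ref BurstScriptExecutable script, ref AgentData component)
{
    fixed (float* dst = script.data)
    {
        UnsafeUtility.MemCpy(dst, UnsafeUtility.AddressOf(ref component), 8);
    }
}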
Next up is control flows and vector types.
Some results:
The following code compiles into 59 bytes and a stack size of 13 floats (this could be a lot smaller but it is not optimized yet); it takes 0.33ms on average to compile in single-threaded c#:
a = a + 1 * (1 / 2) * 2;
b = 2 * (2 * 2 * 2) * 2;
c = 2 - 2 - 2 - 2 - 2 - 2;
x = 1 & -2 | 1;
y = 1 & 2;
z = 1 & 2 & 3 & 4 & 5;
and results in the following benchmark:
[benchmark screenshots: 1st run / 2nd run]
that is 10M executions in 0.014s
just maybe i'll get scripting in parallel burst just as fast as, or faster than, doing it directly in c# without script or burst; that would be just awesome. a really large optimization is that burst makes a jump table of all the switches, whereas in plain c# i can't get it to compile like that.
note that for testing it evaluates expressions leading to non-referenced values, i.e. useless calculations are performed; this is for evaluation purposes only. in a release build you would never calculate any of this, as not a single value is returned via output.
# define a float4
f = (1 2 3 4);
# perform scalar * vector * scalar operation
g = 2 * f * 2;
COMPILE: 0,4084 ms
CODE SEGMENT: 15 bytes
001 128 132 133 134 135 000 001 136 133 005 128 005 133 000
ASSEMBLY:
set f 1 2 3 4 nop set g 2 * f * 2 nop
DATA SEGMENT: 12 floats
var 1 name = f [4] value = 1 2 3 4 refcount = 3
var 2 constant value = 1 refcount = 2
var 3 constant value = 2 refcount = 6
var 4 constant value = 3 refcount = 2
var 5 constant value = 4 refcount = 2
var 6 name = g [4] value = 4 8 12 16 refcount = 1
A vector is deduced from the data following an operation.
Most trouble was getting the optimizer to check the data sizes of variables while matching on bytecode patterns; the interpreter has very little knowledge of them other than that it works from the bytes and a float[] with data. each operation code > 128 is actually an offset (-128) into that table. maybe it's best to have code and data share the same fixed byte[] (with data using memory offset backward from the end index), as then for the same sized IComponentData the range of supported scripts can be larger, or the number of parameters/stack vars can be larger, for the same sized data block.
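in other words, fetching an operand is nothing more than this (a sketch, using the struct from earlier in the thread):

// sketch: operation codes > 128 index the float data segment rather than naming an operation
static unsafe float FetchOperand(ref BurstScriptExecutable s, byte b)
{
    return s.data[b - 128]; // offset (-128) into the data/constant table
}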
i'm not sure if it should be just 1 IScriptData with a preset size defined per project, or if i should have different IComponentDatas with different sizes… for me 192 bytes of shared code/data memory would probably be more than enough (i think i can get away with 64 bytes total for most of my usages; remember, a unique script per entity). chunks like the smaller ones, but i really want to test soon how much structure size influences performance.
next up:
constant vectors, could then ditch most of the scalars from the stack (only the '2' remains)
functions, math utilities mostly, anything to get smaller code for repeating tasks would be nice too… an option to map a few byte opcodes to native function pointers for flexibility, or is that way too much…
control flow, if/then, while etc.
small constant values should use unused operand and var-id opcode space; these would then not need to be fetched, but it would only work for small integers and booleans. the interpreter should be able to figure it out by using 1 byte as the index for the id operation (128 now)
booleans, it's very tempting to use one 64 bit datatype and intrinsics to manipulate it and make 64 bools available to the script for the cost of 8 bytes of memory (see the sketch after this list). boolean operations work, but they use a float as backing field…
lot of testing.
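as a sketch of that packed-boolean idea from the list above (names made up): 64 booleans backed by a single 64 bit field, manipulated with plain bit ops:

// sketch only: one ulong gives the script 64 boolean registers
public struct BoolRegister
{
    public ulong bits;

    public bool Get(int index) => (bits & (1ul << index)) != 0;

    public void Set(int index, bool value)
    {
        ulong mask = 1ul << index;
        bits = value ? (bits | mask) : (bits & ~mask);
    }
}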
it won't work on mobiles as i used sse intrinsics in the interpreter and plan to use them a lot more for vector math; i don't know if a neon code path would be nice or just a safe version for burst to figure out… too many questions, time to sleep and work tomorrow.
i guess my biggest question is how to organize this nicely for reuse. when it has a data stack and it shares chunk memory space with your entities, it seems very wasteful to have duplicates of data being copied to and from the script; pointing to everything would force unsafe code (it is now, but should not be later), and having properties on each script probably means generating code to compile with your project for easy data access if the only place the data is stored is in the script's stack…
CODE:
aa = 1 + 1 * (1 / 2) * 2;
ab = 2 * (2 * 2 * 2) * 2;
ac = 2 - 2 - 2 - 2 - 2 - 2;
ax = 1 & -2 | 1;
az = 1 & 2 & 3 & 4 & 5;
aw = 1 * 10 * 3 * (3 + 4 * 2);
CODE SEGMENT: 72 bytes
024 018 037 004 085 085 107 086 019 001 128 085 002 025 000 001
129 037 005 086 086 086 086 086 000 001 130 039 086 086 086 086
086 086 000 000 001 131 085 006 003 086 007 085 000 001 132 040
005 085 086 087 088 133 000 024 018 033 088 086 087 019 001 134
037 003 085 090 087 005 025 000
ASSEMBLY:
push ( mula / 1 1 0,5002 ) set aa 1 + pop nop set ab mula * 2 2 2 2 2 nop set ac suba 2 2 2 2 2 2 nop nop set ax 1 &- 2 |1 nop set az all * 1 2 3 4 5 nop push ( fma 4 2 3 ) set aw mula - 1 10 3 * pop nop
DATA SEGMENT: 7 floats, 28 bytes
var 1 name = aa value = 2 refcount = 1
var 2 name = ab value = 32 refcount = 1
var 3 name = ac value = -8 refcount = 1
var 4 name = ax value = 1 refcount = 1
var 5 name = az value = 1 refcount = 1
var 6 constant value = 5 refcount = 1
var 7 name = aw value = 330 refcount = 1
all nesting is converted into linear execution using a stack. this avoids all recursive paths in the interpreter, so there is no cost in the os maintaining function stacks for deep recursive calls; there is some extra stack use, but that is lightning fast compared to a function call nesting yet another compound iteration. from now on the interpreter follows a completely flat execution path.
constant values are reference counted and the most common are encoded directly in opcodes, saving data space while costing no additional code space. a future update will allow an extended instruction to load constant vectors from the interpreter; i bet that by analysing all scripts in a project the compiler can make a very neat list of the 255 vectors[n] that are used most often and make sense as constants encoded via an opcode; this will save a lot of memory in the data segment.
Hard to see in bytecode, but this is the result for a nested if/then, a while and a switch statement, flattened and with understanding of vectors of max size 4 x 63:
set aa 2 nop push ( aa - 1 ) jz @jmp_0 25 ( aa & pop nop ) ( set aa 42 nop set ab mula yield 42 1 1 1 nop ) jump 6 @jmp_1 ( @label_0 set aa 2 nop ) @label_1 set ab 1 nop push @label_3 ( 2 - 10 ) jz 23 @jmp_2 ( aa > pop nop ) ( set aa aa - 1 nop set ab ab + aa nop ) jump_back 29 @jmp_3 jz @label_2 13 @jmp_4 ( ab >= 826 ) ( set ac 2020,2020 nop jump 22 ) @label_4 jz 14 @jmp_5 ( ab >= - 34 ) ( set ac 1010,1010 nop jump 6 ) @label_5 set ac - 100 nop
and it's getting fast… i can't run a benchmark (next update) at the moment as i'm completely refactoring all the code (it became a bit messy) and i forgot some 'minor' thingy… indexers like . and [ ]. after that it's time to start thinking about packaging blobs and making it simple to use. we could separate code, data and stack, and depending on the task at hand their final memory location might be very important for maximized performance, so i really want decent control over layout and destinations in several configurations. you might want to run script code from a manager using only data from a chunk and a stack based on a local stackalloc [ ], or you might want something completely different.
Then finally, for now, the kicker: 2-n wide execution. with simple scripts, and very, very many of them, we are bound to be able to sort and select execution paths where only the data changes, and we should not have to do that every frame, akin to how the ecs systems organize themselves. we should be able to burst through that as long as scripts don't use conditional flow control (and even then, on an unlikely jump that mismatches, we could just take out that 1 path and run it single). the interpreter is flat for ease of coding that; i don't see any problem that isn't solvable, but i do see many fun ones.
My purpose was to build some kind of tool to run many changing behaviours as fast as possible, with the behaviour determined elsewhere, but i guess it will still be a while… there are many topics on maintaining data structures for changing behaviour, and they are all complicated because they want performance AND flexibility AND are prone to change during development, AND the rigid setups needed to run fast code through c# can be extremely time consuming and not very rewarding given the time spent. with a very fast scripting language i hope to solve that problem for my company, as i can have just some instructions at each node based on some higher level representation elsewhere. when code is deciding it can always use the higher level representations, but for stepping many small scripts each frame in chunks this seems a clean solution somewhere midway, with the benefit that you can completely pull out and replace its content without code changes; it all becomes data, separate from other processes. we will probably release a compiler/interpreter package to the store in a while for the price of a hamburger or something… we will also be bringing in a compiler specialist to help improve the compiler before releasing it, as we damn well want to be sure its output is always correct and/or validated; that won't be free i'm afraid.
I started making the parts that make it usable; its purpose is to change functionality on demand, either from input, ai or some server. while the interpreter got very fast, it will never be as fast as native code, and one big drawback of using a script interpreter for core functionality is performance: most parts of the scripts will be static, and running those through an interpreter would be a waste of cpu cycles, while hard coding them would make the code more complex if you want to keep supporting on-demand scripts.
For this reason the script compiler has some stages added that allow it to output burst-compatible c# code compiled from the scripts at design time. it will use function pointers to call the native burst code if the script was known at compile time.
This has several nice advantages: a) speed, it is much, much faster; b) we can treat all scripts equally regardless of what they are compiled into, resulting in clean and compact code.
We can just use the same method everywhere and still update parts during runtime without losing performance or getting complex code paths.
At some point a static repo is kept, which is scanned with reflection for scripts:
Then we can use them everywhere from a manager:
int res_hpc = mgr.Execute(script_id, data);
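roughly, something along these lines sits behind that call; a simplified sketch, not the actual generated code (class and method names are made up, and getting the native entry point via BurstCompiler.CompileFunctionPointer is an assumption here):

// using Unity.Burst; using AOT;
// sketch only: a hypothetical class generated at design time for one script
[BurstCompile]
public static unsafe class CompiledScript_42
{
    public delegate int ExecuteDelegate(float* data, int length);

    [BurstCompile]
    [MonoPInvokeCallback(typeof(ExecuteDelegate))]
    public static int Execute(float* data, int length)
    {
        // generated from e.g.:  d = a * b * c;
        data[3] = data[0] * data[1] * data[2];
        return 0;
    }

    // compiled once and cached; the manager can call Pointer.Invoke(dataPtr, length)
    // for scripts known at compile time, and fall back to the bytecode interpreter otherwise
    public static readonly FunctionPointer<ExecuteDelegate> Pointer =
        BurstCompiler.CompileFunctionPointer<ExecuteDelegate>(Execute);
}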
Performance is very enjoyable (note the warmup on the first run; each run is 1000x, with the time divided by 1000):
As expected, the interpreter is much slower, needing 0,141 µs for 1 iteration, while the native bursted version takes about 0,004 µs :). (EDIT: the benchmark is off… il2cpp replaced the whole script with a packed move… i forgot it applies constant analysis and i don't for testing… the real time is higher but still very much faster, as would be expected.) the interpreter will get faster by executing multiple equal scripts with different data wide, but that will probably cap at an 8x speedup, and depending on the jumps in the script it might take some instruction padding (nops) to keep all lanes in sync, if that is even possible (many conditional jumps will be hard to solve and it probably will be useless so…); otherwise it will drop paths, forcing them back into single execution.
It might seem weird at first glance to go through all this trouble only to end up with c# code again, but this ensures we always have the same behaviour and, most importantly, there is no drawback to using the script in general when only small parts are expected to change during runtime. in a weekly application update an important script might become hardcoded, who knows, while in the meantime a lot of clients on an older code base still execute the same functionality, although with an (in real life) negligible performance hit.
The script used in this test is very simple, 4 float multiplication statements; the compiler stage to c# isn't very finished as i wanted to see it work first :) also the handling of data could use some love. i guess we need to play a little with it in order to find the nicest way to use it; ideally it would just read/write into a component's data and just use a pointer to it, but there are more use cases… this needs some time toying with it.
the package, when compiled into 1 combined segment, looks like this, but you can have each segment separate, so if you want you can have your code in some global repo and the data tightly packed in componentdata (or the other way around). to save space the interpreter can use a stack local to itself so as to fit more data segments into 1 chunk (you can't use yield in that situation because it uses the stack to preserve state). we want it to be as flexible as possible, but at the same time we don't want 1 big mess, so how to do this cleanly for all users is kinda open atm.
public struct BlastPackageInfo
{
    public BlastPackageMode package_mode;
    public byte allocator;
    public byte unused_1;
    public byte unused_2;
    public ushort code_size;
    public ushort stack_size;
    public ushort data_size;
    public ushort package_size;
}

public unsafe struct BlastPackage96
{
    public BlastPackageInfo info;
    public fixed byte segment[96];
}
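for illustration, say the combined segment is laid out as [ code | data | stack ] (that ordering is an assumption, not confirmed above); addressing the parts is then just offset math on the info block:

// sketch: offsets into BlastPackage96.segment for a combined package, assuming code -> data -> stack
static int CodeOffset(in BlastPackageInfo info)  => 0;
static int DataOffset(in BlastPackageInfo info)  => info.code_size;
static int StackOffset(in BlastPackageInfo info) => info.code_size + info.data_size;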
I see, so it should be very easy to save it all, be it runtime code or not.
I wonder:
Now that you can import runtime burst assemblies, couldn’t this be done in an unloaded assembly at runtime and then load it? I think it was introduced in 2021.1?
I saw there is now support for adding unloaded assemblies, but i don't see a way to compile with burst on a client. we use csc.CompileAssemblyFromSource on windows (some testing path); we could possibly use CompilerServices in mono builds, but it will be plain c#. we have code paths that use the .net c# compiler to generate il at runtime from the script's compiled ast instead of bytecode, but for now that only works on windows, and compile time is so slow that it gives management trouble we don't want now: in some cases it takes seconds to compile a script at runtime, and to justify that it would need to execute said script many 1000s of times. it's gonna be a tradeoff, and at this point in the project we are looking at how to have a consistent interface for all of this.
to be honest we had assumed that using the burst compiler at runtime is impossible and haven't even looked into it.
update: if i only assign a constant in a test script, the difference between interpretation and native is only about 15x. with wide execution we can probably get very close. it will be interesting to see what normal managed c# would do with the same script: can wide execution of small scripts be faster than managed c#? i think/hope so. but that test will take some time, as it will need most components finished.