I’ve been playing fast and loose with the conversions between these types, both in Burst jobs and outside them. Most APIs and old code still use UnityEngine.Vector3 and friends. When writing a Burst job, sometimes I’ll reference a function that uses the old types, or the input/output (a mesh, for example) will use the old types, or I’ll just forget.
So my question is: is it OK to assume that conversions between these types have no penalty (in a release build), and that all the compilers (Burst, Mono JIT, IL2CPP, .NET 5 if that ever actually happens) can optimize them equally well? Or should I be worried and try to use only one of them? (In which case I’d pretty much have to use Vector3 et al. except in very specific circumstances, since so many APIs rely on it.)
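For concreteness, here’s the kind of mixing I mean (a minimal sketch; the job and its fields are made up for illustration):

using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;
using UnityEngine;

[BurstCompile]
struct OffsetJob : IJob
{
    public NativeArray<Vector3> positions; // old UnityEngine type, e.g. straight from a mesh
    public float3 offset;                  // Unity.Mathematics type

    public void Execute()
    {
        for (int i = 0; i < positions.Length; i++)
        {
            float3 p = positions[i];   // implicit Vector3 -> float3
            positions[i] = p + offset; // implicit float3 -> Vector3
        }
    }
}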
Thanks!
Conversion in Burst and C# code is free.
I’d think that many operations on Vector3 types would be performed with SIMD instructions in Burst. Although that might very well not be the case: types that are not a full 128 bits wide are handled very poorly by the LLVM compiler (which is what Burst hands your C# code over to). float3 gets some special treatment from Burst; I don’t know about Vector3, but I’d say it does not.
Personally, I’d do the math in float3 and convert back and forth between float3 and Vector3 at the boundaries where possible.
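In code, that boundary pattern might look like this (a sketch relying on the implicit operators that Unity.Mathematics ships for converting to and from the UnityEngine types):

using Unity.Mathematics;
using UnityEngine;

static class MathBoundary
{
    // Convert once at the API boundary, then do the actual math in float3.
    public static Vector3 Reflect(Vector3 v, Vector3 n)
    {
        float3 vf = v, nf = n;                  // implicit Vector3 -> float3
        return vf - 2f * math.dot(vf, nf) * nf; // implicit float3 -> Vector3 on return
    }
}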
I mean conversion between a Vector3 and a float3 and back. The op_implicit call should be free (except maybe in IL2CPP, which has a branch in every single method call). The question is whether the various compilers are smart enough to figure that out.
When it’s sitting in registers, doesn’t it just extend them?
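For reference, that op_implicit boils down to a plain constructor call, something like this (paraphrased; see the conversion operators in the Unity.Mathematics source, this is not the verbatim code):

public static implicit operator float3(Vector3 v) { return new float3(v.x, v.y, v.z); }
public static implicit operator Vector3(float3 v) { return new Vector3(v.x, v.y, v.z); }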
Yeah, I also meant Vector3 to float3 conversion and back.
One of the top priorities of optimizing compilers is eliminating unnecessary memory reads and writes. First, since Vector3 is a value type, the hardware cannot do anything else (well, anything relevant) than load three floats onto the stack to load the initial Vector3/float3 from memory. Then each float is loaded into a register and written back onto the stack one by one, in the same order (v3.x = f3.x; v3.y = f3.y; v3.z = f3.z;). That is a dead operation - it will be optimized away even by the Mono JIT :).
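As a concrete illustration, a roundtrip like the following is exactly the kind of dead copy that gets eliminated (a sketch; RoundTrip is a made-up name):

using Unity.Mathematics;
using UnityEngine;

static class ConversionCost
{
    // The intermediate float3 and the field-by-field copies are dead stores:
    // after optimization this should compile down to returning the input unchanged.
    public static Vector3 RoundTrip(Vector3 v3)
    {
        float3 f3 = v3; // Vector3 -> float3
        Vector3 back;
        back.x = f3.x;
        back.y = f3.y;
        back.z = f3.z;
        return back;
    }
}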
If IL2CPP really does add stuff to all methods, you can at least do this:
unsafe
{
    float3 myfloat3 = *(float3*)&myVector3;
}
Much uglier than an implicit operator, though. I thought the branches you’re talking about were related to null checks and bounds checks, which can be turned off (and might not even occur with structs)?
The problem is probably related to exactly that - getting them into and out of registers, I mean. Loading and/or storing 4 or even 8 floats (128 or 256 bits, respectively) can be done in a single hardware instruction. Doing the same with 3 floats takes at least one extra instruction with at least 3 clock cycles of latency on the load, and the same again on the store; on top of that, the compiler has to keep track of the fact that it is only allowed to write back 3 floats (and that reading 4 may cause a memory access violation). I guess the compiler engineers decided that this kind of non-loop vectorization isn’t worth it and that instruction-level parallelism beats SIMD in this case, so they didn’t bother optimizing the code gen for it.
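If you control the layout, one workaround that follows from this is padding to a full 128 bits yourself, e.g. storing positions as float4 and ignoring w (a sketch of the idea, not a measured recommendation):

using Unity.Mathematics;

struct PaddedPosition
{
    // 16 bytes wide, so loads and stores map to single 128-bit moves -
    // at the cost of 4 wasted bytes per element.
    public float4 packed; // xyz = position, w unused

    public float3 Position
    {
        get { return packed.xyz; }
        set { packed = new float4(value, 0f); }
    }
}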
I can see where you’re coming from, though. I had the same question when looking at some code gen for interactions between the byte8 and bool8 types in my Unity.Mathematics extension library, which uses low-level hardware intrinsics. It was horrendous, even including stuff like movq xmm0, xmm0 (move 8 bytes from register xmm0 to xmm0, zeroing out the upper 8 bytes... even though only the lower 8 bytes were written back to memory), among other things - all of which completely goes away when you fill up 16 bytes. So even when you hand-code the conversion of your custom types into SIMD registers and back, with hand-coded hardware instructions in between, LLVM doesn’t optimize the code whatsoever. I’ve seen it with my own eyes, and someone from the Burst team told me so, too - why that is the case, though, I don’t know. I assume that loop vectorization, which always fills up an entire SIMD register, is way more important than this kind of vectorization, along with the reasons I mentioned above (overhead => ILP > SIMD).