Burst auto-vectorisation can be fraught. That comes partly from working with loops within loops - the auto-vectoriser isn't transparent about how it handles them. So I started looking at Unity.Mathematics, which the Unity team suggests for vectorising code explicitly, assuming it would have what I need.

But the only 16-byte-wide type is bool4x4, and it provides only logical ops, no arithmetic ones. (I could write an extension method or two, but I'm not sure whether I'd need to explicitly support different architectures in each?)

What I'd really like is a greater-than operation that works across an (ideally) 16-byte / 128-bit-wide vector, solving my problem and covering most platforms without platform-specific code blocks on my part. I've looked into the Unity.Mathematics sources (*.gen.cs) and didn't see any architecture-specific blocks there, but maybe I have to implement something like that myself.

I realise Mathematics is designed to look like HLSL (of which I approve), but it would be really nice if there were a byte4x4 type.

Any ideas, suggestions, and recommendations on how you would do this are welcome!

It's to accelerate a neighbour / adjacency-matrix check while walking a graph. The current spec has up to 10 neighbours per node, and I primarily need a greater-than op across all 10 (padded to 16) entries. I could drop the spec to 8, but since 128-bit / 16-byte-wide buses with 16-lane byte ops are possible under the hood, and I need only 1 byte per entry, a 16-component op seems right. I just need a sane way to do this… I think intrinsics or auto-vectorisation are the only viable approaches. I need to know which of the 10 neighbours has the greatest value… perhaps there is no single intrinsic for that.

EDIT: OK, what I'm looking for is termed a horizontal op, and it isn't a good fit for SSE/AVX etc. One can shuffle things around a bit to build a solution from intrinsics, but that carries additional overhead.
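For anyone landing here later, the shuffle-based horizontal reduction can be sketched with raw SSE2 intrinsics (C here for illustration; the same instructions are exposed through Burst's `Unity.Burst.Intrinsics.X86.Sse2`). `hmax_epu8` and `argmax_epu8` are hypothetical helper names, a sketch rather than anyone's actual implementation:

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Horizontal max of 16 unsigned bytes: repeatedly fold the vector onto
   itself with byte shifts and take the element-wise max each time. */
static uint8_t hmax_epu8(__m128i v)
{
    v = _mm_max_epu8(v, _mm_srli_si128(v, 8));
    v = _mm_max_epu8(v, _mm_srli_si128(v, 4));
    v = _mm_max_epu8(v, _mm_srli_si128(v, 2));
    v = _mm_max_epu8(v, _mm_srli_si128(v, 1));
    return (uint8_t)_mm_cvtsi128_si32(v);
}

/* Index of the (first) maximum lane: broadcast the max, compare all
   lanes for equality, and take the lowest set bit of the mask. */
static int argmax_epu8(__m128i v)
{
    __m128i max_bcast = _mm_set1_epi8((char)hmax_epu8(v));
    int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, max_bcast));
    return __builtin_ctz(mask); /* GCC/Clang builtin */
}
```

The fold costs four shift+max pairs plus the broadcast/compare for the index, which is the "additional overhead" mentioned above - still far cheaper than a scalar loop over 16 lanes in most cases.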

I have written an extensive library that extends Unity.Mathematics (link in my signature).

Among the extensions are, of course, (s)byte16 and even (s)byte32.

It uses Burst's hardware intrinsics and compiles down to optimal code as long as you use 128- and 256-bit vectors, respectively. The other sizes are still almost optimal, just not 100% optimal - that's just LLVM. I suggest using the 256-bit types: if AVX2 is not supported, the code falls back to performing the same operation on two 128-bit vectors.
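To illustrate the shape of that fallback (this is not the library's actual code, just a sketch of the general pattern): a 32-byte operation is simply the same 16-byte operation applied to both halves.

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* 32-lane unsigned byte max done as two 16-lane SSE2 ops - the shape a
   256-bit operation takes when AVX2 is unavailable. */
static void max_epu8_x32(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m128i lo = _mm_max_epu8(_mm_loadu_si128((const __m128i *)a),
                              _mm_loadu_si128((const __m128i *)b));
    __m128i hi = _mm_max_epu8(_mm_loadu_si128((const __m128i *)(a + 16)),
                              _mm_loadu_si128((const __m128i *)(b + 16)));
    _mm_storeu_si128((__m128i *)out, lo);
    _mm_storeu_si128((__m128i *)(out + 16), hi);
}
```

Since vertical ops like this have no cross-lane dependencies, the split costs essentially nothing beyond issuing two instructions instead of one.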
PLEASE NOTE: I only used x86 intrinsics plus a managed C# implementation, so the library won't help you if you need to compile down to anything else.
PLEASE NOTE: There is actually no hardware support for greater-than comparisons of UNSIGNED byte types. What you'll actually see is (sbyte)(x ^ (1 << 7)) > (sbyte)(y ^ (1 << 7)) - just so that you're not surprised when you read the assembly.
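That XOR trick can be written out directly with intrinsics, if anyone wants to see why it works: flipping the top bit maps unsigned [0, 255] onto signed [-128, 127] in an order-preserving way, so a signed compare on the biased values yields the unsigned result. `cmpgt_epu8` is a hypothetical name for this sketch:

```c
#include <emmintrin.h> /* SSE2 */

/* Unsigned byte greater-than built from the signed compare: bias both
   operands by 0x80, then PCMPGTB gives the unsigned ordering. */
static __m128i cmpgt_epu8(__m128i x, __m128i y)
{
    const __m128i bias = _mm_set1_epi8((char)0x80);
    return _mm_cmpgt_epi8(_mm_xor_si128(x, bias), _mm_xor_si128(y, bias));
}
```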

PS: bool4x4 actually, at an assembly level, uses 4 SIMD registers. My byte4x4 type does, too, just to be consistent.

Good to see someone working on this, because I feel that right now, for many CPU applications, Mathematics is far from ready, given what AVX/SSE/NEON are capable of under the hood. Current adoption of that superset (or, more appropriately, the union of those instruction sets) appears minuscule.

Once it becomes absolutely essential, I will be giving your library a try. Thanks!