I’m processing a bunch of texture data (a NativeArray of half4s) in a Job. I did a deep profile on it and discovered that something like 70% of my processing time goes to implicitly converting half4s to float4s, because halfs don’t have any arithmetic operators (+, -, *, /) defined.
I’d prefer not to write bitwise float operations myself. But I guess it’s either that, converting the whole array during AsyncGPUReadback, or using float4 textures on the GPU across the board.
Anyone with enough experience to weigh in? Am I on a fool’s errand?
Yeah, CPUs have no hardware half support. You’re essentially emulating it in software, and that’s pretty slow. I think the idea is that you do all your math in float and invoke conversion between half and float as rarely as possible.
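For a sense of what "emulating it in software" involves, here is a rough, language-neutral sketch in C of widening an IEEE 754 16-bit float to a 32-bit float, plus a loop that unpacks a whole buffer once so later math can stay in float. This is not Burst's or Unity.Mathematics' actual code; `half_to_float` and `halves_to_floats` are made-up names for illustration, and the input is assumed to be raw binary16 bit patterns.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen one IEEE 754 binary16 bit pattern to a float. */
float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16; /* sign bit moves to bit 31 */
    uint32_t exp  = (h >> 10) & 0x1Fu;             /* 5-bit exponent, bias 15 */
    uint32_t man  = h & 0x3FFu;                    /* 10-bit mantissa */
    uint32_t bits;

    if (exp == 0x1Fu) {
        /* Infinity or NaN: all-ones exponent, keep the mantissa payload. */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {
        /* Normal number: rebias the exponent from 15 to 127 (difference 112). */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man != 0) {
        /* Subnormal half: shift until the implicit leading 1 appears. */
        uint32_t e = 113u; /* 127 - 15 + 1 */
        while ((man & 0x400u) == 0) {
            man <<= 1;
            e--;
        }
        man &= 0x3FFu; /* drop the now-implicit leading 1 */
        bits = sign | (e << 23) | (man << 13);
    } else {
        /* Signed zero. */
        bits = sign;
    }

    float f;
    memcpy(&f, &bits, sizeof f); /* reinterpret the bits without aliasing issues */
    return f;
}

/* Hypothetical helper: unpack `count` half components in one pass, so the
   hot loop that follows only ever touches floats. */
void halves_to_floats(const uint16_t *src, float *dst, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        dst[i] = half_to_float(src[i]);
    }
}
```

The branches and the normalization loop are the reason doing this per component, per read, adds up. Running the conversion once per texel and keeping the result in float keeps that cost out of the inner loop.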
There is native hardware support for 16-bit floats in very specific Intel CPUs, as part of AVX-512 (see the AVX-512 article on Wikipedia).
But until Burst supports AVX-512 and its numerous instruction set extensions (LLVM, the compiler behind Burst, does support it), your best bet is to convert your half vectors to float vectors explicitly at the very beginning of your job, perform your math on floats, and convert back to half vectors when writing back the results.
Also, F16C (the instruction set extension that adds hardware support for half<->float conversion) is tied to AVX2 support, in Burst at least. Without AVX2, conversion goes up from ~4 clock cycles to at least 30. Before Burst 1.5 that was the case even with AVX2 supported, so make sure you have at least that Burst version installed (yay me - I reported that bug).
If you’re not compiling for x86, I THINK there’s nothing you can do about it at all. But yeah - I only know x86, so I really don’t know for sure.
My saving grace is that I don’t actually have to write back to the texture; I just have to read from it, a lot, at semi-random locations. (Its half4s are simulated ocean-wave data, and my job is doing buoyancy.)
Have you tested doing the conversion during readback? I don’t know how that’s implemented, but if the conversion happens on the GPU it might use the fast hardware conversion mechanism used by sampling, although it would double the readback bandwidth and I have no idea how that tradeoff would go.
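To put rough numbers on the bandwidth side of that tradeoff, here is a small sketch in C that computes the readback size of an RGBA texture at a given bytes-per-channel. `readback_bytes` is a made-up helper, and the 256x256 size used below is only an assumed example, not a figure from this thread.

```c
#include <stddef.h>

/* Hypothetical helper: bytes needed to read back a width x height texture
   with 4 channels (RGBA) at the given size per channel. */
size_t readback_bytes(size_t width, size_t height, size_t bytes_per_channel)
{
    return width * height * 4 * bytes_per_channel;
}
```

For an assumed 256x256 texture, `readback_bytes(256, 256, 2)` gives 524288 bytes (512 KiB) for half4 texels, and `readback_bytes(256, 256, 4)` gives 1048576 bytes (1 MiB) for float4 texels. Whether paying for that extra transfer beats converting on the CPU is something only profiling can answer.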