I loaded a 1.6GB model in float16 format, but the weights in the asset folder come out to 3.3GB.
It looks like they are being stored as float32 instead of float16.
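Here is the back-of-the-envelope math that makes me think so (the parameter count below is just inferred from the file size, I have not checked the actual model):

    # 1.6 GB at 2 bytes per float16 weight implies roughly 0.8B parameters.
    params = 1.6e9 / 2

    size_fp16_gb = params * 2 / 1e9   # ~1.6 GB
    size_fp32_gb = params * 4 / 1e9   # ~3.2 GB, which lines up with the ~3.3 GB on disk

    print(f"fp16: {size_fp16_gb:.1f} GB, fp32: {size_fp32_gb:.1f} GB")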
Correct, fp16 is not supported at the moment; only fp32 inference and storage are available.
Not supporting fp16 would be a deal breaker for me personally: fp32 takes twice the space and is slower on some GPUs.
My practical example would be running Stable Diffusion. That is about 5GB in float32 format and 2.5GB in float16 format, while most people only have 4GB of VRAM, so you can see why this would be important.
I see this is a bit of a catch-22, since float16 runs very slowly on CPU. But honestly, people without a GPU (and GPUs all handle float16) aren't going to be able to run neural networks at an acceptable speed anyway.
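If you want to see the CPU side of that catch-22, here is a quick NumPy timing sketch (NumPy has no fast fp16 kernels, so the exact ratio depends on the machine, but fp16 matmuls on CPU are typically far slower than fp32):

    import time
    import numpy as np

    def bench(dtype, n=512, reps=5):
        # Time an n x n matmul a few times and return the average seconds per run.
        a = np.random.rand(n, n).astype(dtype)
        b = np.random.rand(n, n).astype(dtype)
        start = time.perf_counter()
        for _ in range(reps):
            a @ b
        return (time.perf_counter() - start) / reps

    print(f"fp32 matmul: {bench(np.float32):.4f} s")
    print(f"fp16 matmul: {bench(np.float16):.4f} s")  # no BLAS path for fp16, so much slower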
Actually, for large language models an even better storage solution is uint8, which takes 25% of the storage space. The weights can be dequantized and run as either float16 or float32 with barely any degradation, or run directly on the GPU as uint8, which can be slower since the dequantization then happens at inference time.
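To be concrete, this is the kind of uint8 scheme I have in mind (a simple per-tensor affine quantization sketched in NumPy; real quantizers usually work per-channel or per-block, so treat it as illustrative only):

    import numpy as np

    def quantize_uint8(w):
        # Map a float tensor to uint8 plus a (scale, zero_point) pair.
        lo, hi = float(w.min()), float(w.max())
        scale = (hi - lo) / 255.0
        q = np.round((w - lo) / scale).astype(np.uint8)
        return q, scale, lo

    def dequantize(q, scale, zero_point, dtype=np.float16):
        # Recover an approximate float16/float32 tensor for inference.
        return (q.astype(np.float32) * scale + zero_point).astype(dtype)

    w = np.random.randn(1024, 1024).astype(np.float32)
    q, scale, zero_point = quantize_uint8(w)
    w_hat = dequantize(q, scale, zero_point)
    print("bytes per weight: 4 (fp32) -> 1 (uint8)")
    print("max abs error:", np.abs(w - w_hat.astype(np.float32)).max())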
One caveat: we do not load all weights onto the GPU all the time.
VRAM usage is more or less equal to the largest set of concurrently live tensors, so nowhere near 5GB.
Going fp16 will save you some VRAM, but probably not a considerable amount; you would definitely save on disk space.
As far as performance goes, expect roughly a 10-20% speedup.
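To illustrate what 'largest concurrent tensors' means in practice, a toy estimate (the layer shapes below are made up for illustration, not taken from any real model):

    import numpy as np

    # Hypothetical per-layer weight shapes, purely illustrative.
    layers = [(4096, 4096), (4096, 16384), (16384, 4096)]
    bytes_per = {"fp32": 4, "fp16": 2}

    for name, nbytes in bytes_per.items():
        total = sum(np.prod(shape) for shape in layers) * nbytes
        # If weights are moved to the GPU layer by layer, peak VRAM tracks the
        # largest single layer (plus activations), not the full model size.
        peak = max(np.prod(shape) for shape in layers) * nbytes
        print(f"{name}: total {total / 1e9:.2f} GB, rough peak {peak / 1e9:.2f} GB")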
uint8 math would be better in all regards, but we haven’t seen many models that work in that mode.
FYI, fp16 is not out of consideration; we might add it in a future release if there is enough demand for it.
Hey all, just wanted to let you know that we are working on a fix for this, tracked as issue 29. Stay tuned.