Support externally quantised int8 ONNX models in Sentis

Hi! Great to see the progress in Sentis since the last time I used it around a year ago. I’ve now started work on a large project using it.

However, I am unable to import my uint8 pre-quantised ONNX models, as the DequantizeLinear, DynamicQuantizeLinear and QuantizeLinear operators are not (or no longer) supported by Sentis! I definitely used models with these operators in an earlier version (1.3.0-pre.2), but for some reason support has been removed (indeed, even the docs for 1.3.0-pre.3 list these operators as unsupported). Why is this?

I know Sentis has its own model quantisation ability now, and that’s great. However, quantisation is a complex topic and a whole research subfield of its own. I’m not sure which method Sentis uses (it would be nice if the docs mentioned it!), but I suspect it is either dynamic quantisation (as there is no way to provide calibration data) or the weights are simply dequantised when loaded into memory. Either way, it certainly isn’t calibrated static quantisation, which is what I want: static is substantially faster than dynamic (33% faster in my experiments in Sentis back when it was supported). Static quantisation in ONNX requires the currently unsupported DequantizeLinear and QuantizeLinear operators (the QDQ format), or the alternative QOperator format (QLinearConv etc.), which is also unsupported.
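To make the request concrete, this is roughly how such a model gets produced outside Unity; a minimal sketch assuming onnxruntime’s quantization tooling, with placeholder file names, input name and shapes (the random batches stand in for real calibration data):

```python
# Sketch: static quantisation of an ONNX model in QDQ form, using
# onnxruntime's quantization API. File names, input name and shapes are
# placeholders; the random batches stand in for real calibration data.
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantFormat,
    QuantType,
    quantize_static,
)

class PlaceholderCalibrationReader(CalibrationDataReader):
    """Feeds representative input batches to the calibrator."""
    def __init__(self, batches, input_name):
        self._feeds = iter([{input_name: batch} for batch in batches])

    def get_next(self):
        return next(self._feeds, None)

# Replace with real samples from your dataset for meaningful calibration.
batches = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(16)]

quantize_static(
    "model_fp32.onnx",
    "model_int8_qdq.onnx",
    calibration_data_reader=PlaceholderCalibrationReader(batches, "input"),
    quant_format=QuantFormat.QDQ,  # emits QuantizeLinear/DequantizeLinear pairs
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
)
```

It is exactly the QuantizeLinear/DequantizeLinear pairs emitted here that Sentis currently refuses to import.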

Advantages to supporting pre-quantised ONNX models:

  • There are lots of quantised ONNX models available on HuggingFace - it’s a lot easier to download one of these than to download the full version and quantise it yourself, and they will also likely perform better.
  • Sentis quantisation requires importing the ONNX file, then quantising locally, both very intensive operations (particularly with regard to RAM). For really large models this might be impossible for most folks due to limited RAM - it’s more feasible to quantise a big model on some workstation and download the resulting small ONNX file.
  • Quantisation is sort of an art… there really are a lot of papers on how best to quantise neural networks (particularly for static quantisation, which is the most efficient and so very relevant for in-game use). For big models it can make a huge difference. I have a setup on a workstation where I quantise models in different ways and compare their speed/accuracy (roughly as in the sketch after this list), so I can select the one which gives the best speed/accuracy trade-off, and I suspect there are quantised models on HF which are similarly carefully crafted.
  • It was previously supported…? So surely it shouldn’t be hard to add back?
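For reference, the speed comparison mentioned in the third bullet looks roughly like this; a minimal sketch assuming onnxruntime, with placeholder model paths and input shape:

```python
# Sketch: comparing mean inference latency of an fp32 model against its
# quantised counterpart with onnxruntime. Paths and input shape are placeholders.
import time
import numpy as np
import onnxruntime as ort

def mean_latency_ms(model_path, feed, runs=100):
    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    for _ in range(10):  # warm-up runs, excluded from timing
        session.run(None, feed)
    start = time.perf_counter()
    for _ in range(runs):
        session.run(None, feed)
    return (time.perf_counter() - start) / runs * 1000.0

feed = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
for path in ("model_fp32.onnx", "model_int8_qdq.onnx"):
    print(f"{path}: {mean_latency_ms(path, feed):.2f} ms")
```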

Please let me know if I’ve made a mistake, or if there’s some workaround which lets me use my uint8 quantised ONNX files in Sentis. At the moment it seems my only option is to import the full version and quantise locally, which does work fairly well, but my pre-quantised ONNX files would be faster and/or more accurate, as they have been statically quantised, calibrated to the dataset and benchmarked!

Supporting DequantizeLinear, QuantizeLinear and DynamicQuantizeLinear is really important in my opinion.

Cheers,
Z

EDIT: I am using Unity 6.0 and Sentis 2.1.1, but I guess it doesn’t matter much, as the operators are listed as unsupported in every version of the docs.

Hi ladismad,

Thank you for your message. After careful consideration, we decided not to support importing quantized ONNX models. There were support considerations, and performance was not always up to par. Using our own API provides more flexibility and allows us to achieve the performance speedup from quantization.

For that reason, the operators DequantizeLinear, DynamicQuantizeLinear and QuantizeLinear are not supported.

Our suggested workflow is to import the unquantized model and quantize it with the Sentis API, as described in Quantize a Model | Sentis | 2.1.1.

I understand that this may not be a suitable solution from your point of view. I will bring your concerns to the Sentis team.

Viviane

Which quantized models on HuggingFace would you most like to see supported?

@montplaisir Hey, thank you for the reply, and for the information!

There isn’t a particular HuggingFace model I’m interested in, as I’m working with a custom model which I have quantized myself. If you look at ONNX models on HF, there are sometimes quantized versions provided, such as here: sentence-transformers/all-mpnet-base-v2 at main. To be fair, there aren’t a huge number of quantized ONNX models on HF.

EDIT: Intel/distilbert-base-uncased-distilled-squad-int8-static-inc · Hugging Face is a good example of a ‘high effort’ quantized model. It uses static quantization with dataset calibration, and the authors evaluate it against the original, so it’s likely to perform better than a model quantized in Sentis without any calibration (though it’s unclear by how much).

My assumption was that it should be easy to convert a quantized ONNX model to the .sentis format, given that Sentis already supports quantized models, but I lack insight into the internals, so I appreciate it may be more complicated than it looks.

It would just be a nice feature to have, as quantizing large models within the editor is a bit problematic at the moment (I had to reboot my laptop to free enough RAM to avoid crashing when importing and quantizing a 1.7 GB fp16 model), but it’s not a must-have if it’s difficult to implement.

Thanks!