Could you please explain how the original Whisper tiny model is exported to Sentis format?

Seto1 · April 4, 2024, 7:42am

Could you please explain how the original Whisper tiny model is exported to Sentis format?
Thank you.

Seto1 · April 4, 2024, 11:33am

I know that I can use Optimum to export the encoder and decoder, and I confirmed that the Sentis models can be replaced by those ONNX files.
What I’d like to know is how the LogMelSpectro model is exported.

alexandreribard_unity · April 4, 2024, 4:01pm

The trick is to be able to export the STFT (the ONNX exporter cannot export it directly):

import torch
import whisper
from whisper.audio import mel_filters
from conv_stft import STFT

class LogMelNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Mel filterbank taken from Whisper
        self.filters = mel_filters('cpu', whisper.audio.N_MELS)
        # Convolution-based STFT from conv-stft
        self.logmelmodel = STFT(win_len=400, win_hop=160, fft_len=400)

    def forward(self, audio):
        magnitudes = self.logmelmodel(audio)
        mel_spec = self.filters @ magnitudes

        # Same log scaling and normalization as whisper.log_mel_spectrogram
        log_spec = torch.clamp(mel_spec, min=1e-10).log10()
        log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
        log_spec = (log_spec + 4.0) / 4.0
        return log_spec
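
For readers wondering why a convolution-based STFT helps here: each STFT frame is a set of dot products between the frame and fixed cosine/sine bases, so the whole transform can be written as a strided 1-D convolution with constant kernels, and convolution is a standard ONNX operator. The following is a minimal pure-Python sketch of that idea only. It is not the conv-stft implementation, it omits the window and the padding, and the function names are hypothetical.

```python
import cmath
import math

def dft_kernels(n_fft):
    """Fixed cosine and negative-sine kernels for bins 0..n_fft//2 (hypothetical helper)."""
    bins = n_fft // 2 + 1
    real = [[math.cos(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
            for k in range(bins)]
    imag = [[-math.sin(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
            for k in range(bins)]
    return real, imag

def conv_stft_power(signal, n_fft, hop):
    """Power spectrogram computed as strided dot products with fixed kernels.

    This is what a 1-D convolution with stride=hop does; no window, no padding.
    """
    real, imag = dft_kernels(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    power = []
    for k in range(len(real)):
        row = []
        for t in range(n_frames):
            frame = signal[t * hop : t * hop + n_fft]
            re = sum(w * x for w, x in zip(real[k], frame))
            im = sum(w * x for w, x in zip(imag[k], frame))
            row.append(re * re + im * im)
        power.append(row)
    return power  # indexed as power[bin][frame]

def direct_dft_power(frame, k):
    """Reference value: squared magnitude of bin k from the DFT definition."""
    n = len(frame)
    value = sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, x in enumerate(frame))
    return abs(value) ** 2

# A sine that completes 2 cycles per 8 samples lands in bin 2 when n_fft = 8.
signal = [math.sin(2 * math.pi * 2 * i / 8) for i in range(16)]
spec = conv_stft_power(signal, n_fft=8, hop=4)
```

In an exportable model the same kernels would be stored as constant convolution weights, with the analysis window presumably folded into them.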

Seto1 · April 4, 2024, 4:16pm

Thank you. I found your previous post as well.
But I ran into the following error:

ValueError: not enough values to unpack (expected 2, got 1)

github.com/echocatzh/conv-stft/blob/master/conv_stft/conv_stft.py#L198

def forward(self, inputs):
    """Take input data (audio) to STFT domain and then back to audio.

    Args:
        inputs (tensor): Tensor of floats, with shape [num_batch, num_samples]

    Returns:
        tensor: Reconstructed audio given magnitude and phase.
        Of shape [num_batch, num_samples]
    """
    mag, phase = self.transform(inputs)
    rec_wav = self.inverse(mag, phase)
    return rec_wav

Seto1 · April 4, 2024, 4:18pm

I followed your previous post here.

alexandreribard_unity · April 4, 2024, 4:30pm

Maybe a shape error?

audio = torch.randn(1, 16000 * 30)  # 30 s of dummy audio at 16 kHz
logmel = whisper.log_mel_spectrogram(audio)  # reference output from Whisper
logmelmodel = LogMelNet()
torch.onnx.export(
    logmelmodel, (audio,), "LogMelSpectro.onnx",
    export_params=True, do_constant_folding=True,
    input_names=['audio'], output_names=['log_mel'],
)  # , dynamic_axes={'audio_input': {1: 'n_mels', 2: 'n_ctx'}})
logmel2 = logmelmodel(audio)  # compare against logmel

This works.

Seto1 · April 4, 2024, 4:39pm

What are your versions of PyTorch and conv-stft?
With PyTorch 2.2.2 and conv-stft 0.1.2 (I don’t know why PyPI shows 0.2.0 but installs 0.1.2), I get an error that torch.rfft is not defined. If I switch to torch.fft.rfft, it still reports an error.
With PyTorch 2.2.2 and the GitHub source version of conv-stft, I get the error I mentioned previously.

Seto1 · April 4, 2024, 4:48pm

Here’s a colab notebook reproducing the problem.

alexandreribard_unity · April 4, 2024, 5:02pm

python 3.9, conv-stft latest

Seto1 · April 4, 2024, 5:11pm

I forgot to grant access to the link. Could you please have a look at it? I followed the same instructions you mentioned.

Seto1 · April 6, 2024, 10:11am

Could you please have a look at the notebook that reproduces the problem?
Could you also provide the original LogMelSpectro.onnx?
The one in the repo is in Sentis format. I am trying to build a MindSpore Lite backend for Sentis, and its converter supports ONNX models.

Thank you.

Seto1 · April 7, 2024, 9:18am

I finally fixed it with these lines, using the latest version of conv_stft.

        stft = self.logmelmodel.transform(audio, return_type='magphase')[0]
        magnitudes = stft[..., :-1].abs() ** 2
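
These two lines mirror what whisper.log_mel_spectrogram does itself: square the magnitude to get power and drop the last frame. A rough shape check, assuming the STFT pads the signal the way torch.stft does with center=True (whether conv_stft produces the same frame count depends on its padding settings), looks like this. The helper name is hypothetical.

```python
SAMPLE_RATE = 16000   # Hz, as in the export snippet above
CHUNK_SECONDS = 30
N_FFT = 400
HOP = 160
N_MELS = 80

def centered_stft_frames(n_samples, hop):
    """Frame count of an STFT that pads n_fft // 2 samples on both sides (assumption)."""
    return 1 + n_samples // hop

n_samples = SAMPLE_RATE * CHUNK_SECONDS            # 480000 samples
n_frames = centered_stft_frames(n_samples, HOP)    # one frame more than needed
n_bins = N_FFT // 2 + 1                            # one-sided spectrum
kept_frames = n_frames - 1                         # effect of stft[..., :-1]
mel_shape = (N_MELS, kept_frames)                  # filters @ magnitudes

print(n_frames, n_bins, mel_shape)
```

Under that assumption the mel filterbank of shape (80, 201) multiplies a (201, 3000) power spectrogram, giving the (80, 3000) input the Whisper encoder expects for a 30-second chunk.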

SugiuraTsukasa · May 20, 2024, 1:15am

@alexandreribard_unity Related to this topic, could you please share how to export the Whisper models (encoder, decoder, LogMelSpectro) that Unity published on Hugging Face, via scripts, notebooks, blog posts, etc.? We are interested in other model sizes. It would help a lot of developers. Thanks.

alexandreribard_unity · May 20, 2024, 2:06pm

Yes, we can do a Medium blog post about it, since it seems useful.

Atabek · June 8, 2024, 8:51pm

If you don’t have time for a Medium blog post, you could share a Google Colab example. I also ran into this and need a good guide. I am also thinking about an all-in-one solution with ONNX Olive: encoder and decoder combined into a single model. Is that possible?
