Some whisper / stable diffusion results

Oh boy you took the hard way… But congratulations on implementing a C# STFT.
I think we’ll add support for it natively in Sentis in an upcoming release.
I did the following:

import torch
from import mel_filters
from conv_stft import STFT

audio = torch.randn(1, 16000*30)

class LogMelNet(torch.nn.Module):
    def __init__(self):
        super(LogMelNet, self).__init__()
        self.filters = mel_filters('cpu',
        self.logmelmodel = STFT(win_len=400, win_hop=160, fft_len=400)

    def forward(self, audio):
        magnitudes = self.logmelmodel(audio)
        mel_spec = self.filters @ magnitudes

        log_spec = torch.clamp(mel_spec, min=1e-10).log10()
        log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
        log_spec = (log_spec + 4.0) / 4.0
        return log_spec

logmelmodel = LogMelNet()
torch.onnx.export(
    logmelmodel,
    (audio,),
    "LogMelSepctro.onnx",
    export_params=True,
    do_constant_folding=True,
    input_names=['audio'],
    output_names=['log_mel'],
    # dynamic_axes={'audio_input' : {1 : 'n_mels', 2 : 'n_ctx'}},
)

That gives you an ONNX file that transforms an audio tensor (here 30 seconds of samples at 16 kHz) into a log-mel spectrogram that you can feed into Whisper.
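
If you want to sanity-check the output values against your C# version, the last three lines of forward are easy to reproduce by hand. Here is a minimal sketch in plain Python (no torch needed) of just that post-processing step. The function name log_mel_normalize and the toy input are my own illustration, not part of the export script, and it only covers the clamp / log10 / range-limit / rescale part, not the STFT or the mel filters.

```python
import math

def log_mel_normalize(mel_spec, floor=1e-10, dynamic_range=8.0):
    """Mirror the log-scaling lines of LogMelNet.forward on a nested list.

    mel_spec is a list of rows (mel bins) of non-negative floats.
    Hypothetical helper for checking values by hand.
    """
    # Clamp to a small floor so log10 never sees zero, then take log10.
    log_spec = [[math.log10(max(v, floor)) for v in row] for row in mel_spec]
    # Keep everything within `dynamic_range` decades of the global peak.
    peak = max(max(row) for row in log_spec)
    log_spec = [[max(v, peak - dynamic_range) for v in row] for row in log_spec]
    # Shift and scale, same constants as in the export script.
    return [[(v + 4.0) / 4.0 for v in row] for row in log_spec]

# Toy 2x2 "mel spectrogram": values chosen so each step is visible.
toy = [[1.0, 1e-3], [0.0, 1e-12]]
out = log_mel_normalize(toy)
print(out)
```

With this toy input the peak is log10(1.0) = 0, so anything below -8 is raised to -8 before the final rescale. That range limit is why silent regions all end up at the same value in the output.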