Struggling with audio spectrogram tensors

Hi there!
I recently had a lot of fun playing with the tinyWhisper, jets, and tinyStories models you put on huggingface (amazing work, btw) and running them on a Quest 2 headset.
I am now trying to build a reliable speech recognition pipeline, but I keep running into domain-specific words the model does not recognize. Running larger models has proved too resource-intensive so far, so I want to try training my own lightweight keyword recognition model, like in the tensorflow tutorial.

My issue is that I can't match the spectrogram input in Unity (tinyWhisper's LogMel) with the one used for training (tf.signal.stft). As I understand it, LogMel adds steps on top of the short-time Fourier transform, but I am fairly lost trying to replicate those steps at training time.
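For reference, here is roughly what I have been trying on the training side: building a log-mel spectrogram on top of tf.signal.stft. The parameters (16 kHz sample rate, 400-sample window, 160-sample hop, 80 mel bins) are my assumption of Whisper-style settings and may not match tinyWhisper's actual LogMel exactly, which is exactly what I'd like to confirm:

```python
import tensorflow as tf

# Assumed Whisper-style front-end parameters -- please correct me if
# tinyWhisper's LogMel uses different values.
SAMPLE_RATE = 16000
N_FFT = 400    # 25 ms window at 16 kHz
HOP = 160      # 10 ms hop
N_MELS = 80

def log_mel_spectrogram(waveform):
    """Sketch of a log-mel front-end built on tf.signal.stft."""
    # Short-time Fourier transform -> power spectrogram
    stft = tf.signal.stft(
        waveform, frame_length=N_FFT, frame_step=HOP, fft_length=N_FFT)
    power = tf.abs(stft) ** 2
    # Project linear-frequency bins onto the mel scale
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=N_MELS,
        num_spectrogram_bins=N_FFT // 2 + 1,
        sample_rate=SAMPLE_RATE)
    mel = tf.matmul(power, mel_matrix)
    # Log compression; the epsilon and any clamping/normalisation here
    # are guesses on my part and probably where my mismatch comes from
    return tf.math.log(mel + 1e-6)
```

My main uncertainty is the last step: whether the model expects natural log or log10, and what clamping/normalisation it applies after.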

If you could give me some pointers on producing audio input tensors that are consistent between Sentis and Jupyter/Colab, I would be very grateful :slight_smile:

Cheers!