Some whisper / stable diffusion results

Hey all, I had some fun adding whisper-tiny and u-net stable diffusion models to Unity.
The main challenges are the lack of a string tensor (or a converter) for text input and output, and the RAM these models hold until Unity is closed. Reloading assemblies takes 20-30 seconds.

Whisper-tiny video demo

stablediffusion video demo

Whisper-tiny is super fast. Stable Diffusion takes about 40 seconds for 15 steps and over 1 minute for 25 steps:
1 step - promising!

2 steps

5 steps

10 steps

15 steps

25 steps

I am developing on a Windows 11 Pro machine: 64GB RAM; CPU 12th Gen Intel i9-12900K, 3400 MHz, 16 cores; GPU NVIDIA GeForce RTX 3090 Ti 24GB. The stats below are from running the Stable Diffusion scene in Unity, with nothing else running but Unity.

If I run 15 steps, RAM usage reaches 55% and holds at that level thereafter. Playing Unity multiple times without closing it raises RAM usage to 70%, where it holds. GPU usage maxes out at 66%.

A good starting point for proof of concept, all great fun!


Nice! At first I thought the whisper model was creating the speech so I thought “that sounds realistic!” But it’s transcribing the speech. Still good though.

I see you have quite a hefty computer setup. I think most people will have about 4-8GB RAM so yes, I think getting RAM down will be a big deal.

If you want a simple text vector just look up the tokens for START_TOKEN, “cat”, END_TOKEN in the vocab file and run them through the text_encoder. Should work as a quick test. The start and end tokens might not even be needed. Are you using guidance?
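The token lookup suggested above can be sketched roughly as follows. This is only an illustration: `TOY_VOCAB` and its ids are hypothetical placeholders (the real ids come from the vocab file), the `</w>` suffix follows the CLIP BPE convention, and the padding to 77 entries with the end token assumes the usual Stable Diffusion v1 text encoder setup. It only works for words that exist as a single entry in the vocab.

```python
# Hypothetical toy vocabulary: real ids must be read from the model's vocab file.
TOY_VOCAB = {"<|startoftext|>": 0, "<|endoftext|>": 1, "cat</w>": 2}


def build_prompt_ids(words, vocab, max_length=77,
                     start="<|startoftext|>", end="<|endoftext|>"):
    """Wrap whole-word token ids in start/end tokens and pad to max_length."""
    ids = [vocab[start]]
    for word in words:
        # CLIP-style vocabularies mark the end of a word with "</w>".
        ids.append(vocab[word.lower() + "</w>"])
    ids.append(vocab[end])
    if len(ids) > max_length:
        raise ValueError("prompt is longer than max_length")
    # Assumption: padding uses the end token, as in the usual SD v1 setup.
    ids += [vocab[end]] * (max_length - len(ids))
    return ids


ids = build_prompt_ids(["cat"], TOY_VOCAB)
print(ids[:4], len(ids))
```

The resulting list would then be copied into an integer tensor and passed to the text_encoder.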

Apart from RAM leakage, a trick I used with ONNX Runtime to get RAM down was to convert all weights into inputs, then load the weights one by one from external files and bind them to the GPU as soon as they were loaded. That way hardly any RAM was used at all. I’m not sure this is possible yet with Sentis, as it seems you can only put the inputs on the GPU at execute time, which means they have to sit in RAM until then.

Good work :smiley::+1:

That’s funny that you have to do that @yoonitee
The issue with RAM and model is that the asset manager keeps the model in RAM all the time after import.
Normally by closing and re-opening the project this goes away.
During run time we shouldn’t load all weights unless used, so should be the same as your setup.
But we’ll investigate

Hey @yoonitee apologies for the late response and thanks a bunch for your comments, I am new to AI and it’s been a lot of trial-and-error to get it working.

Performance-wise, I am not sure whether the models need optimizing, but they are very RAM hungry. @alexandreribard_unity I didn’t try to deploy yet, but if all weights are used I can only guess the resources consumed would be similar, and that would make this challenging anywhere. Thanks for this guidance, I will also do some research on how to work out which weights I can do without.

Really appreciate the pointers to improve @yoonitee , they sound really cool, I will try them out as soon as I’m back! :partying_face:

Yes I’m using guidance with scale 7.5 though I didn’t fudge with this yet.
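For readers new to the term, the guidance scale here presumably refers to classifier-free guidance, where the U-Net is run once with the empty prompt and once with the text prompt, and the two noise predictions are blended. A minimal sketch on plain lists, with made-up numbers standing in for the noise tensors:

```python
def apply_guidance(noise_uncond, noise_text, scale=7.5):
    """Classifier-free guidance: push the prediction away from the
    unconditional result and towards the text-conditioned one."""
    return [u + scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


# Illustrative values only; real inputs are the two U-Net outputs.
uncond = [0.0, 1.0, -2.0]
text = [1.0, 1.0, 0.0]
guided = apply_guidance(uncond, text)
print(guided)
```

With a scale of 1 the result equals the text-conditioned prediction; larger values follow the prompt more strongly.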

Thanks again for your encouraging words ! :sunglasses:

@RendergonPolygons No problem. Yeah, it’s not easy to get Stable Diffusion working with all the different parts, so you’re doing well. One thing to remember is to dispose of tensors before reassigning them. When I got my version running, it did use a lot of RAM to start with, but then it went down once the data went onto the GPU. So maybe you just have a leaky tensor somewhere :slight_smile:

@alexandreribard_unity I only had to do the weight/input trick when trying to run really large models with ONNX Runtime (not Sentis), such as 5GB LLMs, since even in float16 format they were maxing out my 12GB RAM. It went from maxing out my RAM to using virtually none, since the weights more or less went straight from the hard disk to the GPU. Something similar might work in Sentis to run giant models with limited RAM. IDK. I had some more details here: GitHub - pauldog/FastOnnxLoader: Loads in onnx files with less RAM


Would love to know how you got the Whisper model to run. I converted it to ONNX but keep getting stuck on the input.

Hey @Clrj14 happy to help, can you give more details on which step you are stuck at and the error shown, please?

Hi @RendergonPolygons. I just wondered, have you seen the sample in the Unity package that lets you run a few layers of the model at a time per frame? It means you can keep the main graphics going while it is thinking about AI stuff. I ask because you are into 3D art and the like: how do you think the real-time aspects of all this work for interactivity?

Hey @yoonitee I wasn’t aware and I’ll look for it, thanks for sharing. You raise a very interesting aspect, and yes it can yield super cool results, but I haven’t thought much about this yet tbh. I’m still getting my hands around the basics really.

Hi guys! @RendergonPolygons @yoonitee

I am trying to run Whisper on Sentis. I have done the export to ONNX and imported the model.

Beyond converting the sound to mono and making sure it has the same frequency as the Whisper model, what function do you use to tokenize the sound?

Do you create your own functions or use a public library? I’ve been searching for a couple of days but I haven’t found anything to do the tokenization process. If you know of any guide to creating a tokenize function, I would appreciate it a lot.

Whisper you need to do the following:


@alexandreribard_unity Thank you for your reply!

I have some questions: what library do you use to convert an AudioClip to LogMel in C#, or do you use some type of bridge to invoke a function from C# to Python and then get the result?

When you say in point 3 to pass empty tokens, is that an array of nulls with the size of the AudioClip?

For step 4 I have found some options in C# like TiktokenSharp and SharpToken.

Last I checked, the OpenAI implementation started with [50258, 50259, 50359], so I used that.
For LogMel you need to dig a bit deeper into what is actually happening.
If you check the code of the logmel transform you’ll see that it is a Short-time Fourier Transform.
Ofc that is a complex number operation, so either you write your own, or you can write it as a series of 1D convolutions, which you can export to ONNX or write directly in Sentis
(GitHub - echocatzh/conv-stft: A STFT/iSTFT written up in PyTorch using 1D Convolutions; this can help)

We could release that as a sample project if it is of interest
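To illustrate the STFT-as-1D-convolution idea mentioned above, here is a minimal pure-Python sketch. It is not the conv-stft implementation: all function names are made up for this example, the window is assumed to be a periodic Hann window, and no padding is applied (Whisper's own preprocessing uses an FFT length of 400 and a hop of 160 with centred padding). Each frequency bin becomes one cosine and one sine kernel, and sliding those kernels over the audio with a stride equal to the hop gives the real and imaginary parts of the STFT.

```python
import cmath
import math


def hann(n_points):
    """Periodic Hann window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_points) for n in range(n_points)]


def dft_kernels(fft_len):
    """One cosine (real) and one negative sine (imaginary) kernel per bin."""
    bins = fft_len // 2 + 1
    real = [[math.cos(2 * math.pi * k * n / fft_len) for n in range(fft_len)]
            for k in range(bins)]
    imag = [[-math.sin(2 * math.pi * k * n / fft_len) for n in range(fft_len)]
            for k in range(bins)]
    return real, imag


def conv1d_strided(signal, kernel, hop):
    """Unpadded 1D cross-correlation with stride `hop`."""
    width = len(kernel)
    return [sum(signal[start + i] * kernel[i] for i in range(width))
            for start in range(0, len(signal) - width + 1, hop)]


def stft_magnitude(signal, fft_len, hop):
    """Magnitude spectrogram as nested lists with shape (bins, frames)."""
    window = hann(fft_len)
    real_k, imag_k = dft_kernels(fft_len)
    mags = []
    for rk, ik in zip(real_k, imag_k):
        # Folding the window into the kernels keeps it a plain convolution.
        re = conv1d_strided(signal, [w * c for w, c in zip(window, rk)], hop)
        im = conv1d_strided(signal, [w * c for w, c in zip(window, ik)], hop)
        mags.append([math.hypot(a, b) for a, b in zip(re, im)])
    return mags


def reference_bin(frame, window, k):
    """Direct windowed DFT of one frame, used only to check the result."""
    n_fft = len(frame)
    return abs(sum(frame[n] * window[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                   for n in range(n_fft)))


# A sine that sits exactly on bin 2 of a 16-point DFT.
signal = [math.sin(2 * math.pi * 2 * n / 16) for n in range(32)]
mags = stft_magnitude(signal, fft_len=16, hop=4)
print(len(mags), len(mags[0]))  # bins, frames
```

In a framework such as PyTorch the same kernels would be stacked into the weight of a single strided 1D convolution, which is why this form of the transform can be exported to ONNX.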


Hello @alexandreribard_unity !

I would really appreciate an example. I have implemented the STFT in pure C# without using Sentis and it is quite slow: about 2 minutes for 30 seconds of audio, just for the STFT step.

If you have an example at hand of a Sentis implementation of the STFT calculation, or an example of the complete flow, I would appreciate it a lot. It is sure to be of help to many others too.

I have tried using this example to replicate the transform and the inverse in sentis but I have had no luck.

Thank you very much and greetings!

Oh boy you took the hard way… But congratulations on implementing a C# STFT.
I think we’ll add support for it natively in Sentis in an upcoming release.
I did the following:

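import torch
# Assumption: the module path is missing from this line as posted;
# openai-whisper ships a helper with this name in whisper.audio.
from whisper.audio import mel_filters
from conv_stft import STFT

# 30 seconds of dummy audio at 16 kHz, used only to trace the graph for export
audio = torch.randn(1, 16000*30)

class LogMelNet(torch.nn.Module):
    def __init__(self):
        super(LogMelNet, self).__init__()
        # Assumption: the argument list is cut off as posted;
        # 80 is the number of mel bands Whisper-tiny expects.
        self.filters = mel_filters('cpu', 80)
        self.logmelmodel = STFT(win_len=400, win_hop=160, fft_len=400)

    def forward(self, audio):
        magnitudes = self.logmelmodel(audio)
        # Project the STFT output onto the mel filter bank
        mel_spec = self.filters @ magnitudes

        # Log-scale, limit the dynamic range and normalize, as in Whisper's preprocessing
        log_spec = torch.clamp(mel_spec, min=1e-10).log10()
        log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
        log_spec = (log_spec + 4.0) / 4.0
        return log_spec

logmelmodel = LogMelNet()
torch.onnx.export(
    logmelmodel, (audio,), "LogMelSepctro.onnx",
    export_params=True, do_constant_folding=True,
    input_names=['audio'], output_names=['log_mel'],
)  #, dynamic_axes={'audio_input' : {1 : 'n_mels', 2 : 'n_ctx'}})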

That gives you an ONNX file that transforms audio into a log-mel spectrogram that you can feed into Whisper.


Thank you!!! @alexandreribard_unity i will try this :slight_smile:

Hello @alexandreribard_unity ,

I’ve been trying for a couple of days but I haven’t succeeded.

The main problem I have is the format of the matrices, the matrix generated by STFT has a shape of 1x4780000 and the matrix generated by mel is 80 x 201.

If I try to do it without any modification it tells me that it is not possible to multiply a mat1 with mat2.

I am not sure how to process the matrix so that it has a form that allows its multiplication without generating an erroneous result when modifying the multiplication.

mel_spec = self.filters @ magnitudes
mel_spec =( 80,201) @ (1,4780000) (is not possible).

I have been trying a couple of things, following the Olive - ONNX implementation but I can’t fully understand the Slice that it does. I understand that it generates a shape (1,1,4780000) but then it does the gather with index 0 and 1, I guess that the STFT returns another shape.

I have also tried to generate it using this, but in this case I have not been able to execute it correctly in python due to library incompatibility but it seems interesting: GitHub - adobe-research/convmelspec: Convmelspec: Convertible Melspectrograms via 1D Convolutions

If you have any clue as to how I should organize the matrix, I would greatly appreciate it!
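As a rough way to reason about the shapes in the question above, the sketch below computes what a one-sided STFT would be expected to produce and checks whether the mel filter matrix can multiply it. The helper names are made up for this example, and the centred-padding frame count is an assumption based on how `torch.stft` behaves by default, not on conv-stft.

```python
def stft_shape(n_samples, fft_len, hop, center=True):
    """Expected (freq_bins, frames) of a one-sided STFT."""
    bins = fft_len // 2 + 1
    if center:
        # Assumption: the signal is padded by fft_len // 2 on both sides.
        n_samples += 2 * (fft_len // 2)
    frames = 1 + (n_samples - fft_len) // hop
    return bins, frames


def matmul_shape(a, b):
    """Result shape of a 2D matrix `a` times a (batched) matrix `b`."""
    if len(b) < 2 or a[-1] != b[-2]:
        raise ValueError(f"cannot multiply {a} with {b}")
    return b[:-2] + (a[0], b[-1])


bins, frames = stft_shape(16000 * 30, fft_len=400, hop=160)
print(bins, frames)
print(matmul_shape((80, 201), (1, bins, frames)))
```

Under these assumptions the magnitudes need a shape like (1, 201, frames) for the (80, 201) filter matrix to apply. A tensor of shape (1, N), with N close to the number of audio samples, has the shape of a waveform rather than a spectrogram, so it may be worth checking which method of the STFT module returns the magnitudes.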

Hi there, would it be possible to post the onnx model you have? Thanks.

We now have a Whisper Tiny example on Hugging Face.