GitHub App Repo: Whisper + Stable Diffusion run locally

Hey all

I wanted to share with our community a technical summary blog post and GitHub repo for a Unity app that runs AI models locally, in the hope that it helps some of you.

I am a beginner and avid learner of AI, and paired with Unity and Sentis it’s a ton of fun! But my learning experience with this tech was slow at first, as I was facing new concepts and beta software at the same time.

I found that fully working examples using other AI libraries in Unity, combined with the TensorFlow (and similar) documentation, helped me understand how these AI models are supposed to work and how they are implemented.

In particular, I found that the examples of Microsoft’s onnxruntime library available online and in public repos gave me a lot of clarity on how to implement these models in Unity, and I got results quickly.

The Unity prototype app (nicknamed Talkomic, as in Chat and create a Comic of images) implements a locally run Whisper model and Stable Diffusion, with its U-Net architecture, on Microsoft’s onnxruntime libraries. The trio of AI models running in Unity transcribes a podcast audio file to text and generates contextual images closely tied to the transcribed text.

You can find more information here:


Thanks for sharing with the community, this is very interesting :slightly_smiling_face:


Very interesting that they got it to work, even though it’s Editor only for now


From the repo I see dependencies on Microsoft.ML.OnnxRuntime* but not on Sentis. Did you end up using it for this project?

Hey @WendelinReich1, the GitHub repo I posted is a 100% onnxruntime implementation. I tried to explain my approach to learning and implementing Sentis in Unity, which may be helpful to others who are also beginners: working Unity projects built on other libraries (onnxruntime) have given me a wealth of guidance on how to implement Sentis in Unity.

Yes, in a separate project I successfully ran Sentis as a proof of concept. However, I patched the bits that didn’t work with Sentis by using onnxruntime instead. I reported these issues on this forum and to Unity. In particular:

  • The U-Net, text encoder and VAE decoder models use the Sentis library

  • The Whisper-tiny and cliptokenizer models use onnxruntime. This is because Sentis currently doesn’t support string tensors, which are the input type for cliptokenizer and the output type for Whisper. Unfortunately I don’t have the time to implement this in C#, as kindly suggested by @alexandreribard_unity

  • Unity reports errors for some ONNX models unless they are stored in the StreamingAssets folder
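
For anyone wondering about the string-tensor point above, here is a minimal sketch of the general idea behind tokenizing in host code, so that only integer tensors reach the inference library. It is in Python for brevity and would need porting to C# for Unity. It is not the code from the repo: the toy `VOCAB`, `UNKNOWN_ID` and the `encode_prompt` helper are hypothetical, and whitespace splitting stands in for CLIP’s real byte-pair encoding. The start/end token ids (49406, 49407) and the context length of 77 are, as far as I know, the values used by the CLIP tokenizer shipped with Stable Diffusion v1, but please check them against your own model.

```python
from typing import Dict, List

# Special token ids and context length of the CLIP tokenizer used by
# Stable Diffusion v1 (assumption: verify against your tokenizer config).
START_TOKEN = 49406
END_TOKEN = 49407      # also used as the padding token in this sketch
CONTEXT_LENGTH = 77

# Hypothetical toy vocabulary. A real tokenizer loads tens of thousands
# of byte-pair-encoding entries from its vocab and merges files.
VOCAB: Dict[str, int] = {"a": 1, "podcast": 2, "about": 3, "robots": 4}
UNKNOWN_ID = 0         # hypothetical id for words missing from VOCAB


def encode_prompt(prompt: str) -> List[int]:
    """Turn a text prompt into a fixed-length list of integer token ids."""
    # Stand-in for BPE: lowercase and split on whitespace.
    word_ids = [VOCAB.get(word, UNKNOWN_ID) for word in prompt.lower().split()]
    # Truncate so the start and end tokens still fit.
    word_ids = word_ids[: CONTEXT_LENGTH - 2]
    tokens = [START_TOKEN] + word_ids + [END_TOKEN]
    # Pad to the fixed context length expected by the text encoder.
    tokens += [END_TOKEN] * (CONTEXT_LENGTH - len(tokens))
    return tokens


if __name__ == "__main__":
    ids = encode_prompt("A podcast about robots")
    print(len(ids), ids[:6])
```

The output is a plain list of integers that can be wrapped in an int tensor for the text encoder, so no string ever has to pass through the model runtime.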

Hope that helps you


Totally, thanks for the detailed answer!

The thing we’re all asking ourselves (I guess) is what kind of performance improvement we can realistically get by using Sentis in such situations, but it’s probably too early to answer that.


Could you detail what you mean by this?

Hi @alexandreribard_unity, absolutely. You may have already seen this document; I sent it over 3 weeks ago. To avoid duplication, may I suggest that @Bill_Cullen share both the sentis-unity project and the PDF document outlining blockers and issues?

As always I remain available to clarify anything further, though I may be slow to respond as I will be travelling for the next 10 days.

Have a great weekend!


Hey @WendelinReich1, I’ve tested a build on Windows 11 and it works fine :+1:

I’ve opened a branch so that the prototype can also be built and deployed. It’s a work in progress, and it will take me some time as I won’t be around for a while.



Yes, your document was reviewed the same day and is known internally as issues 108, 109, and 110. Thank you very much for your time and effort :slightly_smiling_face:
