How to use a TTS model with Unity Sentis?

Hello folks,

I’d like to add a Text-to-Speech feature to a project. I was wondering whether Unity Sentis can be used to achieve this with a model from Hugging Face, and if so, how?

Best,


but the model is not that great.

I’m also interested in a lightweight, fast, preferably local model (for mobile). Did you have any luck in your search?

You could try converting a massively multilingual model to ONNX. I’m not sure about local performance, but when running in the cloud the quality is quite alright.

Hi!

Working on a game with a state-based structure? This might be exactly what you need:

It’s voice acting with emotions, not just TTS!
Generated in real time on device (no cloud), and it works on old mobiles as well.

Looking for people to test it out right now!

Hi osgseb,

Unity Sentis is great for running lightweight ONNX models inside Unity, but full Text-to-Speech (TTS) pipelines are usually too complex for it. Most Hugging Face TTS models, such as Tacotron2 or Bark, involve audio synthesis steps that require a GPU and specialized operators not supported by Sentis. Even if you convert such a model to ONNX, it’s unlikely to run efficiently, or at all, within Unity.

A more practical approach is to handle TTS outside Unity. You can run a Hugging Face TTS model in Python using libraries like transformers or TTS, convert text into audio, and then stream the resulting audio into Unity. This can be done by setting up a local server with Flask or FastAPI, which Unity calls to get the audio in real time.
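To illustrate the glue between that Python server and Unity, here is a minimal, hedged sketch using only the standard library: it packs raw mono float samples (the kind of array a TTS model typically produces) into a 16-bit PCM WAV blob that a Flask or FastAPI endpoint could return, and that Unity can decode with `UnityWebRequestMultimedia.GetAudioClip`. The function name is illustrative, and a sine tone stands in for real model output.

```python
import io
import math
import struct
import wave

def pcm_to_wav_bytes(samples, sample_rate=22050):
    """Pack mono float samples in [-1, 1] into a 16-bit PCM WAV blob.

    A TTS model's raw output (e.g. an array of floats from a Hugging
    Face pipeline) can be packaged like this before being sent to
    Unity for playback as an AudioClip.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wav.writeframes(frames)
    return buf.getvalue()

# Stand-in for model output: one second of a 440 Hz tone. A real
# server would pass the synthesized speech samples here instead.
tone = [math.sin(2 * math.pi * 440 * t / 22050) for t in range(22050)]
wav_bytes = pcm_to_wav_bytes(tone)
print(wav_bytes[:4])  # b'RIFF'
```

In a real setup, the `wav_bytes` value would be the body of the HTTP response (with `Content-Type: audio/wav`) that your Flask or FastAPI route returns to Unity.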

In short, Sentis isn’t built for full audio generation tasks. It’s better to keep the TTS processing external and let Unity handle playback only. This way, you get high-quality speech and a smooth Unity experience.

Regards