Automatic Speech Recognition using Unity Sentis

I want to create an Automatic Speech Recognition implementation in Unity using Unity Sentis. The main requirements are that the speech recognition must work offline and should not use any buttons to trigger the recognition, and the target platform is Android. Can anyone help me with this implementation?

Have you tried our Whisper demo to get you started?

Hi Paul,
Thanks for your assistance. I downloaded the model you mentioned and followed the instructions in the model card. But once I click the Play button, I get the message “All compiler errors needs to be fixed before entering play mode”, even though I don’t have any compiler errors. I have checked the Console panel thoroughly. Kindly help me with this.

Hi Paul,
I somehow figured out the issue, and now the audio input is getting transcribed to text without any problems. But what I need to do is transcribe the user’s audio in real time without clicking any buttons and compare the transcribed words with a set of words. The set of words here refers to the commands. So when the transcribed text matches any of the commands in the list, it should carry out some action, such as switching scenes. Kindly help me with this implementation.
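For the command-matching part described above, here is a minimal sketch in plain C#. `CommandMatcher`, `Normalize` and `Match` are hypothetical names for illustration, not part of Unity or Sentis, and the sketch assumes the transcript may arrive with capital letters, punctuation and stray spaces, so both sides are normalized before comparing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of Unity or Sentis.
public static class CommandMatcher
{
    // Lowercases the text, strips punctuation and collapses whitespace,
    // so that " Open the menu." becomes "open the menu".
    public static string Normalize(string text)
    {
        var kept = text.Where(ch => char.IsLetterOrDigit(ch) || char.IsWhiteSpace(ch));
        var words = new string(kept.ToArray()).ToLowerInvariant()
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }

    // Returns the first command that appears as whole words in the transcript,
    // or null if none of the commands matches.
    public static string Match(string transcript, IEnumerable<string> commands)
    {
        // Pad with spaces so that only whole-word matches count.
        string padded = " " + Normalize(transcript) + " ";
        foreach (string command in commands)
        {
            if (padded.Contains(" " + Normalize(command) + " "))
                return command;
        }
        return null;
    }
}
```

In a Unity script you could then switch on the returned command and, for example, call `SceneManager.LoadScene` for the scene-switching case.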

Funny you should mention this: Thomas from Hugging Face has just put up a tutorial for almost exactly this.

Hi Paul,
Thanks for suggesting this one. It came in very handy, and I was able to create a speech recognition tool with the provided instructions. But I want to run the speech recognition without clicking any buttons. I am also trying to use the transcribed text as commands to navigate through my game. Can you provide your suggestions for this implementation?

Did you see the link to this demo? I think it does what you want: it is a robot that responds to voice commands.

To respond to continual audio without buttons you would want to have a look at streaming audio from a microphone. Then you’ll generally want to detect when the volume of the mic goes above a certain threshold to get the start of the recording.
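As a rough sketch of the threshold idea, the plain C# below computes the RMS level of fixed-size windows and reports where the level first reaches a threshold. `VoiceActivity`, `Rms` and `FindSpeechStart` are hypothetical names, and the window size and threshold are assumptions you would need to tune for your microphone. In Unity the samples would typically come from `Microphone.Start` and `AudioClip.GetData`.

```csharp
using System;

// Hypothetical helper for illustration; not part of Unity or Sentis.
public static class VoiceActivity
{
    // Root-mean-square level of `count` samples starting at index `start`.
    public static float Rms(float[] samples, int start, int count)
    {
        if (count <= 0) return 0f;
        double sum = 0.0;
        for (int i = start; i < start + count; i++)
            sum += samples[i] * samples[i];
        return (float)Math.Sqrt(sum / count);
    }

    // Returns the start index of the first full window whose RMS reaches
    // `threshold`, or -1 if no window does. Windows do not overlap.
    public static int FindSpeechStart(float[] samples, int windowSize, float threshold)
    {
        for (int start = 0; start + windowSize <= samples.Length; start += windowSize)
        {
            if (Rms(samples, start, windowSize) >= threshold)
                return start;
        }
        return -1;
    }
}
```

For example, with 16000 Hz audio a `windowSize` of 160 samples corresponds to 10 ms, and you would start collecting audio for the model from the returned index.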

Hi Paul,
I watched Thomas’s demo all the way through. It seems he uses button clicks to trigger the speech recognition. I’ll try the method you just suggested and will let you know the results.

Hi Paul,
I tried the method you suggested, but it didn’t turn out well: the audio is not being recognized. Can you suggest any alternative methods or instructions to achieve this implementation?

Hi @kiranmgcv. One suggestion is that the Whisper model won’t work if there is silence at the beginning of the audioClip. Could this be an issue? Also, the audio needs to be recorded from the mic in mono at 16000 Hz. Are you able to provide an audio clip we can test? Or is it the microphone implementation you are having trouble with?
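To rule out those two issues, a minimal sketch in plain C# could downmix the samples to mono and drop the leading silence before the audio goes to the model. `AudioPrep`, `ToMono` and `TrimLeadingSilence` are hypothetical names, and the silence threshold is an assumption to tune. The sketch does not resample, so it assumes the clip was already recorded at 16000 Hz (for example by passing 16000 as the frequency to `Microphone.Start`).

```csharp
using System;

// Hypothetical helper for illustration; not part of Unity or Sentis.
public static class AudioPrep
{
    // Averages interleaved channels into a single mono channel.
    public static float[] ToMono(float[] interleaved, int channels)
    {
        if (channels <= 1) return (float[])interleaved.Clone();
        int frames = interleaved.Length / channels;
        var mono = new float[frames];
        for (int f = 0; f < frames; f++)
        {
            float sum = 0f;
            for (int c = 0; c < channels; c++)
                sum += interleaved[f * channels + c];
            mono[f] = sum / channels;
        }
        return mono;
    }

    // Drops leading samples whose absolute value is below `threshold`.
    // Returns an empty array if every sample is below the threshold.
    public static float[] TrimLeadingSilence(float[] samples, float threshold)
    {
        int start = 0;
        while (start < samples.Length && Math.Abs(samples[start]) < threshold)
            start++;
        var trimmed = new float[samples.Length - start];
        Array.Copy(samples, start, trimmed, 0, trimmed.Length);
        return trimmed;
    }
}
```

The input array would typically be filled with `AudioClip.GetData`, with `channels` taken from the clip’s `channels` property.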

I noticed that the model is in Sentis format. Is there a tutorial or script where the model is in ONNX format, so that I can try other ASR models too?

You can convert any ONNX model to the Sentis format, which is advisable for large files.

Alternatively just use:

using Unity.Sentis;

public ModelAsset yourONNXmodel; // field on your MonoBehaviour, assigned in the Inspector

var model = ModelLoader.Load(yourONNXmodel); // e.g. in Start()

and drop the ONNX file onto that field in the Inspector.