Automatically capturing user speech in Meta Quest 3

Hi 🙂

I’m creating an app in which a user chats with an NPC powered by OpenAI. I want the app to automatically detect when the user starts speaking, process the microphone input (for example, send the audio to OpenAI for speech-to-text transcription), and detect when the speech has stopped.

Meta’s Wit AI can capture mic audio and transcribe it, but it offers no automatic voice detection: you have to press a key or button first to tell it you’re speaking, and I don’t want that. Can anyone point me toward an existing software solution that does what I want?

Maybe read the mic’s level in decibels?
Start recording when it goes over a threshold, then stop once it has stayed under a threshold for a few seconds.
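
A rough sketch of that threshold idea, in Python since that is what the voice detection later in this thread runs on. Everything here is an assumption for illustration: the function names, the 20 ms frame size, the -30/-40 dB thresholds, and the 1 second hangover are hypothetical and would need tuning for a real headset mic. It expects frames of 16-bit PCM samples and reports which frames belong to each utterance.

```python
import math
from typing import Iterable, Iterator, Optional, Sequence, Tuple


def frame_dbfs(samples: Sequence[float], full_scale: float = 32768.0) -> float:
    """RMS level of one frame, in dB relative to full scale (0 dB = loudest)."""
    if not samples:
        return -math.inf
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return -math.inf  # digital silence
    return 10.0 * math.log10(mean_square / (full_scale * full_scale))


def detect_utterances(
    frames: Iterable[Sequence[float]],
    frame_ms: int = 20,        # assumed frame length
    start_db: float = -30.0,   # assumed "speech started" threshold
    stop_db: float = -40.0,    # assumed "quiet" threshold (lower, to avoid flicker)
    hangover_ms: int = 1000,   # how long it must stay quiet before we stop
) -> Iterator[Tuple[int, int]]:
    """Yield (start_frame, end_frame) pairs, end exclusive, one per utterance."""
    hangover_frames = max(1, hangover_ms // frame_ms)
    start: Optional[int] = None  # index of the first frame of the current utterance
    quiet = 0                    # consecutive quiet frames seen while recording
    count = 0                    # total frames consumed so far

    for i, frame in enumerate(frames):
        count = i + 1
        level = frame_dbfs(frame)
        if start is None:
            if level >= start_db:
                start, quiet = i, 0
        elif level < stop_db:
            quiet += 1
            if quiet >= hangover_frames:
                # The utterance ended at the first frame of the quiet run.
                yield (start, i - quiet + 1)
                start = None
        else:
            quiet = 0  # still talking (or a short pause ended)

    if start is not None:
        # Stream ended mid-utterance; drop any trailing quiet frames.
        yield (start, count - quiet)


if __name__ == "__main__":
    loud, silent = [10000] * 320, [10] * 320
    stream = [silent] * 5 + [loud] * 10 + [silent] * 60 + [loud] * 3 + [silent] * 2
    for first, last in detect_utterances(stream):
        print(f"utterance: frames {first}..{last - 1}")
```

If the samples are floats in [-1, 1] rather than 16-bit integers, pass `full_scale=1.0` to `frame_dbfs`. Using a lower stop threshold than the start threshold keeps short dips in volume from ending the recording early, but a pure level check like this cannot tell speech from other loud sounds.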

This is why most smart assistants use a wake phrase like “Hey Google” to start recording.

Thanks for the suggestion. I tried that; it worked sometimes, but not reliably. I found a Python library that does pretty good voice detection and ran its code in a WebSocket server connected to my Unity app. It’s a bit messy, but it works.

@Tyke18 is this still working or have you come up with something better? Also:

  • Can you share which Python library helped?
  • Does Wit AI offer speech-to-text transcription too, or did you find that OpenAI was the only solution?

@Tyke18 @FarmerInATechStack I have the same questions as FarmerInATechStack.
I’m trying to use Azure streaming speech-to-text for the transcription, and I also need automatic voice detection so that recording stops correctly.
I’d also like to ask:

  • Can you please point me to the Python library that helped?
  • (Same question) Does Wit AI offer speech-to-text transcription too, or did you find that OpenAI was the only solution?

@carton22liu_unity I’ve gotten text-to-speech working using the OpenAI options and some scripts for microphone recording. However, I’m not doing automatic speech detection. I press a button to start recording from the mic.

If interested, I’m also on Discord at farmerinatechstack

This is what I use. It’s cheap ($60), effective, fast, and local, and it will even give you a breakdown of when most of the syllables are spoken. It works best with English, but roughly 20-30 languages in total work at lower quality: Undertone - Offline Whisper AI Voice Recognition | AI-ML Integration | Unity Asset Store

Nice. You can probably skip that if you’re up for using the APIs directly, but it can also be really nice to have a solution that “just works” and that someone else maintains.