Voice Recognition

Hey guys!

I am creating a small game that uses Windows 10 / Unity voice recognition.

Namely, the KeywordRecognizer. I have it working quite well, but I am stumped on something rather specific. Basically, I want single keywords to be detected even when they are spoken as part of a normal sentence. Let’s say the word I want to detect is ‘Apple’. If the player says something like ‘Move the apple’, it never registers, because the recognizer treats ‘move the apple’ as the entire phrase. If I just say ‘apple’, it works great. I am trying to compensate for players who say extra words around the correct keyword.

Another approach I tried was making the keyword element ‘Move the apple’, and this works great. The problem comes if the next command is ‘Move the pear’. If the player says ‘Move the apple, move the pear’, it only ever picks up the first part, ‘move the apple’. The second part is ignored because the KeywordRecognizer is still busy parsing the first part and cannot listen while it is figuring out what was said.

So one would surmise, let’s make the keyword element ‘move the apple move the pear’. Yes, that would work, but the problem then is: what if I as the player don’t want to move them both?

Basically, this is all very abstracted from what I am actually doing with my game - but I am trying to figure out a way to make this more responsive and dynamic.

Thanks guys!

I haven’t used Unity voice recognition before, and a quick search didn’t turn up any documentation specifically addressing this issue, but I would have expected something called KeywordRecognizer to recognize its trigger word even if it occurs as part of a longer phrase. My first guess would be that it isn’t intentionally filtering out longer phrases, but simply has more difficulty in correctly recognizing the word when there’s more “noise” around it. Have you tried turning down the confidence threshold to see if you can get it to pick “apple” out of “move the apple”?

If I’m wrong and KeywordRecognizer only checks whether an entire phrase (separated by long pauses) matches its keywords, then you might try using the DictationRecognizer instead to turn everything the user says into strings and then search the strings for the words you’re interested in.
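Here is a minimal sketch of that DictationRecognizer idea, untested against your project. The class name DictationKeywordSpotter and the keyword list are made up for illustration. As far as I can tell from the Unity docs, dictation cannot run at the same time as a KeywordRecognizer, so you may need to call PhraseRecognitionSystem.Shutdown() first, and dictation has to be enabled in the Windows speech privacy settings.

```csharp
using UnityEngine;
using UnityEngine.Windows.Speech;

// Hypothetical example component: the class name and keyword list are illustrative only.
public class DictationKeywordSpotter : MonoBehaviour
{
    private static readonly string[] Keywords = { "apple", "pear" };
    private DictationRecognizer dictationRecognizer;

    void Start()
    {
        dictationRecognizer = new DictationRecognizer();
        dictationRecognizer.DictationResult += OnDictationResult;
        dictationRecognizer.DictationComplete += OnDictationComplete;
        dictationRecognizer.Start();
    }

    // Called once the recognizer has settled on a full piece of recognized text.
    private void OnDictationResult(string text, ConfidenceLevel confidence)
    {
        string spoken = text.ToLowerInvariant();
        foreach (string keyword in Keywords)
        {
            // Plain substring match: "apple" would also match "pineapple".
            if (spoken.Contains(keyword))
            {
                Debug.Log("Found keyword '" + keyword + "' in: " + text);
            }
        }
    }

    // Dictation sessions end on their own (for example after a silence timeout).
    private void OnDictationComplete(DictationCompletionCause cause)
    {
        Debug.Log("Dictation stopped: " + cause);
    }

    void OnDestroy()
    {
        if (dictationRecognizer != null)
        {
            dictationRecognizer.DictationResult -= OnDictationResult;
            dictationRecognizer.DictationComplete -= OnDictationComplete;
            dictationRecognizer.Dispose();
        }
    }
}
```

With this, saying ‘Move the apple’ should log the ‘apple’ keyword, because the whole sentence arrives as text and the keyword is searched for afterwards.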

An alternative approach would be to require the user to explicitly signal when a new command starts or stops. For instance, the game In Verbis Virtus requires the user to hold down a button while speaking a command (“magic spell”) and release that button when finished; this allows you to ignore all the audio before and after what the user believes is important. Not sure how easy it is to add that sort of pre-processing in front of Unity’s speech classes, but in principle, it’s one way of addressing the separation problem.
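A rough sketch of the hold-to-talk idea with Unity’s own classes might look like the following. The class name PushToTalkRecognizer, the choice of the space bar, and the keyword list are my assumptions, and it uses the legacy Input class, so adapt it if your project uses the newer Input System.

```csharp
using UnityEngine;
using UnityEngine.Windows.Speech;

// Hypothetical example component: listens only while the space bar is held down.
public class PushToTalkRecognizer : MonoBehaviour
{
    private KeywordRecognizer keywordRecognizer;

    void Start()
    {
        // Create the recognizer but do not start it until the key is pressed.
        keywordRecognizer = new KeywordRecognizer(new[] { "apple", "pear" }, ConfidenceLevel.Low);
        keywordRecognizer.OnPhraseRecognized += OnPhraseRecognized;
    }

    void Update()
    {
        // Key pressed: start listening.
        if (Input.GetKeyDown(KeyCode.Space) && !keywordRecognizer.IsRunning)
        {
            keywordRecognizer.Start();
        }

        // Key released: stop listening.
        if (Input.GetKeyUp(KeyCode.Space) && keywordRecognizer.IsRunning)
        {
            keywordRecognizer.Stop();
        }
    }

    private void OnPhraseRecognized(PhraseRecognizedEventArgs args)
    {
        Debug.Log("Heard: " + args.text);
    }

    void OnDestroy()
    {
        if (keywordRecognizer != null)
        {
            keywordRecognizer.OnPhraseRecognized -= OnPhraseRecognized;
            keywordRecognizer.Dispose();
        }
    }
}
```

One thing I would watch for: stopping the recognizer the moment the key is released might cut off a phrase that is still being processed, so a short delay before calling Stop() may be needed.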

Yes, that is what I was expecting as well. I have been using a Low confidence level, and the only noticeable effect has been that words which rhyme with the keyword(s) now trigger it. I still can’t get it to pick out a single keyword from a sentence.

I was looking into Dictation as well but I’m not sure if that is something which would be needed. I also was struggling to find a good example of it in use.

A button press to signal the start of the KeywordRecognizer might not be a bad idea. I guess my issue now is this: let’s say my keywords are ‘The apple is red’ and ‘The pear is green’. If I say both, then once it picks up the first phrase there is a lag between that phrase being registered and the recognizer listening again, so it ultimately always misses the second phrase. So I guess I am looking for some slick way to have it always listening while also parsing, in the background, the words which have already been said.

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.Windows.Speech;
using System.Linq;

public class TestSpeech : MonoBehaviour
{

    private KeywordRecognizer keywordRecognizer;
    private Dictionary<string, System.Action> keywords = new Dictionary<string, System.Action>();
    private List<string> words = new List<string>();
    private string currentWord;


    // Use this for initialization
    void Start()
    {

        words.Add("The apple is red");
        words.Add("The pear is green");


        // Map every phrase to the same handler; the recognized text itself is stored in currentWord.
        foreach (string phrase in words)
        {
            keywords.Add(phrase, KeywordCalled);
        }

        keywordRecognizer = new KeywordRecognizer(keywords.Keys.ToArray(), ConfidenceLevel.Low);
        keywordRecognizer.OnPhraseRecognized += KeywordRecognizerOnPhraseRecognized;

        if (!keywordRecognizer.IsRunning)
        {
            keywordRecognizer.Start();
        }

    }

   
    void KeywordRecognizerOnPhraseRecognized(PhraseRecognizedEventArgs args)
    {
        System.Action keywordAction;

        if (keywords.TryGetValue(args.text, out keywordAction))
        {
            currentWord = args.text;
            keywordAction.Invoke();
        }

    }

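    void KeywordCalled()
    {
        print("You just said: " + currentWord);
    }

    // Stop listening and release the speech resources when this object is destroyed.
    void OnDestroy()
    {
        if (keywordRecognizer != null)
        {
            if (keywordRecognizer.IsRunning)
            {
                keywordRecognizer.Stop();
            }
            keywordRecognizer.OnPhraseRecognized -= KeywordRecognizerOnPhraseRecognized;
            keywordRecognizer.Dispose();
        }
    }

}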

Add this to any object in your Scene (assuming you are running Windows 10) and you can see what I am talking about. Try saying the two phrases back to back. It doesn’t always get both of them.
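For the ‘always listening, parse afterwards’ idea, one option is to scan a transcript for every known phrase and handle the hits in spoken order. The sketch below is plain C# with no Unity dependency. The names CommandScanner and FindPhrases are hypothetical, and it assumes you already have the spoken text as a string (for example from a DictationRecognizer result).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: finds known phrases inside a longer transcript.
public static class CommandScanner
{
    // Returns every known phrase found in the transcript, in the order it was spoken.
    // Matching is a case-insensitive substring search, so "apple" also matches "pineapple".
    public static List<string> FindPhrases(string transcript, IEnumerable<string> phrases)
    {
        string spoken = transcript.ToLowerInvariant();
        var hits = new List<KeyValuePair<int, string>>();

        foreach (string phrase in phrases)
        {
            if (string.IsNullOrEmpty(phrase))
            {
                continue; // An empty phrase would match at every position.
            }

            string needle = phrase.ToLowerInvariant();
            int index = spoken.IndexOf(needle, StringComparison.Ordinal);
            while (index >= 0)
            {
                // Remember where the phrase occurred so hits can be sorted by position.
                hits.Add(new KeyValuePair<int, string>(index, phrase));
                index = spoken.IndexOf(needle, index + needle.Length, StringComparison.Ordinal);
            }
        }

        return hits.OrderBy(hit => hit.Key).Select(hit => hit.Value).ToList();
    }
}
```

Calling CommandScanner.FindPhrases("Move the pear, move the apple", new[] { "apple", "pear" }) returns "pear" followed by "apple", so both commands from a single utterance can be dispatched in the order the player said them.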