Using SpeechLib from SAPI (the Microsoft text-to-speech API) as an AudioSource

I'm building an app which has a chatbot and uses SAPI for text-to-speech, along with the SALSA asset for lip sync.
What I'm trying to accomplish is to create a live AudioSource that feeds directly from the TTS audio output.
I have successfully accomplished this by saving a WAV file for each sentence and then loading the WAV files at runtime into the GameObject that has the lip sync etc. This works, but the continuous loading of WAV files makes the app slow, makes it freeze each time it does that, and can even crash it.

I know it's possible to make a live AudioSource from a microphone on the computer, so what I want to do is something like that.

I tried what, at my naive level as a programmer, seemed the logical way: simply connect the audio output stream from the TTS as an AudioSource audio clip, like this:

TTSvoice.AudioOutputStream = AudioSource.clip;
and get this error:

error CS0029: Cannot implicitly convert type `UnityEngine.AudioClip' to `SpeechLib.ISpeechBaseStream'

I know that in Python you can connect audio objects from different libraries by using NumPy to convert the audio to a standard raw array. But I'm also fairly new to C# and Unity.

here’s my code:

using UnityEngine;
using System.Collections;
using SpeechLib;

public class controller : MonoBehaviour {

    private SpVoice voice;
    public AudioSource soundvoice;
    private int counter;

    // Use this for initialization
    void Start () {
        voice = new SpVoice();

        GameObject character = GameObject.Find("character");
        soundvoice = character.GetComponent<AudioSource>();

        // This is the line that fails with error CS0029:
        voice.AudioOutputStream = soundvoice.clip;
    }

    // Update is called once per frame
    void Update () {
    }

    IEnumerator talksome() {
        while (true) {
            counter++;
            string sentence = "counting " + counter;
            voice.Speak(sentence, SpeechVoiceSpeakFlags.SVSFlagsAsync);
            yield return new WaitForSeconds(2);
        }
    }
}
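One way around the CS0029 error (a sketch, not tested against this exact project; it is based on the SpeechLib COM interop and Unity's AudioClip API, so treat the format settings, sample rate, and class name as assumptions): instead of assigning an AudioClip to the voice, speak into an SpMemoryStream, pull out the raw PCM bytes, and build an AudioClip from them in memory, skipping the WAV files on disk entirely:

```csharp
// Sketch: route SAPI output into a memory stream, then convert the
// 16-bit mono PCM bytes into a Unity AudioClip. "TtsToClip" and the
// 22 kHz format are assumptions, not something from the thread.
using UnityEngine;
using SpeechLib;

public class TtsToClip : MonoBehaviour
{
    public AudioSource soundvoice;

    public void Speak(string sentence)
    {
        SpVoice voice = new SpVoice();

        // Redirect TTS output into a memory stream instead of the speakers.
        SpMemoryStream mem = new SpMemoryStream();
        mem.Format.Type = SpeechAudioFormatType.SAFT22kHz16BitMono;
        voice.AudioOutputStream = mem;

        voice.Speak(sentence, SpeechVoiceSpeakFlags.SVSFDefault); // blocks until done

        // Pull the raw 16-bit little-endian mono PCM bytes out of the stream.
        byte[] pcm = (byte[])mem.GetData();

        // Convert 16-bit samples to Unity's float range [-1, 1].
        float[] samples = new float[pcm.Length / 2];
        for (int i = 0; i < samples.Length; i++)
        {
            short s = (short)(pcm[2 * i] | (pcm[2 * i + 1] << 8));
            samples[i] = s / 32768f;
        }

        AudioClip clip = AudioClip.Create("tts", samples.Length, 1, 22050, false);
        clip.SetData(samples, 0);
        soundvoice.clip = clip;
        soundvoice.Play(); // SALSA can now lip-sync from this AudioSource
    }
}
```

Since Speak here is synchronous, long sentences will still block the main thread; moving the SAPI call to a background thread and doing only the AudioClip creation on the main thread would avoid the freezes described above.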

Hi, did you find any solution for this, @carlitoselmago?
Thanks, g.

If you're trying to go from microphone to avatar, then you should know that TTS includes viseme callbacks, so lip syncing seems more natural in apps that can pull that information (e.g. via .NET).

However, your live voice doesn't include these callbacks when it is read from an AudioSource such as a microphone. In short, you really can't accurately do what you are asking, unless you have some fairly involved code that captures your voice, runs speech-to-text on it, deciphers the phonemes into viseme callbacks, and then sends the info to your other script that controls those viseme blends.
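For reference, hooking SAPI's viseme events from the SpeechLib interop looks roughly like this (a sketch; the delegate parameters follow the generated _ISpeechVoiceEvents handler, and the handler body is a hypothetical placeholder for whatever drives your blend shapes):

```csharp
// Sketch: subscribe to SAPI viseme events via the SpeechLib interop.
// Printing the viseme id stands in for driving mouth blend shapes.
using SpeechLib;

public class VisemeDemo
{
    public static void Run(string text)
    {
        SpVoice voice = new SpVoice();
        voice.EventInterests = SpeechVoiceEvents.SVEVisemes;
        voice.Viseme += (int stream, object pos, int duration,
                         SpeechVisemeType next, SpeechVisemeFeature feature,
                         SpeechVisemeType current) =>
        {
            // 'current' is one of SAPI's ~22 viseme ids; map it to your
            // character's mouth blend shapes here.
            System.Console.WriteLine(current);
        };
        voice.Speak(text, SpeechVoiceSpeakFlags.SVSFDefault);
    }
}
```

This only fires for synthesized speech, which is exactly the limitation described above: a live microphone signal never passes through SAPI's synthesizer, so no viseme events are generated for it.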

That would be some task, since with what I'm doing there is about half a second of delay to decipher what I have said, look it up in the dictionary, then perform the associated task. And I've got a pretty boss machine as well (all SSD, 8-core, 16 GB, Asus 970, etc.), so it really comes down to how fast you can transpose your voice to text and then to the appropriate blends. Good luck, and if anyone gets this to work in a quick manner, I'm sure we would all appreciate seeing the code! Hope this helps.

// Pardon my geek-speak, as I really don't know all the proper TTS terms, but I think this covers it.