Need help creating audio clip from OpenAI TTS API

I am working on a chat game in Unity and am using OpenAI’s text-to-speech API to convert the character’s output text to speech. I have made the API call and receive the data back in a byte format. I am trying to use this in Unity, but I just hear static noise when I play it, and there are no errors.

Could I get some help with this?

This is my current code for the method:

public IEnumerator ConvertTextToSpeechOpenAI(string input)
{
    var requestBody = new Dictionary<string, string>
    {
        { "model", "tts-1"},
        { "input", input},
        { "voice", "alloy" }
    };

    string json = JsonConvert.SerializeObject(requestBody);

    UnityWebRequest request = new UnityWebRequest("https://api.openai.com/v1/audio/speech", "POST");
    byte[] bodyRaw = Encoding.UTF8.GetBytes(json);
    request.uploadHandler = new UploadHandlerRaw(bodyRaw);
    request.downloadHandler = new DownloadHandlerBuffer();
    request.SetRequestHeader("Content-Type", "application/json");
    request.SetRequestHeader("Authorization", "Bearer " + apiKey);

    yield return request.SendWebRequest();

    if (request.result == UnityWebRequest.Result.ConnectionError || request.result == UnityWebRequest.Result.ProtocolError)
    {
        Debug.LogError("Error: " + request.error);
    }
    else
    {
        byte[] audioData = request.downloadHandler.data;
        float[] samples = new float[audioData.Length / 4];
        Buffer.BlockCopy(audioData, 0, samples, 0, audioData.Length);
        int channels = 1;
        int sampleRate = 24000;

        AudioClip clip = AudioClip.Create("GeneratedSpeech", samples.Length, channels, sampleRate, false);
        clip.SetData(samples, 0);
        audioSource.clip = clip;
        audioSource.Play();
    }
}

By default, OpenAI returns MP3:

https://platform.openai.com/docs/guides/text-to-speech/supported-output-formats

Using this as the raw data in an AudioClip will not work, as that data is expected to be decoded PCM. You can change the response you get back to be PCM, though I don’t know the exact field you have to put in the request.

If you want to use MP3, you could save the result you get as a file and use Unity - Scripting API: Networking.UnityWebRequestMultimedia.GetAudioClip, which will invoke the audio importer and decode the MP3 into a proper AudioClip.
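For reference, a minimal untested sketch of that save-then-load approach (the `Mp3Loader` class, the `PlayMp3Bytes` coroutine, and the temp file name are my own, not from this thread):

```csharp
using System.Collections;
using System.IO;
using UnityEngine;
using UnityEngine.Networking;

public class Mp3Loader : MonoBehaviour
{
    public AudioSource audioSource;

    // Write the raw MP3 bytes to a temp file, then let Unity's audio
    // importer decode them into an AudioClip via GetAudioClip.
    public IEnumerator PlayMp3Bytes(byte[] mp3Data)
    {
        string path = Path.Combine(Application.persistentDataPath, "tts.mp3");
        File.WriteAllBytes(path, mp3Data);

        // The "file://" prefix tells UnityWebRequest to read from local disk.
        using (UnityWebRequest req = UnityWebRequestMultimedia.GetAudioClip("file://" + path, AudioType.MPEG))
        {
            yield return req.SendWebRequest();

            if (req.result != UnityWebRequest.Result.Success)
            {
                Debug.LogError("Error loading clip: " + req.error);
            }
            else
            {
                audioSource.clip = DownloadHandlerAudioClip.GetContent(req);
                audioSource.Play();
            }
        }
    }
}
```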

Hi, thanks for helping. I’m OK with using any file format, so I tried it with MP3 and the UnityWebRequestMultimedia.GetAudioClip method you linked. From what I understand, it is a different type of web request that I have to call, which prepares the audio clip for me.

When I tried running that method, I get this error:
Error: HTTP/1.1 405 Method Not Allowed

This is the method that I created to use it:

public IEnumerator ConvertTTS(string input)
{
    var requestBody = new Dictionary<string, string>
    {
        { "model", "tts-1"},
        { "input", input},
        { "voice", "alloy" }
    };

    string json = JsonConvert.SerializeObject(requestBody);

    UnityWebRequest request = UnityWebRequestMultimedia.GetAudioClip("https://api.openai.com/v1/audio/speech", AudioType.MPEG);
    byte[] bodyRaw = Encoding.UTF8.GetBytes(json);
    request.uploadHandler = new UploadHandlerRaw(bodyRaw);
    request.downloadHandler = new DownloadHandlerBuffer();
    request.SetRequestHeader("Content-Type", "application/json");
    request.SetRequestHeader("Authorization", "Bearer " + apiKey);

    yield return request.SendWebRequest();

    if (request.result == UnityWebRequest.Result.ConnectionError || request.result == UnityWebRequest.Result.ProtocolError)
    {
        Debug.LogError("Error: " + request.error);
    }
    else
    {
        AudioClip clip = DownloadHandlerAudioClip.GetContent(request);
        audioSource.clip = clip;
        audioSource.Play();
    }
}

Do you know how I could fix that error, or does the OpenAI API maybe not allow that type of web request?

You were right when you were using a POST request.
UnityWebRequestMultimedia.GetAudioClip creates a GET request, so it won’t work, and that is the reason for the 405 error.

If you use POST as in the initial code, I think (I haven’t tested this) you can create the AudioClip in the following manner:

...
yield return request.SendWebRequest();

    if (request.result == UnityWebRequest.Result.ConnectionError || request.result == UnityWebRequest.Result.ProtocolError)
    {
        Debug.LogError("Error: " + request.error);
    }
    else
    {
       var bytes = request.downloadHandler.data;
    }

and use the byte array as mentioned in this SO answer:

Another option would be to make the POST request and perform the following modification in the code:

request.downloadHandler = new DownloadHandlerAudioClip(myURL, AudioType.MPEG);

Something along these lines should help solve the problem, in my opinion.
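An untested sketch of that POST variant (reusing `bodyRaw`, `apiKey`, and `audioSource` from the earlier snippets is an assumption on my part):

```csharp
// Same POST setup as before, but swap the download handler so Unity
// decodes the MP3 response into an AudioClip for us.
string url = "https://api.openai.com/v1/audio/speech";
UnityWebRequest request = new UnityWebRequest(url, "POST");
request.uploadHandler = new UploadHandlerRaw(bodyRaw);
request.downloadHandler = new DownloadHandlerAudioClip(url, AudioType.MPEG);
request.SetRequestHeader("Content-Type", "application/json");
request.SetRequestHeader("Authorization", "Bearer " + apiKey);

yield return request.SendWebRequest();

if (request.result == UnityWebRequest.Result.Success)
{
    AudioClip clip = DownloadHandlerAudioClip.GetContent(request);
    audioSource.clip = clip;
    audioSource.Play();
}
else
{
    Debug.LogError("Error: " + request.error);
}
```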

Yes, that follows the way I had it originally. I think I am doing the web request part right, as I am getting sound on output, but it’s just static noise, so I think I’m just not converting the bytes into an AudioClip correctly. Could I get some help with that part?

This is my current code for it (I tried the solution from SO this time):

    public IEnumerator ConvertTextToSpeechOpenAI(string input)
    {
        var requestBody = new Dictionary<string, string>
        {
            { "model", "tts-1"},
            { "input", input},
            { "voice", "alloy" }
        };

        string json = JsonConvert.SerializeObject(requestBody);

        UnityWebRequest request = new UnityWebRequest("https://api.openai.com/v1/audio/speech", "POST");
        byte[] bodyRaw = Encoding.UTF8.GetBytes(json);
        request.uploadHandler = new UploadHandlerRaw(bodyRaw);
        request.downloadHandler = new DownloadHandlerBuffer();
        request.SetRequestHeader("Content-Type", "application/json");
        request.SetRequestHeader("Authorization", "Bearer " + apiKey);

        yield return request.SendWebRequest();

        if (request.result == UnityWebRequest.Result.ConnectionError || request.result == UnityWebRequest.Result.ProtocolError)
        {
            Debug.LogError("Error: " + request.error);
        }
        else
        {
            byte[] audioData = request.downloadHandler.data;
            float[] f = ConvertByteToFloat(audioData);

            AudioClip clip = AudioClip.Create("GeneratedSpeech", f.Length, 1, 24000, false);
            clip.SetData(f, 0);
            audioSource.clip = clip;
            audioSource.Play();
        }
    }

    private float[] ConvertByteToFloat(byte[] array)
    {
        float[] floatArr = new float[array.Length / 4];
        for (int i = 0; i < floatArr.Length; i++)
        {
            if (BitConverter.IsLittleEndian)
                Array.Reverse(array, i * 4, 4);
            floatArr[i] = BitConverter.ToSingle(array, i * 4) / 0x80000000;
        }
        return floatArr;
    }

By using SetData with the MP3 response you are getting, you are trying to play the raw MP3 file. That is encoded data and will just sound like noise. The MP3 must be decoded first, OR you might be able to request a PCM response (which is “ready to go” for SetData) from OpenAI with this in your request body:

            { "response_format", "pcm" }

This PCM response is 24 kHz, 16-bit signed, so make sure you set that sample rate when creating the clip. You will need to convert from 16-bit to float, with something like:

        floatOutput[i] = (float)BitConverter.ToInt16(byteInput, i * 2) / short.MaxValue;

Sadly I cannot test this, as I do not have OpenAI tokens!
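As a fuller sketch of that conversion (the helper name is my own; note the float array must be half the byte length, since each 16-bit sample is two bytes):

```csharp
// Convert 16-bit signed little-endian PCM bytes to floats in [-1, 1].
// Requires "using System;" for BitConverter.
private float[] PcmToFloat(byte[] pcm)
{
    // Two bytes per sample, so the float array is half the byte count.
    float[] samples = new float[pcm.Length / 2];
    for (int i = 0; i < samples.Length; i++)
    {
        samples[i] = (float)BitConverter.ToInt16(pcm, i * 2) / short.MaxValue;
    }
    return samples;
}
```

The resulting array would then go into AudioClip.Create with 1 channel and a 24000 sample rate, matching the PCM format described above.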

For receiving an MP3 result, you will need to lean on GetAudioClip as mentioned before; this will do the decoding. Check out what mapluisch has done with their package (and potentially just use their package!): https://github.com/mapluisch/OpenAI-Text-To-Speech-for-Unity/

Here they get the response and return it as bytes: OpenAI-Text-To-Speech-for-Unity/Assets/Scripts/Core/OpenAIWrapper.cs at main · mapluisch/OpenAI-Text-To-Speech-for-Unity · GitHub

Then they save it to a local file: OpenAI-Text-To-Speech-for-Unity/Assets/Scripts/Core/AudioPlayer.cs at main · mapluisch/OpenAI-Text-To-Speech-for-Unity · GitHub

Then they use GetAudioClip to import that file as an AudioClip, which decodes it: OpenAI-Text-To-Speech-for-Unity/Assets/Scripts/Core/AudioPlayer.cs at main · mapluisch/OpenAI-Text-To-Speech-for-Unity · GitHub

I see. In my use case, I will be converting text to speech very frequently, so would saving to and reading from a local file be expensive to do so many times? (please correct me if I’m wrong :slight_smile: )

I tried it the first way, as I can get PCM from the OpenAI request. I tried to follow your method for converting the 16-bit data to float, but am running into the following error:
ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection.
Parameter name: startIndex

This is the code I’m using:

public IEnumerator ConvertTextToSpeechOpenAI(string input)
{
    var requestBody = new Dictionary<string, string>
    {
        { "model", "tts-1"},
        { "input", input},
        { "voice", "alloy" },
        { "response_format", "pcm" }
    };

    string json = JsonConvert.SerializeObject(requestBody);

    UnityWebRequest request = new UnityWebRequest("https://api.openai.com/v1/audio/speech", "POST");
    byte[] bodyRaw = Encoding.UTF8.GetBytes(json);
    request.uploadHandler = new UploadHandlerRaw(bodyRaw);
    request.downloadHandler = new DownloadHandlerBuffer();
    request.SetRequestHeader("Content-Type", "application/json");
    request.SetRequestHeader("Authorization", "Bearer " + apiKey);

    yield return request.SendWebRequest();

    if (request.result == UnityWebRequest.Result.ConnectionError || request.result == UnityWebRequest.Result.ProtocolError)
    {
        Debug.LogError("Error: " + request.error);
    }
    else
    {
        byte[] audioData = request.downloadHandler.data;
        float[] samples = Convert16BitToFloat(audioData);
        AudioClip clip = AudioClip.Create("ConvertedSpeech", samples.Length, 1, 24000, false);
        clip.SetData(samples, 0);
        audioSource.clip = clip;
        audioSource.Play();
    }
}

private float[] Convert16BitToFloat(byte[] data)
{
    float[] samples = new float[data.Length];

    for (int i  = 0; i < data.Length; i++)
    {
        samples[i] = (float) BitConverter.ToInt16(data, i * 2) / short.MaxValue;
    }

    return samples;
}

Do you know what I may need to change to do the 16-bit to float conversion properly?

I ended up writing a whole new StreamAudioSource component to stream audio correctly with super low latency using OnAudioFilterRead.

Check it out in my OpenAI FOSS library:

There are several sample scenes for Chat, Assistants, and Realtime Conversations.
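For anyone curious, the core of such a component is Unity’s OnAudioFilterRead callback. A minimal sketch, with a buffering scheme and names of my own invention (not the library’s actual implementation):

```csharp
using System.Collections.Generic;
using UnityEngine;

// Feeds queued PCM samples to the audio thread as they arrive,
// so playback can start before the full clip has downloaded.
[RequireComponent(typeof(AudioSource))]
public class SimpleStreamSource : MonoBehaviour
{
    private readonly Queue<float> buffer = new Queue<float>();
    private readonly object bufferLock = new object();

    // Called from the main thread as decoded samples arrive.
    public void Enqueue(float[] samples)
    {
        lock (bufferLock)
        {
            foreach (float s in samples) buffer.Enqueue(s);
        }
    }

    // Called by Unity on the audio thread; fill the output buffer
    // with whatever samples are queued, and silence otherwise.
    private void OnAudioFilterRead(float[] data, int channels)
    {
        lock (bufferLock)
        {
            for (int i = 0; i < data.Length; i++)
            {
                data[i] = buffer.Count > 0 ? buffer.Dequeue() : 0f;
            }
        }
    }
}
```

One caveat: OnAudioFilterRead runs at the system output sample rate (AudioSettings.outputSampleRate), so 24 kHz PCM from the API would need resampling or pitch compensation before being enqueued.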
