I am working on a chat game in Unity, and am using OpenAI’s text-to-speech API to convert the character’s output text to speech. I have made the API call and receive the data back as bytes. I am trying to use this in Unity, but I just hear static noise when I play it, and there are no errors.
Using this as the raw data in an AudioClip will not work, as that data is expected to be decoded PCM. You can change the request so that the response you get back is PCM, though I don’t know exactly what you have to put in the request.
Hi, thanks for helping. I’m OK with using any file format, so I tried it with mp3 and the UnityWebRequestMultimedia.GetAudioClip method that you linked to. From what I understand, it is a different type of web request that prepares the audio clip for me.
When I tried running that method, I got this error:
Error: HTTP/1.1 405 Method Not Allowed
You were right when you were using a POST request.
UnityWebRequestMultimedia.GetAudioClip creates a GET request, so it won’t work here. That is the reason for the 405 error.
If you use POST as in your initial code, I think (I haven’t tested it) you can create the AudioClip in the following manner:
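A minimal sketch of that approach, assuming the endpoint returns mp3 and that the platform can decode it: the POST request is kept, but the download handler is swapped for a DownloadHandlerAudioClip so Unity does the decoding. The class name `SpeechRequestSketch`, the `RequestSpeech` method, and the `apiKey` and `audioSource` fields are hypothetical names used only for illustration, and this has not been run against the API.

```csharp
using System.Collections;
using System.Text;
using UnityEngine;
using UnityEngine.Networking;

// Hypothetical component, for illustration only.
public class SpeechRequestSketch : MonoBehaviour
{
    [SerializeField] private AudioSource audioSource; // assumed to be assigned in the Inspector
    [SerializeField] private string apiKey;           // assumed to hold a valid OpenAI key

    private const string Url = "https://api.openai.com/v1/audio/speech";

    // json is the serialized request body (model, input, voice).
    public IEnumerator RequestSpeech(string json)
    {
        using (UnityWebRequest request = new UnityWebRequest(Url, "POST"))
        {
            request.uploadHandler = new UploadHandlerRaw(Encoding.UTF8.GetBytes(json));
            // Let Unity decode the mp3 response instead of buffering raw bytes.
            request.downloadHandler = new DownloadHandlerAudioClip(Url, AudioType.MPEG);
            request.SetRequestHeader("Content-Type", "application/json");
            request.SetRequestHeader("Authorization", "Bearer " + apiKey);

            yield return request.SendWebRequest();

            if (request.result != UnityWebRequest.Result.Success)
            {
                Debug.LogError("Error: " + request.error);
                yield break;
            }

            // Retrieve the decoded clip from the download handler and play it.
            AudioClip clip = DownloadHandlerAudioClip.GetContent(request);
            audioSource.clip = clip;
            audioSource.Play();
        }
    }
}
```

Whether mp3 decoding works this way can depend on the target platform, so treat this as a starting point rather than a confirmed solution.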
Yes, that follows the way I had it originally. I think I am doing the web request part right, as I am getting sound on output, but it’s all static noise, so I think I’m just not converting the bytes into an audio clip correctly. Could I get some help with that part?
This is my current code for it (tried using the solution from SO this time):
public IEnumerator ConvertTextToSpeechOpenAI(string input)
{
    var requestBody = new Dictionary<string, string>
    {
        { "model", "tts-1" },
        { "input", input },
        { "voice", "alloy" }
    };
    string json = JsonConvert.SerializeObject(requestBody);
    UnityWebRequest request = new UnityWebRequest("https://api.openai.com/v1/audio/speech", "POST");
    byte[] bodyRaw = Encoding.UTF8.GetBytes(json);
    request.uploadHandler = new UploadHandlerRaw(bodyRaw);
    request.downloadHandler = new DownloadHandlerBuffer();
    request.SetRequestHeader("Content-Type", "application/json");
    request.SetRequestHeader("Authorization", "Bearer " + apiKey);
    yield return request.SendWebRequest();
    if (request.result == UnityWebRequest.Result.ConnectionError || request.result == UnityWebRequest.Result.ProtocolError)
    {
        Debug.LogError("Error: " + request.error);
    }
    else
    {
        byte[] audioData = request.downloadHandler.data;
        float[] f = ConvertByteToFloat(audioData);
        AudioClip clip = AudioClip.Create("GeneratedSpeech", f.Length, 1, 24000, false);
        clip.SetData(f, 0);
        audioSource.clip = clip;
        audioSource.Play();
    }
}

private float[] ConvertByteToFloat(byte[] array)
{
    float[] floatArr = new float[array.Length / 4];
    for (int i = 0; i < floatArr.Length; i++)
    {
        if (BitConverter.IsLittleEndian)
            Array.Reverse(array, i * 4, 4);
        floatArr[i] = BitConverter.ToSingle(array, i * 4) / 0x80000000;
    }
    return floatArr;
}
By using SetData with the mp3 response you are getting, you are trying to play the raw mp3 file. This is encoded data and will just sound like noise. The mp3 must be decoded first, OR you might be able to request a PCM response (which is “ready to go” for SetData) from OpenAI by adding this to your request body:
{ "response_format", "pcm" }
This PCM response is 24 kHz, 16-bit signed, so make sure you set that sample rate when creating the clip. You will need to convert from 16-bit to float, with something like:
floatOutput[i] = (float)BitConverter.ToInt16(byteInput, i * 2) / short.MaxValue;
Sadly, I cannot test this, as I do not have OpenAI tokens!
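To make the 16-bit to float step concrete, here is a minimal sketch of a full conversion helper. It assumes the response body is headerless, mono, 16-bit signed little-endian PCM. The class name `PcmConverter` and method name `ToFloatSamples` are hypothetical. It assembles each sample from its two bytes by hand rather than calling BitConverter, so the result does not depend on the machine's byte order.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class PcmConverter
{
    // Converts 16-bit signed little-endian PCM bytes to floats in [-1, 1).
    public static float[] ToFloatSamples(byte[] pcmBytes)
    {
        if (pcmBytes == null) throw new ArgumentNullException(nameof(pcmBytes));

        // Two bytes per sample; a trailing odd byte, if any, is ignored.
        int sampleCount = pcmBytes.Length / 2;
        float[] samples = new float[sampleCount];

        for (int i = 0; i < sampleCount; i++)
        {
            // Low byte first, then high byte (little-endian).
            short value = unchecked((short)(pcmBytes[2 * i] | (pcmBytes[2 * i + 1] << 8)));
            samples[i] = value / 32768f;
        }
        return samples;
    }
}
```

The resulting array has half as many elements as the byte array, and the loop runs over the sample count, not the byte count. Looping up to the byte count instead would index past the end of the input. The output could then be passed to AudioClip.Create with a length of samples.Length, 1 channel, and a frequency of 24000, followed by SetData.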
For receiving an mp3 result, you will need to lean on GetAudioClip as mentioned before, which will do the decoding. Check out what mapluisch has done with their package (and potentially just use their package!): https://github.com/mapluisch/OpenAI-Text-To-Speech-for-Unity/
I see. In my use case, I will be converting text to speech very frequently, so would saving to and reading from a local file that many times be expensive? (Please correct me if I’m wrong.)
I tried it the first way, as I can get PCM from the OpenAI request. I tried to follow your method for converting from 16-bit to float, but I am running into the following error:
ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection.
Parameter name: startIndex