Is there any way to isolate and modify the volume of generated Text-to-Speech independently of other speech (using the Core SDK)? We're experiencing distortion on the generated voice, but I can't work out a way of reducing the generated voice's volume without also affecting voice coming from users.
We don’t currently have independent volume control for TTS, it’s one of several improvements we’d like to add to the feature in the future. At what volume levels are you starting to hear distortion?
Confirming what Nick said, this is something we’d like to add. The only way currently would be to render text-to-speech to a buffer, multiply the audio samples to adjust the volume, then inject that audio yourself or render it yourself. You would unfortunately lose the benefits of the TTS queueing system.
To get the TTS into an audio buffer, use vx_tts_speak_to_buffer.
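The "multiply the audio samples" step above is plain PCM math rather than anything Vivox-specific. A minimal sketch of scaling mono 16-bit samples by a linear gain, with clamping so loud samples don't wrap around (the function name is hypothetical, not part of the SDK):

```c
#include <stdint.h>
#include <stddef.h>

/* Scale 16-bit PCM samples by a linear gain (0.0 = silence, 1.0 = unchanged).
 * Clamps to the int16 range to avoid wrap-around distortion. */
static void scale_pcm16(int16_t *samples, size_t count, float gain)
{
    for (size_t i = 0; i < count; ++i) {
        int32_t v = (int32_t)((float)samples[i] * gain);
        if (v > INT16_MAX) v = INT16_MAX;
        if (v < INT16_MIN) v = INT16_MIN;
        samples[i] = (int16_t)v;
    }
}
```

You would run this over the buffer returned by vx_tts_speak_to_buffer before injecting or rendering it.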
Although we have an audio injection request, it currently only works with WAV files, so instead of using vx_req_sessiongroup_control_audio_injection, you would have better results using the audio callbacks and mixing in the TTS. Look into pf_on_audio_unit_before_capture_audio_sent to mix in audio after capture audio processing, or pf_on_audio_unit_after_capture_audio_read to mix in audio before capture audio processing.
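Whichever callback you use, the mixing itself is just a saturating add of the two sample streams. The callback signature is defined by the Core SDK headers; this sketch only shows the per-sample math (the helper name is hypothetical):

```c
#include <stdint.h>
#include <stddef.h>

/* Mix TTS samples into the capture buffer provided by the callback.
 * Saturating add prevents wrap-around when both signals are loud. */
static void mix_pcm16(int16_t *dst, const int16_t *tts, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        int32_t sum = (int32_t)dst[i] + (int32_t)tts[i];
        if (sum > INT16_MAX) sum = INT16_MAX;
        if (sum < INT16_MIN) sum = INT16_MIN;
        dst[i] = (int16_t)sum;
    }
}
```

Inside the callback you would call this with the callback's buffer as `dst` and the (volume-scaled) TTS buffer as `tts`, advancing your read position through the TTS buffer each invocation.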
Thanks for that, before I dig in - is the raw audio data from the vx_tts_speak_to_buffer in the correct format to replace the buffer in the pf_on_audio_unit… callbacks (save for factoring in the sample rates)?
Yup, it's mono 16-bit audio samples. If the sample rates match, you should be able to copy it straight into the buffer that the pf_on_audio_unit... callbacks provide. If you want to keep the microphone data that was captured, you will have to mix the TTS audio in with the pre-existing audio already in that buffer.
If the sample rates don’t match then you will have to resample the TTS audio before mixing it in. EDIT: and as I recall, that is almost always the case. TTS audio is probably 16 kHz while our default settings for capture and render are 48 kHz.
Thanks for the responses -
I'm running into a bit of a problem with the suggested workaround: using pf_on_audio_unit_after_capture_audio_read to inject the PCM data from vx_tts_speak_to_buffer seems to work fine for the audio other players receive, but not for audio that would have been heard locally (using tts_dest_screen_reader).
I still get the callback allowing me to inject the buffered audio but there is no audible output. If I use
vx_tts_speak(TTSManager, VoiceId, textToVoice, tts_dest_screen_reader, &utteranceID); and then modify the output in pf_on_audio_unit_after_capture_audio_read, I can hear the effect.
It’s almost as though somewhere there is a check for silence and the audio is being ‘lost’.
The debug output suggests that Vivox voice activity is happening (non-zero 'energy' values).