Multiplayer Voice Chat

Assuming I already have Unity Pro, what would be the best approach to adding voice chat to my multiplayer game?

For standalone builds, write a plugin that interfaces with an existing voice chat solution such as TeamSpeak or Ventrilo, and run a corresponding server.

For the web player, the only option is to handle it with Flash or Java; you can’t do it in Unity.

I don’t think Ventrilo is really an option, as it’s somewhat of a black-box server technology. Mumble, the open-source option, is similar to Ventrilo.

However, TeamSpeak has an SDK tailored for game integration.

They all use Speex as a codec, which sounds fine at higher bitrates. Search Google for that and another product will likely pop up for you.

TeamSpeak has a bad rap because server owners are stingy and use low quality settings and bitrates to save on bandwidth costs. A reaction to those 200-user corp servers in Eve, I guess.

If anybody’s interested, I have a voice chat solution for Unity that uses the Speex codec Quietus mentioned and is fully integrated into Unity (fully managed as well, which means it should work on pretty much all platforms, even Indie builds).
Check my sig for a link.

Hi PhobicGunner,
I’m working on a project that needs voice chat. Can you please tell me how you did it? I tried one method before, but the voice quality was too poor.

Thank you so much

There are a lot of components working together here, so I’ll try to give a general overview.

The biggest piece of the puzzle is audio playback. Just playing audio bits as they arrive naively will result in highly glitchy audio. You can clump several clips together, but that will increase latency as it has to wait for some number of samples to arrive.
The best method I’ve found, and am currently using, is to have one audio clip for playback, which is essentially a circular buffer of sorts. As you receive audio, you write it to this audio clip (whenever write hits the end of the clip, it wraps around and starts writing to the beginning). This audio clip is played on a loop, resulting in glitch-free audio. You have to keep track of the play position, because if it ever exceeds the actual audio data received, you have to stop playing and wait for more audio data to arrive.
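
To make the circular-buffer idea concrete, here is a minimal sketch in plain C#. It is not the actual implementation described above: VoiceRingBuffer and its members are hypothetical names, and the read position is advanced by hand. In Unity the array would presumably be the looping AudioClip (written with AudioClip.SetData), and the play position would come from the AudioSource.

```csharp
using System;

// Hypothetical helper illustrating a looping sample buffer with underrun detection.
public class VoiceRingBuffer
{
    private readonly float[] samples;
    private long totalWritten; // samples received so far
    private long totalPlayed;  // samples consumed by playback so far

    public VoiceRingBuffer(int capacity)
    {
        samples = new float[capacity];
    }

    // How much received audio has not been played yet.
    public long Buffered
    {
        get { return totalWritten - totalPlayed; }
    }

    // Write incoming audio, wrapping to the start when the end is reached.
    public void Write(float[] data)
    {
        foreach (float s in data)
        {
            samples[(int)(totalWritten % samples.Length)] = s;
            totalWritten++;
        }
        // If the writer lapped the reader, the oldest samples are gone; skip them.
        if (Buffered > samples.Length)
        {
            totalPlayed = totalWritten - samples.Length;
        }
    }

    // Copy up to 'count' samples into 'output'. Returns how many were real audio;
    // the remainder is filled with silence (the "stop and wait for more data" case).
    public int Read(float[] output, int count)
    {
        int available = (int)Math.Min((long)count, Buffered);
        for (int i = 0; i < available; i++)
        {
            output[i] = samples[(int)(totalPlayed % samples.Length)];
            totalPlayed++;
        }
        for (int i = available; i < count; i++)
        {
            output[i] = 0f;
        }
        return available;
    }
}
```

When Read returns fewer samples than requested, playback has caught up with the received data, which is the point where you would pause the AudioSource until more audio arrives.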

The next piece of the puzzle is recording and encoding audio data. Every frame I get the audio data in between the last read position, and the current read position. Here’s the important part when working with Speex! Speex needs audio in specific multiples of samples, otherwise it will break (with Narrow mode, it needs multiples of 320, with Wide it needs multiples of 640, and with UltraWide it needs multiples of 1280). So I go through the newly recorded samples, splitting them into chunks of appropriate size as required by Speex. These chunks are then buffered, later fed through the Speex codec, and finally sent over the network.
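
The chunking step could look roughly like the sketch below. FrameChunker is a hypothetical name, and the chunk size is passed in by the caller (640 for Wide mode, going by the numbers above) rather than read from the codec. Leftover samples that do not fill a whole chunk are kept for the next frame instead of being dropped.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits a stream of samples into fixed-size chunks.
public class FrameChunker
{
    private readonly int frameSize;
    private readonly List<short> pending = new List<short>();

    public FrameChunker(int frameSize)
    {
        this.frameSize = frameSize;
    }

    // Samples waiting for enough data to complete the next chunk.
    public int PendingCount
    {
        get { return pending.Count; }
    }

    // Add newly recorded samples and return every complete chunk now available.
    public List<short[]> Push(short[] newSamples)
    {
        pending.AddRange(newSamples);

        var frames = new List<short[]>();
        int offset = 0;
        while (pending.Count - offset >= frameSize)
        {
            short[] frame = new short[frameSize];
            pending.CopyTo(offset, frame, 0, frameSize);
            frames.Add(frame);
            offset += frameSize;
        }

        // Keep only the incomplete tail for the next call.
        pending.RemoveRange(0, offset);
        return frames;
    }
}
```

Each returned array has exactly the requested size, so it can be handed to the encoder as-is.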

Thanks a lot. Basically, I think I did the same thing as you, except that I didn’t do the encoding and decoding. You mentioned that playing the audio clip on a loop results in glitch-free audio. I play it on a loop too, but whenever I start to talk, echo accumulates. If I speak a long sentence, the other side can’t hear me clearly after two seconds because of the accumulated echo.
Here is my code:

    public class VoiceStream : uLink.MonoBehaviour {

        int pos = 0;
        int lastSample = 0;
        int diff = 0;
        int FREQUENCY = 30000;
        bool recordPressed = false;
        int minFreq;
        int maxFreq;
        AudioSource ChatAudioSource;
        AudioClip tempClip; // this clip stores the realtime data from microphone
        bool isChatting = false;

        // Use this for initialization
        void Start () {
            Microphone.GetDeviceCaps(null, out minFreq, out maxFreq);
            if(minFreq == 0 && maxFreq == 0)
            {
                maxFreq = FREQUENCY;
            }
        }

        void OnGUI()
        {
            if(GUI.Button(new Rect(10,10,100,20),"start chatting"))
            {
                isChatting = true;
                if(!recordPressed)
                {
                    recordPressed = !recordPressed;
                    if(uLink.Network.peerType == uLink.NetworkPeerType.Server)
                    {
                        tempClip = Microphone.Start(null,true,5,maxFreq);
                    }
                    else
                    {
                        tempClip = AudioClip.Create("ClientTest",FREQUENCY,1,FREQUENCY,false,false);
                        audio.clip = tempClip;
                        audio.Play();
                    }
                }
            }
            if(GUI.Button(new Rect(10,40,100,20),"stop chatting"))
            {
                isChatting = false;
                if(recordPressed)
                {
                    recordPressed = !recordPressed;
                    Microphone.End(null);
                }
            }

            if(GUI.Button(new Rect(10,70,100,20),"play"))
            {
                audio.PlayOneShot(tempClip,1.0f);
            }
        }

        void uLink_OnSerializeNetworkView(uLink.BitStream stream, uLink.NetworkMessageInfo info)
        {
            if(isChatting)
            {
                if(stream.isWriting)
                {
                    pos = Microphone.GetPosition(null);
                    if(pos < lastSample)
                    {
                        lastSample = 0;
                    }
                    diff = pos-lastSample;
                    float[] samples = new float[diff*tempClip.channels];
                    tempClip.GetData(samples,lastSample);
                    stream.Write(samples.Length);
                    stream.Write<float[]>(samples);
                    lastSample = pos;
                }
                else
                {
                    Debug.LogWarning("I'm reading");
                    int len = stream.Read<int>();
                    float[] samples = new float[len];
                    samples = stream.Read<float[]>();
                    tempClip.SetData(samples,0);
                    audio.PlayOneShot(tempClip,1.0f);
                    pos = pos + len;
                }
            }
        }

    }

Do you see any problems here?
And if it’s possible, can you please give me more details about how to do Speex?

Sorry that I have so many questions.
Thanks a lot.

I’m willing to bet the audio.PlayOneShot(tempClip, 1.0f) call in your read branch is the problem. You’re doing two things: creating a looping clip and playing that, AND playing it one-shot every time you receive audio data. That’s probably what results in the echo.
Speex encoding is actually really easy. Just grab the NSpeex library, set it to target .NET 3.5, and rip out all of the bits that cause errors (I think there are some attributes in there which don’t exist in .NET 3.5; I just ripped those out and it works fine). Then you can pretty much just do something like this (in this case I’m assuming the Wide band mode):

    private NSpeex.SpeexEncoder m_wide_enc = new NSpeex.SpeexEncoder( NSpeex.BandMode.Wide );

    ...

    byte[] encoded = new byte[ 640 ];
    int length = m_wide_enc.Encode( input, 0, input.Length, encoded, 0, encoded.Length );
    // where 'input' is an array of shorts, each short is one 16-bit sample

And decoding:

    private NSpeex.SpeexDecoder m_wide_dec = new NSpeex.SpeexDecoder( NSpeex.BandMode.Wide );

    ...

    short[] decoded = new short[ 640 ];
    // dataLength is the length returned by the encoder (you'll want to pack this into your data structure somehow)
    m_wide_dec.Decode( inputBytes, 0, dataLength, decoded, 0, false );
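
One possible way to pack the encoded length alongside the data is a simple length-prefixed packet. This is only a sketch: VoicePacket and the 2-byte little-endian header are assumptions made for illustration, not part of NSpeex or of any networking library. It also copies only the bytes the encoder actually produced, so the unused tail of the output buffer is not sent.

```csharp
using System;

// Hypothetical packet layout: [2-byte little-endian length][encoded bytes].
public static class VoicePacket
{
    public static byte[] Pack(byte[] encoded, int dataLength)
    {
        if (dataLength < 0 || dataLength > ushort.MaxValue || dataLength > encoded.Length)
        {
            throw new ArgumentOutOfRangeException("dataLength");
        }

        byte[] packet = new byte[2 + dataLength];
        packet[0] = (byte)(dataLength & 0xFF);
        packet[1] = (byte)((dataLength >> 8) & 0xFF);
        // Only the first dataLength bytes of the encoder output are valid.
        Buffer.BlockCopy(encoded, 0, packet, 2, dataLength);
        return packet;
    }

    public static byte[] Unpack(byte[] packet, out int dataLength)
    {
        dataLength = packet[0] | (packet[1] << 8);
        byte[] encoded = new byte[dataLength];
        Buffer.BlockCopy(packet, 2, encoded, 0, dataLength);
        return encoded;
    }
}
```

On the receiving side, the unpacked array and dataLength are what you would pass to the decoder.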

Hi PhobicGunner,
Yes, your suggestion was right. I deleted one of the “play” calls, and the voice is better now.
But I still have one question about NSpeex. It encodes and decodes short[] data, but the audio samples are actually float[]. How do you handle the conversion between them?
The example in the documentation looks like this:

    short[] data = new short[e.BytesRecorded / 2];
    Buffer.BlockCopy(e.Buffer, 0, data, 0, e.BytesRecorded);

I don’t see how this helps with the conversion.
Could you give me a hint?

Thanks again

You basically have to convert each sample. In Unity, it’s a float per sample, in the range [-1, 1]. But Speex expects a short per sample, i.e. signed 16-bit PCM in the range [short.MinValue, short.MaxValue].
You’ll have to perform some basic math on each sample. Off the top of my head:

    // assuming 'sample' is a float value between -1 .. 1
    sample = Mathf.Clamp( sample, -1f, 1f ); // guard against out-of-range input

    short val = (short)Mathf.RoundToInt( sample * short.MaxValue ); // now it's in the range -32767 .. 32767

Going the other way after decoding, divide each short by short.MaxValue to get back to a float.
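
For whole buffers, the conversion in both directions could be sketched as below. PcmConvert is a hypothetical helper name, and the sketch assumes the codec works on ordinary signed 16-bit PCM (the usual convention for short[] audio), so the float range -1 .. 1 is scaled symmetrically around zero. It uses System.Math rather than Mathf so it also compiles outside Unity.

```csharp
using System;

// Hypothetical helper for converting between Unity-style floats and 16-bit PCM.
public static class PcmConvert
{
    // float in -1 .. 1  ->  short in -32767 .. 32767 (out-of-range input is clamped)
    public static short[] FloatToShort(float[] input)
    {
        short[] output = new short[input.Length];
        for (int i = 0; i < input.Length; i++)
        {
            float clamped = Math.Max(-1f, Math.Min(1f, input[i]));
            output[i] = (short)Math.Round(clamped * short.MaxValue);
        }
        return output;
    }

    // short  ->  float in -1 .. 1 (short.MinValue is clamped to -1)
    public static float[] ShortToFloat(short[] input)
    {
        float[] output = new float[input.Length];
        for (int i = 0; i < input.Length; i++)
        {
            output[i] = Math.Max(-1f, input[i] / (float)short.MaxValue);
        }
        return output;
    }
}
```

The Buffer.BlockCopy example from the documentation only reinterprets a byte[] that already contains 16-bit samples; it does not rescale floats, which is why a per-sample conversion like this is needed for Unity's float data.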

For anyone who still needs to compress or encode AudioClip data (for example, to save or send a microphone recording), I put together some working code thanks to this thread.

Note: it is not elegant code, but it is a starting point.

    private void TestAudioClipCompression(AudioClip _audioClip)
    {
        int recordedFrequency = _audioClip.frequency;

        int dataLength;
        byte[] encoded = EncodeToSpeex(_audioClip, out dataLength);

        // File.WriteAllBytes("C:/WavTest/encondedWide.spx", encoded); // TODO: also store dataLength and recordedFrequency

        AudioClip a = DecodeFromSpeex(encoded, dataLength, recordedFrequency);

        AudioSource source = GetComponent<AudioSource>();
        source.clip = a;
        source.Play();
    }

    private AudioClip DecodeFromSpeex(byte[] encoded, int dataLength, int frequency)
    {
        NSpeex.SpeexDecoder m_wide_dec = new NSpeex.SpeexDecoder(NSpeex.BandMode.Wide);

        // EncodeToSpeex sizes its output buffer to the number of input samples,
        // so encoded.Length doubles as the decoded sample count here.
        short[] decoded = new short[encoded.Length];
        m_wide_dec.Decode(encoded, 0, dataLength, decoded, 0, false);

        // Convert signed 16-bit samples back to floats in the range -1 .. 1.
        float[] result = new float[decoded.Length];
        for (int t = 0; t < decoded.Length; t++)
        {
            result[t] = decoded[t] / (float) short.MaxValue;
        }

        Debug.Log("recordedFrequency " + frequency);
        AudioClip a = AudioClip.Create("VoiceChat", result.Length, 1, frequency, false);

        // Set data
        a.SetData(result, 0);
        return a;
    }

    private byte[] EncodeToSpeex(AudioClip _audioClip, out int dataLength)
    {
        // Speex encodes mono audio; this assumes a single-channel clip.
        float[] samplesFloat = new float[_audioClip.samples * _audioClip.channels];
        _audioClip.GetData(samplesFloat, 0);

        // Largest multiple of 640 that fits in the clip; any trailing samples are dropped.
        int sizeChunkNorris = 640 * (samplesFloat.Length / 640);

        // Convert the samples that will be encoded from floats (-1 .. 1) to signed 16-bit.
        short[] inputPartChunk = new short[sizeChunkNorris];
        for (int i = 0; i < sizeChunkNorris; i++)
        {
            float sample = Mathf.Clamp(samplesFloat[i], -1f, 1f);
            inputPartChunk[i] = (short) Mathf.RoundToInt(sample * short.MaxValue);
        }

        NSpeex.SpeexEncoder m_wide_enc = new NSpeex.SpeexEncoder(NSpeex.BandMode.Wide);

        // The output buffer is oversized; only the first dataLength bytes are valid.
        byte[] encoded = new byte[sizeChunkNorris];
        dataLength = m_wide_enc.Encode(inputPartChunk, 0, inputPartChunk.Length, encoded, 0, encoded.Length);
        return encoded;
    }

Best,

Sebastian V.