What is the point of Vivox vs. WebRTC? (for purely audio comms)

I am trying to figure out the best/cheapest/simplest technology for multiplayer voice communication.

As I understand it, Unity WebRTC allows for direct peer-to-peer audio connections. Once connections are made, one could presumably scale the volume of connected players’ audio, or mute/adjust them, via script in Unity based on proximity or other gameplay factors.
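To illustrate the proximity idea, here is a minimal Python sketch (not Unity code) of a distance-based gain. The function name `proximity_gain`, both radii, and the linear falloff are assumptions made up for this example; a real game would pick its own curve.

```python
import math


def proximity_gain(listener, speaker, full_volume_radius=2.0, max_radius=30.0):
    """Return a gain in [0, 1] for `speaker` as heard by `listener`.

    Positions are (x, y, z) tuples. Both radii are hypothetical defaults.
    """
    distance = math.dist(listener, speaker)
    if distance <= full_volume_radius:
        return 1.0  # close enough to be heard at full volume
    if distance >= max_radius:
        return 0.0  # out of earshot, effectively muted
    # Linear falloff between the two radii (an assumed curve).
    span = max_radius - full_volume_radius
    return 1.0 - (distance - full_volume_radius) / span
```

The returned value would be applied as the volume of that player's audio source on each update.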

So if the only need is to get an audio connection between your users, what is the point of something like Vivox?

Specifically on a tech level for audio chat what is Vivox doing? Is it just facilitating peer to peer WebRTC connections just like I described? Or is it sending all the audio to a single server and the server is mixing the individual streams per player (or forwarding all the individual streams again to each player)?

From a cost perspective, peer-to-peer audio would also save server costs. If Vivox is routing all the audio streams through an intermediary server, why? Why pay data charges for sending audio to an intermediary server if one does not need to?

Thanks for any clarity.

Hi mikejm_,

Thanks for your question; I hope my answers here will lead to some good discussion. I’ll respond in two parts, first offering my perspective on “what is the point of something like Vivox?,” followed by a more technical post laying out the advantages versus a peer-to-peer voice implementation (which Vivox isn’t). Full disclosure: I’ve been working with Vivox for 9 years, and this first bit is pretty fluffy but also absolutely genuine.

The Vivox SDK provides a battle tested and scalable solution for voice chat across all major gaming platforms, everywhere in the world. It abstracts the network/signaling/media concerns from the developer and offers one simple API to integrate cross-platform voice and text chat between players on desktop, mobile, consoles, and many VR devices. Not only does it “just work” on all these platforms allowing players to talk together, but the audio experience, bandwidth, and CPU/memory resources are strategically optimized for each of these platforms and devices without developers needing to deal with differences in hardware, or different types of audio device (wired, wireless, mono, stereo, etc.).

Ultimately, voice chat is not the focus of game studios and game developers, and it consumes critical design and engineering resources: time and budget that could be better spent on your game. For Vivox, online communications is the focus, and for nearly two decades Vivox has been working to deliver the lowest latency and lowest packet loss to every geographic region, with high quality codecs, small footprint, efficient compute, and operations at any scale. Beyond the core service (which itself isn’t just voice delivery, but management of audio device selection, volume, muting, blocking, channel and participant management, media file injection, mic test, etc.), Vivox offers a complementary suite of moderation tools to combat toxicity, such as server-side recording and the industry-leading Safe Voice, along with continuous investment to ensure regulatory compliance with accessibility, privacy, and security laws around the globe.

And of course if you did have broader communication needs (or plan to expand at any point), Vivox offers text chat with history, profanity filtering, direct messaging, etc., along with text-to-speech, speech-to-text, and other value-added services in the same SDK.

For all this, Unity is a trusted partner and voice chat provider for many of the industry’s biggest studios and biggest online games. Yet Vivox provides this same world class service whether you have 10 million+ concurrent users or just 10, whether you have 2 or 2000 players talking together in a single voice channel. And to top it off, if your usage peak is under 5000 users signed into the service at once, the core offering is completely free. To have that many players playing your game not just in a month or in a day, but actually at the same moment in time represents a lot of units sold. Like, your game has to be awfully successful before getting charged, particularly on an ongoing basis rather than just once for a peak during a launch event when interest is also peak.

So if you’re small, it’s like, why not? It’s free and less hassle than rolling your own solution. And if you’re big, same thing: not free, but far quicker and cheaper than rolling your own solution, and instantly scalable. Either way you get to use the same service relied on to power voice in many of gaming’s biggest multiplayer titles, and the benefit of fully staffed Operational, Engineering, and Support teams to run and improve it year-round.


Alright, so this comment is a lot more technical and gets into the weeds on peer-to-peer networking, and the challenges with using it for voice chat.

Even if a homegrown solution based on P2P WebRTC is made to work initially, it is not a scalable solution. As more and more users join, the processing overhead will punish the user experience. That processing overhead needs to be offloaded somewhere, preferably to an intermediate server.

Without a relay server, hole punching or port forwarding is required to achieve peer-to-peer transmission. When the number of clients that should all be able to hear each other grows, either every peer will have to connect to every other peer, or otherwise at least one peer will have to adopt a “router” role. Hole punching is unreliable (and often requires a third-party server anyway) and port forwarding just for voice chat is a hard ask from players these days (more on this below). Also, peers that act as routers need to be trusted to not modify the voice traffic that will be retransmitted. Vivox servers can be trusted, don’t require hole punching or port forwarding, and they live near the backbone of the internet, unlike peers which will be at the ends of the internet.
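To put rough numbers on the "every peer connects to every other peer" case, here is a small Python sketch. The function names are made up for illustration; the arithmetic is just the standard full-mesh link count.

```python
def full_mesh_links(players):
    """Number of peer-to-peer links needed so everyone can hear everyone."""
    return players * (players - 1) // 2


def uploads_per_player(players, via_server):
    """Copies of their own voice each player must send.

    In a full mesh a player uploads one copy per other peer; with an
    intermediate server they upload a single copy.
    """
    return 1 if via_server else players - 1


for n in (2, 5, 10, 30):
    print(n, full_mesh_links(n), uploads_per_player(n, via_server=False))
```

The link count grows quadratically with the number of players, while each player's upload cost grows linearly, which is the overhead a server takes off the clients.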

The end user is most likely behind a NAT. A NAT lets incoming traffic reach an endpoint behind it only if its translation table contains a matching entry. This is for security, as it prevents someone on the internet from figuring out what is behind the NAT by scanning ports. Port forwarding is a technique where you forcibly make an entry in the NAT table so that a foreign entity can connect to a service behind the NAT. Hole punching aims for the same result, an entry in the NAT table, without manual configuration. Port forwarding is difficult to set up manually because oftentimes the end user port is randomized, and a host user would then need to create an entry that maps correctly from a local port to the right end user port. In many cases there are a lot of manual steps for players to go through (which could often include troubleshooting end-to-end connectivity outside the game, like over Discord or something) just to use voice chat. On top of that, some home routers do not expose this option via their web UI.
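The NAT behavior described above can be modeled with a toy Python class. This is a deliberately simplified sketch: `ToyNat` and its methods are hypothetical names, and real NATs also filter by remote address, expire entries, and usually pick public ports less predictably than the counter used here.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyNat:
    """Toy NAT: inbound packets pass only if a table entry exists."""

    table: dict = field(default_factory=dict)  # public port -> (host, port)
    _next_port: count = field(default_factory=lambda: count(50000))

    def outbound(self, host, port):
        """An outgoing packet creates an entry; returns the public port used."""
        public_port = next(self._next_port)
        self.table[public_port] = (host, port)
        return public_port

    def forward_port(self, public_port, host, port):
        """Manual port forwarding: the user adds the entry themselves."""
        self.table[public_port] = (host, port)

    def inbound(self, public_port):
        """Return the internal endpoint for an incoming packet, or None if dropped."""
        return self.table.get(public_port)


nat = ToyNat()
print(nat.inbound(40000))  # None: no entry, so the packet is dropped
nat.forward_port(40000, "192.168.1.20", 7000)
print(nat.inbound(40000))  # ('192.168.1.20', 7000)
```

Hole punching relies on the `outbound` path: both peers send first so that an entry exists by the time the other side's packets arrive.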

I say that port forwarding being set up by players is a hard ask because the ideal scenario is already a lot to go through. Think about it this way: ideally, the player is an adult, on a PC or with quick access to a PC, familiar with their home networking equipment, and capable of translating a generic instruction to open port # NNNNN into steps for forwarding a port to their specific home router (or other local firewall). They access their firewall’s admin webpage, add the port forwarding entry, and perhaps restart their router. Voice chat then starts to work!

Most of my friends have trouble finding out how to access their router’s admin page after their first time setting it up. Kids or those without permission to modify the home network’s firewall will have to speak to their admins (parents). If a player is on mobile or a game console without a PC at hand, then a router admin page will be difficult to use on those devices or even sometimes unsupported.

With that level of resistance, a purely peer-to-peer voice chat solution relying on port forwarding alone would effectively be disabled by default, working only when players encounter another player who has successfully forwarded the (right) ports. Consequently, the voice chat experience in such a game would get worse the younger, more mobile/console-oriented, and less tech-savvy the game’s demographic is, and even more so if you intend to matchmake random players who don’t know each other outside the game.

It’s worth pointing out as well that many of these challenges setting up peer-to-peer connections not only make peer-to-peer only voice chat difficult, but can also apply to more general game networking. And many of the same answers apply. Central servers can be trusted (help prevent cheating), are placed in better distributed and commercial quality infrastructure with known hardware capabilities (more consistent and lower latency, etc.), don’t require network knowledge and hackery to achieve connections, and so on!

It’s not that P2P doesn’t have its place in networking at all necessarily, but hopefully this explains some of the advantages and good reasons to consider a central server approach (like Vivox uses), especially as a game scales up.


I’d say my main takeaway is this: irrespective of the technology backing it, one big reason developers choose Vivox instead of a custom voice chat solution is the same reason they choose to use a game engine like Unity instead of programming their own lighting, and physics, and method of efficiently translating model geometry and special effects data in a scene into pixels to draw on the screen 60 times a second. These are already solved problems, and more likely than not, they’re not the problems you’re personally interested in reinventing solutions for.

Choosing an off-the-shelf solution for voice chat, or friends, or leaderboards, or matchmaking makes sense because those things are probably not what inspires you, or makes you passionate about making a game in the first place, nearly so much as things like music, characters, storytelling, or combat design unique to your game. Using Vivox lets you focus on your Super Cool Idea™ while still getting to have quality voice chat in your game without the busy work of actually doing it.

Hope this helps :smile:


Thanks for both your very detailed answers. Based on the cost here Vivox also seems quite reasonable: https://discussions.unity.com/t/895158

I have learned a lot. I would like to understand the technology a bit further. Please correct me, fill in details you think I might benefit from, or clarify how Vivox compares to anything I am describing.

In general, when a WebRTC connection is set up for P2P networking, it can be done through STUN or TURN.

I presume that when you refer to troubles with NATs and hole punching, you are referring to STUN P2P connections, and that manual port forwarding is what would need to be set up in cases where STUN fails. I see people say “STUN doesn’t work with all NAT/firewall setups,” for example, so low-budget setups usually try STUN first and fall back to TURN if it fails.
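The "STUN first, then TURN" ordering falls out of how ICE ranks candidates. Below is a minimal Python sketch using the priority formula and recommended type preferences from RFC 8445; `pick_candidate` is a hypothetical helper and a big simplification, since real ICE runs connectivity checks on candidate pairs rather than picking a single candidate.

```python
# Recommended type preferences from RFC 8445: direct (host) candidates are
# preferred, then reflexive ones (found via STUN), and relayed (TURN) last.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(kind, local_preference=65535, component_id=1):
    """ICE candidate priority as defined in RFC 8445."""
    return (
        (2**24) * TYPE_PREFERENCE[kind]
        + (2**8) * local_preference
        + (256 - component_id)
    )


def pick_candidate(reachable_kinds):
    """Hypothetical helper: best candidate type among those that worked."""
    return max(reachable_kinds, key=candidate_priority, default=None)


print(pick_candidate(["relay", "srflx"]))  # srflx: STUN worked, TURN not needed
print(pick_candidate(["relay"]))           # relay: fall back to TURN
```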

TURN, by contrast, should not require any special port setup or trouble. I believe it also hides your users’ IPs from one another, which is valuable in a different respect. But one must then set up and maintain a TURN server and handle the bandwidth passing through it (so it is not much better than running a media server).

Is TURN generally also only for 1:1 connections, like P2P? So a TURN solution with 5 players would involve each player connecting to the 4 other players P2P and routing the data through the TURN server(s)? This obviously comes with scalability limitations and impracticality.

I see here: https://stackoverflow.com/questions/61287054/understanding-sfus-turn-servers-in-webrtc people discussing that even with an SFU media server, one should still have TURN capacity. Is this true or offered by your system?

Either way, for full functionality in a DIY setup, one must build a WebRTC media server, such as one based on Jitsi or Janus. Then one must set up numerous instances (e.g. Kubernetes/Docker) for scalability and manage the scaling and connectivity, plus write custom code to connect it to the Unity game. Or one can use Vivox, which will do all this in a way that integrates with Unity cleanly and easily.

With WebRTC media servers, I understand almost everyone uses an SFU (where you get one separate stream from every player in range) because, although it costs more bandwidth, it saves on the extreme server cost of mixing streams as an MCU does (where you would only get one stream). I always thought an MCU would be best, but from what I hear the server latency/cost/processing is too heavy.
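To compare the options being discussed, here is a rough Python sketch that counts audio streams for a TURN-relayed mesh, an SFU, and an MCU. The `streams` function and its dictionary keys are made up for illustration, and it assumes everyone is in range of everyone and that an MCU produces one mix per listener.

```python
def streams(topology, players):
    """Count audio streams per client and at the server for one voice channel."""
    n = players
    if topology == "turn_mesh":
        # Every peer sends a copy to every other peer, all relayed via TURN.
        return {"client_up": n - 1, "client_down": n - 1,
                "server_in": n * (n - 1), "server_out": n * (n - 1),
                "server_mixes": 0}
    if topology == "sfu":
        # One upload each; the server forwards every stream to every other client.
        return {"client_up": 1, "client_down": n - 1,
                "server_in": n, "server_out": n * (n - 1),
                "server_mixes": 0}
    if topology == "mcu":
        # One upload and one mixed download each; the server does the mixing.
        return {"client_up": 1, "client_down": 1,
                "server_in": n, "server_out": n,
                "server_mixes": n}
    raise ValueError(f"unknown topology: {topology}")


for topology in ("turn_mesh", "sfu", "mcu"):
    print(topology, streams(topology, 5))
```

The trade-off is visible in the last two rows: the SFU pushes cost into outgoing bandwidth, while the MCU trades that for decoding, mixing, and re-encoding work on the server.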

If I am understanding Vivox so far, does Vivox offer MCU or SFU or both? In cases of SFU, when a player is out of audio range, but still in the game, does their stream just stop to preserve bandwidth?
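To make the second question concrete, here is a hypothetical Python sketch of the kind of range filter an SFU-style server could apply before forwarding streams. It is not a description of how Vivox works; the function name, the `positions` mapping, and the 30-unit range are all assumptions.

```python
import math


def streams_to_forward(listener_id, positions, audible_range=30.0):
    """Return the IDs of players whose audio should be sent to `listener_id`.

    `positions` maps player ID -> (x, y, z). Players beyond `audible_range`
    are skipped, so no bandwidth is spent on streams that would be silent.
    """
    here = positions[listener_id]
    return sorted(
        player_id
        for player_id, position in positions.items()
        if player_id != listener_id and math.dist(here, position) <= audible_range
    )


positions = {"ann": (0, 0, 0), "bob": (10, 0, 0), "cyd": (100, 0, 0)}
print(streams_to_forward("ann", positions))  # ['bob']
```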

I am very impressed by the safety features you are working on. I once imagined running a voice-to-text library on my players’ audio streams to try to confirm misbehavior, but obviously this was absurdly impractical. It is a challenging thing to fix.

Lastly: does Vivox offer any video chat options? In addition to my main game where ~20-30 people per game will need proximity chat functions I would also like to allow people to do one on one video calls in Unity.

I can likely solve one-to-one video chatting separately using my own STUN/TURN solution (or Agora, Twilio, or Azure Communication Services), but getting everything through Vivox might be preferable and would give me access to similar abuse protections, including any new ones you develop over time.

Thanks again for your help. I am learning. I hope I am making sense and understanding better what Vivox does. I appreciate any further clarification/guidance. :slight_smile: