WebGL / Linux server crashing after a while

Hi, I need help with my app.
So I’m using NGO, with Unity Transport.
I have a Linux server hosted on OVH running the Unity server instance.
The client is served from the same machine: a Node.js server hosts the WebGL build and a database.
The WebGL clients and the Unity server communicate over WSS via Unity Transport.

However, after a few players join, the server seems to crash.
After a while, new connections are stuck on the loading screen.
Then, a few minutes later, everybody is disconnected.
The server is still running, but the logs are full of these messages:

Error sending message: Unable to queue packet in the transport. Likely caused by send queue size (‘Max Send Queue Size’) being too small.
CompleteSend failed with the following error code: -5

Is it that my server can’t handle the workload?
A bad RPC?
My code is rather simple: I just synchronise the players’ positions (client-owned), and have a few RPCs for things like player interactions.
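
Roughly, the sync looks like this (a simplified sketch using NGO’s owner-writable NetworkVariable; the class and field names are illustrative, not my exact code):

```csharp
using Unity.Netcode;
using UnityEngine;

// Sketch: client-owned position sync. The owner writes its position into an
// owner-writable NetworkVariable; every other peer applies the replicated value.
public class PlayerPositionSync : NetworkBehaviour
{
    private readonly NetworkVariable<Vector3> m_Position = new NetworkVariable<Vector3>(
        default,
        NetworkVariableReadPermission.Everyone,
        NetworkVariableWritePermission.Owner);

    void Update()
    {
        if (IsOwner)
            m_Position.Value = transform.position; // owner pushes its position
        else
            transform.position = m_Position.Value; // everyone else follows
    }
}
```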

Thank you

What version of the Unity Transport package are you using?


Thank you for your answer, Simon.
I’m using 2.4.0. I had the same issue with 2.3.0.

I did upgrade the server and implemented networked object visibility since then, we’ll see if that makes a difference.
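
For reference, the visibility logic is along these lines (a simplified sketch using NGO’s CheckObjectVisibility delegate; the distance cutoff is illustrative):

```csharp
using Unity.Netcode;
using UnityEngine;

// Sketch: distance-based networked object visibility. The server only
// replicates this object to clients whose player object is nearby.
public class DistanceVisibility : NetworkBehaviour
{
    public float MaxDistance = 30f; // illustrative cutoff

    public override void OnNetworkSpawn()
    {
        if (!IsServer) return;
        NetworkObject.CheckObjectVisibility = clientId =>
        {
            var client = NetworkManager.ConnectedClients[clientId];
            if (client.PlayerObject == null) return true;
            var distance = Vector3.Distance(
                client.PlayerObject.transform.position, transform.position);
            return distance <= MaxDistance;
        };
    }
}
```

(As I understand it, the delegate is evaluated at spawn time; changing visibility for already-spawned objects requires NetworkShow/NetworkHide.)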

We used to have issues with that same error (-5, network queue being full) with WebSockets when the TCP socket buffers were full, but I thought we had resolved those in Unity Transport 2.2. That it’s still happening in 2.4 is concerning. I’ll have a look and see if we haven’t missed anything.


I see, thank you! Let me know if there’s anything else I can provide to help.

I looked into this a bit more and the error message itself can show up when TCP buffers are full but it shouldn’t prevent forward progress. The error is more of a warning in this instance. You can reduce the likelihood of the message appearing by increasing the value of the ‘Max Send Queue Size’ parameter of the Unity Transport component. You can easily set this value in the thousands without issue. The only impact is slightly increased memory usage.
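
If you ever use the low-level driver directly instead of the component, the equivalent settings should look roughly like this (values are illustrative; with NGO you’d normally just set ‘Max Send Queue Size’ on the UnityTransport component in the inspector):

```csharp
using Unity.Networking.Transport;

// Sketch: increasing the transport's send/receive queue capacities via
// NetworkSettings in Unity Transport 2.x. The 4096 values are illustrative.
var settings = new NetworkSettings();
settings.WithNetworkConfigParameters(
    sendQueueCapacity: 4096,      // roughly the 'Max Send Queue Size' knob
    receiveQueueCapacity: 4096);
var driver = NetworkDriver.Create(settings);
```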

Perhaps with this error reduced other more relevant error messages will become more prominent.

Also, you mention having multiple players connected. At how many players do you start seeing this issue? And is it possible that some of these players are kept for a long time in background tabs in their browsers? I’m wondering if perhaps the browser is throttling down some background WebSocket connections, which could cause some buildup on the server side if clients are not pulling in messages regularly enough.


Yes, logs are more useful this way, thanks!
I optimized my RPCs further, trying to send less data. It seems to crash a little less frequently.

The player count varies; we don’t really have a hard limit yet, but I’ll probably add a connection queue to make sure we don’t let in too many players at once.

I think our peak was around 50 players without any issue; it might have been more, counting people trying to connect.
It usually starts getting problematic around 20–25 players, though I think we’ve sometimes had crashes with fewer. The server usually stays up for a few hours at a time; at peak hours, it’s more like 10–15 minutes.

I’ve made a lot of changes since the first message, with many optimizations like handling visibility, so I’m hopeful that the crashes will at least be a little less frequent.

There are also, rarely, these warnings, but they don’t seem related to the crashes:

  • Failed to decrypt packet (error: 1048580). Likely internal TLS failure. Closing connection.
  • TLS handshake failed at step 1. Closing connection.

Also is it possible that some of these players are kept for a long time in background tabs in your browser? I’m wondering if perhaps the browser is not throttling down some background WebSocket connections which could cause some buildup on the server side if clients are not pulling in messages regularly enough.

Oh yeah, definitely. Is it something we should avoid? To be fair, we’ve been monitoring and helping people on the app because some of the people using it are not very tech-savvy. So we usually have at least two AFK players on the server, sitting on a bench and doing nothing.
Should I play around with heartbeats/disconnect timeout to avoid issues with this?

Thank you again for your time, I’ll try to investigate some more.

EDIT: it actually just crashed with two players, so this might not even be the issue.
Although the more players join, the more likely it seems to crash at some point; it doesn’t seem to have issues at night, for instance.
I’ve probably made a mistake somewhere, but no exceptions are thrown.

Oh yeah definitely, is it something we should avoid? To be fair, we’ve been monitoring and helping people on the app because some of the people using it are not very tech-savvy. So we usually have at least two AFK players on the server, sitting on a bench and doing nothing.
Should I play around with heartbeats/disconnect timeout to avoid issues with this?

Unfortunately it’s not exactly something that can be avoided. It’s always possible an actual player will do this and that’s out of your control. I was mostly curious if it could be the case because it could explain the send queue filling up so quickly.

Heartbeats will likely not have much of an impact if the connections are throttled down by the browser, but decreasing the disconnect timeout could help. Note however that this does mean that players will be kicked out of the game more quickly if inactive. I’d suggest only doing that if you have a good way to quickly reintroduce a player into the game.
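
For illustration, the relevant knobs on the low-level driver look something like this (the values are illustrative, not recommendations; with NGO the same timeouts are exposed on the UnityTransport component):

```csharp
using Unity.Networking.Transport;

// Sketch: tuning keep-alive and disconnect behavior in Unity Transport 2.x.
var settings = new NetworkSettings();
settings.WithNetworkConfigParameters(
    heartbeatTimeoutMS: 500,      // idle time before a keep-alive is sent
    disconnectTimeoutMS: 10000);  // kick a silent connection after 10 s
var driver = NetworkDriver.Create(settings);
```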

Regarding the TLS errors, they should not cause crashes of the server, but you’re the second user I see facing those, and they really should never appear other than under rare circumstances. There might be something wrong with our TLS layer. I’ll make a note to investigate this on my end.

Also while looking at the code to see what might be happening in your situation, I stumbled upon a bug that could explain the -5 errors that you are seeing. It’s a bit of a shot in the dark and I still need to do some more validation of the fix on my end, but if you want to try it, here’s the patch (the package to modify is com.unity.transport; you can modify it locally):

diff --git a/com.unity.transport/Runtime/TCPNetworkInterface.cs b/com.unity.transport/Runtime/TCPNetworkInterface.cs
index cdefa433..c7425bdc 100644
--- a/com.unity.transport/Runtime/TCPNetworkInterface.cs
+++ b/com.unity.transport/Runtime/TCPNetworkInterface.cs
@@ -641,7 +641,13 @@ private void ProcessPendingSends()

                     var connectionState = ConnectionList.GetConnectionState(connectionId);
                     if (connectionState == NetworkConnection.State.Disconnected)
+                    {
+                        // Re-enqueue the buffer and drop the packet. We'll never be able to
+                        // actually send this pending send if the connection is down.
+                        SendQueue.EnqueuePacket(bufferIndex, out packetProcessor);
+                        packetProcessor.Drop();
                         continue;
+                    }

                     var connectionData = ConnectionMap[connectionId];

Thank you very much, I’ll try it out!
Do you know the best way to simulate fake users, to test the server load?
I know of tools like JMeter but I’m curious what the best way to do that is with NGO.

Unfortunately I’m not sure what the best way would be with NGO (I’m more familiar with the Unity Transport library). It’s technically possible to instantiate multiple NetworkManager objects (NGO’s tests do it) so that could be an option. Just be careful not to access the singleton in this case.

Another option would be to create a tiny client build and run multiple copies of it in batch mode. You could even build with the dedicated server target in this case to make the builds smaller and more memory-efficient.
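
A minimal bootstrap for such a load-test client could look something like this (a hypothetical sketch; the class name, address, and port are placeholders for your own setup):

```csharp
using System;
using Unity.Netcode;
using UnityEngine;

// Sketch: auto-connect when the build is launched headless, so a script can
// spawn many instances (e.g. a shell loop running the build N times with
// -batchmode -nographics).
public class LoadTestBootstrap : MonoBehaviour
{
    void Start()
    {
        var args = Environment.GetCommandLineArgs();
        if (Array.IndexOf(args, "-batchmode") >= 0)
        {
            // Connection data (address/port) is assumed to be configured
            // on the UnityTransport component beforehand.
            NetworkManager.Singleton.StartClient();
        }
    }
}
```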