WebGL / Linux server crashing after a while

Hi, I need help with my app.
I’m using NGO with Unity Transport.
I have a Linux server hosted on OVH that runs the Unity server instance.
The client is hosted on the same machine: a Node.js server serves the WebGL build and a database.
The WebGL client and the Unity server communicate over wss using Unity Transport.

However, after players join, the server seems to crash.
After a while, new connections are stuck on the loading screen.
Then, a few minutes later, everybody is disconnected.
The server is still running, but the logs are full of these messages:

Error sending message: Unable to queue packet in the transport. Likely caused by send queue size (‘Max Send Queue Size’) being too small.
CompleteSend failed with the following error code: -5

Is it that my server can’t handle the workload?
A bad RPC?
My code is rather simple: I just synchronise the players’ positions (client-owned) and have a few RPCs for things like player interactions.

Thank you

What version of the Unity Transport package are you using?

Thank you for your answer, Simon.
I’m using 2.4.0. I had the same issue with 2.3.0

I’ve upgraded the server and implemented network object visibility since then, so we’ll see if that makes a difference.

We used to have issues with that same error (-5, network queue being full) with WebSockets when the TCP socket buffers were full, but I thought we had resolved those in Unity Transport 2.2. That it’s still happening in 2.4 is concerning. I’ll have a look and see if we haven’t missed anything.

I see, thank you! Let me know if there’s anything else I can provide to help.

I looked into this a bit more and the error message itself can show up when TCP buffers are full but it shouldn’t prevent forward progress. The error is more of a warning in this instance. You can reduce the likelihood of the message appearing by increasing the value of the ‘Max Send Queue Size’ parameter of the Unity Transport component. You can easily set this value in the thousands without issue. The only impact is slightly increased memory usage.

Perhaps with this error reduced other more relevant error messages will become more prominent.
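
For anyone who prefers to set this from code rather than in the Inspector, here is a minimal sketch. It assumes the setting maps to the `MaxSendQueueSize` property on NGO's `UnityTransport` component, that the property is still exposed in your NGO version, and that the component sits on the same GameObject as this script. The class name `TransportQueueConfig` and the override field are hypothetical. No value is hard-coded, because the right number and its unit may depend on the package version; check the property's documentation before picking one.

```csharp
using Unity.Netcode.Transports.UTP;
using UnityEngine;

// Hypothetical helper, not part of NGO or Unity Transport.
[RequireComponent(typeof(UnityTransport))]
public class TransportQueueConfig : MonoBehaviour
{
    // 0 means "keep whatever the UnityTransport component already uses".
    [SerializeField] private int maxSendQueueSizeOverride = 0;

    private void Awake()
    {
        var transport = GetComponent<UnityTransport>();
        Debug.Log($"Max Send Queue Size currently: {transport.MaxSendQueueSize}");

        if (maxSendQueueSizeOverride > 0)
        {
            // Assumption: this runs before StartServer/StartClient,
            // so the transport picks the value up when it starts.
            transport.MaxSendQueueSize = maxSendQueueSizeOverride;
            Debug.Log($"Max Send Queue Size overridden to: {maxSendQueueSizeOverride}");
        }
    }
}
```

Logging the value at startup also puts it in the server log, which makes it easier to correlate with the error messages later.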

Also, you mention having multiple players connected. At how many players do you start seeing this issue? And is it possible that some of these players are kept for a long time in background browser tabs? I’m wondering whether the browser is throttling some background WebSocket connections, which could cause a buildup on the server side if clients are not pulling in messages regularly enough.

Yes, logs are more useful this way, thanks!
I optimized my RPCs further, trying to send less data. It seems to crash a little less frequently.

The player count varies. We don’t have a hard limit yet, but I’m probably going to add a queue to ensure that we don’t have too many players.

I think our peak was something like 50 players, without any issue. Might have been more with people trying to connect.
It usually starts getting problematic around 20 to 25 players, though I think we sometimes had crashes with fewer. Off-peak, the server usually stays up for a few hours. At peak hours, it’s closer to 10 to 15 minutes.

I’ve made a lot of changes since the first message, with many optimizations such as handling visibility, so I’m hopeful that the crashes will at least be a little less frequent.

There are, rarely, these warnings too. But they don’t seem related to the crashes:

  • Failed to decrypt packet (error: 1048580). Likely internal TLS failure. Closing connection.
  • TLS handshake failed at step 1. Closing connection.

Also is it possible that some of these players are kept for a long time in background tabs in your browser? I’m wondering if perhaps the browser is not throttling down some background WebSocket connections which could cause some buildup on the server side if clients are not pulling in messages regularly enough.

Oh yeah definitely, is it something we should avoid? To be fair, we’ve been monitoring and helping people on the app because we have some people using it who are not very tech friendly. So we usually have at least two AFK players on the server, sitting on a bench and doing nothing.
Should I play around with heartbeats/disconnect timeout to avoid issues with this?

Thank you again for your time, I’ll try to investigate some more.

EDIT: it actually just crashed with two players, so this might not even be the issue.
That said, the more players join, the more likely it seems to crash at some point. It doesn’t seem to have any issues at night, for instance.
I’ve probably made a mistake somewhere but no exceptions are thrown.

Oh yeah definitely, is it something we should avoid? To be fair, we’ve been monitoring and helping people on the app because we have some people using it who are not very tech friendly. So we usually have at least two AFK players on the server, sitting on a bench and doing nothing.
Should I play around with heartbeats/disconnect timeout to avoid issues with this?

Unfortunately it’s not exactly something that can be avoided. It’s always possible an actual player will do this and that’s out of your control. I was mostly curious if it could be the case because it could explain the send queue filling up so quickly.

Heartbeats will likely not have much of an impact if the connections are throttled down by the browser, but decreasing the disconnect timeout could help. Note however that this does mean that players will be kicked out of the game more quickly if inactive. I’d suggest only doing that if you have a good way to quickly reintroduce a player into the game.
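
If you do want to experiment with a shorter disconnect timeout from code, a minimal sketch could look like the following. It assumes NGO's `UnityTransport` component with its `DisconnectTimeoutMS` property, attached to the same GameObject as this script. The class name `DisconnectTimeoutConfig` is hypothetical, and the 10000 ms default is purely illustrative, not a value recommended anywhere in this thread.

```csharp
using Unity.Netcode.Transports.UTP;
using UnityEngine;

// Hypothetical helper, not part of NGO or Unity Transport.
[RequireComponent(typeof(UnityTransport))]
public class DisconnectTimeoutConfig : MonoBehaviour
{
    // Illustrative value only. Lower values kick inactive players sooner.
    [SerializeField] private int disconnectTimeoutMs = 10000;

    private void Awake()
    {
        var transport = GetComponent<UnityTransport>();
        int previous = transport.DisconnectTimeoutMS;

        // Assumption: this runs before StartServer/StartClient,
        // so the transport picks the value up when it starts.
        transport.DisconnectTimeoutMS = disconnectTimeoutMs;
        Debug.Log($"Disconnect timeout changed from {previous} ms to {disconnectTimeoutMs} ms");
    }
}
```

As noted above, this only makes sense if a kicked player can get back into the game quickly.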

Regarding the TLS errors, they should not cause crashes of the server, but you’re the second user I see facing those, and they really should never appear other than under rare circumstances. There might be something wrong with our TLS layer. I’ll make a note to investigate this on my end.

Also while looking at the code to see what might be happening in your situation, I stumbled upon a bug that could explain the -5 errors that you are seeing. It’s a bit of a shot in the dark and I still need to do some more validation of the fix on my end, but if you want to try it, here’s the patch (the package to modify is com.unity.transport; you can modify it locally):

diff --git a/com.unity.transport/Runtime/TCPNetworkInterface.cs b/com.unity.transport/Runtime/TCPNetworkInterface.cs
index cdefa433..c7425bdc 100644
--- a/com.unity.transport/Runtime/TCPNetworkInterface.cs
+++ b/com.unity.transport/Runtime/TCPNetworkInterface.cs
@@ -641,7 +641,13 @@ private void ProcessPendingSends()

                     var connectionState = ConnectionList.GetConnectionState(connectionId);
                     if (connectionState == NetworkConnection.State.Disconnected)
+                    {
+                        // Re-enqueue the buffer and drop the packet. We'll never be able to
+                        // actually send this pending send if the connection is down.
+                        SendQueue.EnqueuePacket(bufferIndex, out packetProcessor);
+                        packetProcessor.Drop();
                         continue;
+                    }

                     var connectionData = ConnectionMap[connectionId];

Thank you very much, I’ll try it out!
Do you know the best way to simulate fake users, to test the server load?
I know of tools like JMeter but I’m curious what the best way to do that is with NGO.

Unfortunately I’m not sure what the best way would be with NGO (I’m more familiar with the Unity Transport library). It’s technically possible to instantiate multiple NetworkManager objects (NGO’s tests do it) so that could be an option. Just be careful not to access the singleton in this case.

Another option would be to create a tiny client build and run multiple of them in batch mode. Could even build with the dedicated server target in this case to make the builds smaller and more memory-efficient.
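
A rough sketch of what the bootstrap script in such a tiny client build might look like is below. It assumes a scene containing a `NetworkManager` that uses NGO's `UnityTransport`, with WebSocket and encryption settings already configured on the component. The class name and the `-serverAddress` / `-serverPort` command-line arguments are made up for this example and are not Unity or NGO flags. It uses a serialized reference rather than the `NetworkManager` singleton.

```csharp
using System;
using Unity.Netcode;
using Unity.Netcode.Transports.UTP;
using UnityEngine;

// Hypothetical bootstrap for a headless load-test client.
public class LoadTestClientBootstrap : MonoBehaviour
{
    // Assign in the Inspector instead of relying on NetworkManager.Singleton.
    [SerializeField] private NetworkManager networkManager;

    private void Start()
    {
        // Only auto-connect when launched with -batchmode.
        if (!Application.isBatchMode)
            return;

        var transport = networkManager.NetworkConfig.NetworkTransport as UnityTransport;
        string address = GetArg("-serverAddress"); // expected to be an IP address
        string portText = GetArg("-serverPort");

        // Override the connection data only if both custom arguments are valid.
        if (transport != null && address != null && ushort.TryParse(portText, out ushort port))
            transport.SetConnectionData(address, port);

        if (!networkManager.StartClient())
            Debug.LogError("Load-test client failed to start.");
    }

    // Returns the value following the named argument, or null if absent.
    private static string GetArg(string name)
    {
        string[] args = Environment.GetCommandLineArgs();
        for (int i = 0; i < args.Length - 1; i++)
        {
            if (args[i] == name)
                return args[i + 1];
        }
        return null;
    }
}
```

Several copies of the build can then be launched from a script with `-batchmode -nographics` plus the two custom arguments, one process per simulated player. These clients only connect; any simulated movement or RPC traffic would have to be added separately.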

Wanted to bump as I am running into this exact same issue.

  • Unity 6000.0.30f1
  • Dedicated Server 1.3.2
  • Transport 2.4.0
  • NGO 2.1.1
  • WebGL Client / Linux Server.
  • Occurred within 5 minutes after 3 players connected.
  • I think it was triggered after a SendTo.Server RPC was invoked, but I can’t reproduce it.
  • Server is spamming logs until all clients are disconnected.
  • The server continues to run without issue after the clients are disconnected.
  • Clients can reconnect and no further issues occur during the test session.
  • Disconnect Timeout set to 30 seconds

Server Log (Wish I had timestamps, but all of this happens within 5 minutes):

[Netcode] [Server-Side] Transport connection established with pending Client-4.
[Netcode] [Server-Side] Pending Client-4 connection approved!
Client 4 connected to the server.
Clients connected:  1.

[Netcode] [Server-Side] Transport connection established with pending Client-5.
[Netcode] [Server-Side] Pending Client-5 connection approved!
Client 5 connected to the server.
Clients connected:  2.

[Netcode] [Server-Side] Transport connection established with pending Client-6.
[Netcode] [Server-Side] Pending Client-6 connection approved!
Client 6 connected to the server.
Clients connected:  3.

CompleteSend failed with the following error code: -5
CompleteSend failed with the following error code: -5
**... (Spams over 100x)**
CompleteSend failed with the following error code: -5
CompleteSend failed with the following error code: -5

[Netcode] Disconnect Event From 4
Client 4 disconnected from the server.
Clients left:  2.

Error sending message: Unable to queue packet in the transport. Likely caused by send queue size ('Max Send Queue Size') being too small.
CompleteSend failed with the following error code: -5
Error sending message: Unable to queue packet in the transport. Likely caused by send queue size ('Max Send Queue Size') being too small.
CompleteSend failed with the following error code: -5
CompleteSend failed with the following error code: -5
CompleteSend failed with the following error code: -5
**... (Repeats this *ababbb* pattern of "CompleteSend..." and "Error sending..." over 100x)**

[Netcode] Disconnect Event From 6
Client 6 disconnected from the server.
Clients left:  1.

Error sending message: Unable to queue packet in the transport. Likely caused by send queue size ('Max Send Queue Size') being too small.
CompleteSend failed with the following error code: -5
CompleteSend failed with the following error code: -5
Error sending message: Unable to queue packet in the transport. Likely caused by send queue size ('Max Send Queue Size') being too small.
**... (Repeats this *abba* pattern of "CompleteSend" and  "Error sending..." 100x times)**

[Netcode] Disconnect Event From 5
Client 5 disconnected from the server.
Clients left:  0.

All clients have disconnected from the server. Shutting down

[Netcode] Shutdown
Unloading 0 Unused Serialized files (Serialized files now loaded: 0)

[Netcode] ShutdownInternal
[Netcode] NetworkConnectionManager.Shutdown() -> IsListening && NetworkTransport != null -> NetworkTransport.Shutdown()
Unloading 0 unused Assets to reduce memory usage. Loaded Objects now: 10583.
Total: 5.079022 ms (FindLiveObjects: 1.054534 ms CreateObjectMapping: 0.351577 ms MarkObjects: 3.650629 ms  DeleteObjects: 0.021531 ms)

[Server] Restarting Server...

I have the server automatically restart once all clients leave, and it continues running without issue.

It almost seems like the errors are blocking the server from sending heartbeats, which causes the clients to time out from inactivity, but that is just a shot in the dark.

Did some more testing today, and I am able to consistently recreate the error spam.
My project’s setup is based on the Dedicated Server Sample project.

  • Client: Bootstrap Scene → Meta-Game Scene → Game Scene
  • Server: Bootstrap Scene → Game Scene

All of this is reproducible using the Editor as the server (meaning we have stack traces) and a single WebGL client.
Everything runs on a single Windows computer hosting both the Editor game server and the WebGL client.

  1. Start the Editor in server mode.
  2. Connect with WebGL Client A (I used Google Chrome).
  3. Once connected, turn off Wi-Fi for Client A via Chrome’s network tools (throttling options).
  4. After ~30 seconds (related to “Disconnect Timeout MS”?), the error spam starts.

Unity Transport Settings:

Stack Trace for “Error sending message…”

Stack Trace for “CompleteSend failed…”

Thanks for providing a simple reproduction method!

I can’t investigate this right now but I’ve made a note to look into this. In the meantime, if you want to you can try the patch below. It addresses a problem with WebSockets that would cause these -5 errors, although I can’t say if it’s the same problem you’re facing. The file to modify is TCPNetworkInterface.cs in package com.unity.transport (you can edit the package locally).

@@ -524,9 +524,7 @@ public unsafe void Execute()
                     // Detect if the upper layer is requesting to disconnect.
                     if (connectionState == NetworkConnection.State.Disconnecting)
                     {
-                        // Keep the connection alive a bit if we still need to send stuff.
-                        if (!connectionData.HasPendingSends)
-                            Abort(ref connectionId, ref connectionData);
+                        Abort(ref connectionId, ref connectionData);
                         continue;
                     }

@@ -641,7 +639,13 @@ private void ProcessPendingSends()

                     var connectionState = ConnectionList.GetConnectionState(connectionId);
                     if (connectionState == NetworkConnection.State.Disconnected)
+                    {
+                        // Re-enqueue the buffer and drop the packet. We'll never be able to
+                        // actually send this pending send if the connection is down.
+                        SendQueue.EnqueuePacket(bufferIndex, out packetProcessor);
+                        packetProcessor.Drop();
                         continue;
+                    }

                     var connectionData = ConnectionMap[connectionId];

This appears to have solved my issue. I have been unable to replicate the problem since doing this. Thank you!

Great! Thanks for the feedback! The fix will be part of the next release of the transport package.
