I’m seeking advice on the best strategies for managing server failures. Currently, I’m considering a setup with an extra, special client acting as a secondary passive server that mirrors world changes and maintains its own copy of the world data. In the event of a primary server failure, the special client would create a new server world and import all ghost data; all clients would then switch to this new server and continue the game session.
My questions are:
Is there any plan to implement a similar feature or any solution for server fault tolerance in Unity?
If we were to implement this feature independently, perhaps using a specialized client that’s unaffected by relevancy…, is this approach even feasible? And do you have any recommendations or advice on this?
Edit:
Since there is a variance in the structure of entities between the server and client, a direct import of entities is unfeasible. We plan to instead transfer ghost data to the new server and apply it to update the states of existing or new ghosts.
Could you provide any recommendations or advice on this approach?
What type of server fault are you specifically referring to?
It’s certainly feasible (as this is how older AAA games - which use a host model - handle host migration), but it’s incredibly complicated to develop.
IMHO it would make more sense to support some kind of server persistence. I.e. Every X seconds, store server game state backups into a short-term DB (even a locally held file).
Then, if the server executable faults, you can at least restore from the file.
From the original server, you mean? You’d have to do this all the time, while the server is still alive. This cost would be significant (as you’d essentially be doubling server cost) unless you don’t intend to run simulation on the backup server (at which point it’s just a database).
Thank you for your response. I’d like to clarify a few points based on your questions:
Regarding the type of server fault I’m referring to, I mean a scenario where, for any reason, the server hardware stops responding at any time. This could be due to a variety of issues, including hardware failures, network problems, or software crashes.
In response to your comment on the original server and the continuous transfer of ghost data, I understand your concern about the potential cost implications. Our plan involves using an additional client as a passive server, solely for maintaining the latest server ghost data, such as transformations, health status, cooldowns, etc.
In the event of a primary server failure, our strategy is to quickly create a new server world and pass it the ghost data from the additional client (Passive Server). Once the new server world is up and running, with all the transferred ghost data properly initialized, all clients will be directed to join it using its IP and port. This should allow the game session to continue with minimal interruption.
I hope this clarifies our approach. I’m looking forward to any further advice or recommendations you might have on this strategy. For example, what would be the right approach to create a new server world and transfer/recreate the ghosts from the Passive Client world into the new server world, knowing that the Passive Client is always running on the same machine as the new server world?
This approach would only work in a couple of your failure situations.
E.g. Imagine you’re running the Server executable on a hardware instance. If said hardware faults, it’ll bring down all processes running on it.
Thus, you’ll need this ‘passive client’ to be running on a different hardware instance.
Thus, you need your server orchestration to:
Detect when a ‘primary server’ goes down.
Convert the ‘passive server’ into the ‘primary server’.
Remember all clients who should have been connected to the ‘primary server’, and notify said clients that they should reconnect to this new server.
Kick off another ‘passive server’ and connect it to the new ‘primary server’.
Once the passive client connects to itself, tell the new server to load the data from said passive client.
Resume (a.k.a. unpause) gameplay.
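To make the orchestration steps above a bit more concrete, here is a rough sketch in plain Python. It is not Unity or Netcode for Entities code; `Instance`, `Orchestrator`, the heartbeat timeout and the log strings are all made-up names for illustration, and the real work (process management, networking, loading state) is reduced to log entries.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instance:
    """A hypothetical game process tracked by the orchestrator."""
    address: str
    role: str              # "primary" or "passive"
    last_heartbeat: float  # time of the last heartbeat we received


@dataclass
class Orchestrator:
    primary: Instance
    passive: Optional[Instance]
    clients: List[str]              # ids of clients expected on the primary
    heartbeat_timeout: float = 3.0  # assumed value, tune for your game
    log: List[str] = field(default_factory=list)

    def primary_is_down(self, now: float) -> bool:
        # Step 1: detect when the primary stops sending heartbeats.
        return now - self.primary.last_heartbeat > self.heartbeat_timeout

    def fail_over(self, now: float, spare_address: str) -> bool:
        """Run the recovery steps in order. Returns True if a failover ran."""
        if not self.primary_is_down(now) or self.passive is None:
            return False
        # Step 2: convert the passive server into the primary server.
        self.passive.role = "primary"
        self.primary = self.passive
        self.primary.last_heartbeat = now
        self.log.append(f"promote {self.primary.address}")
        # Step 3: notify every remembered client to reconnect.
        for client in self.clients:
            self.log.append(f"notify {client} -> {self.primary.address}")
        # Step 4: kick off a fresh passive server on other hardware.
        self.passive = Instance(spare_address, "passive", now)
        self.log.append(f"start passive {spare_address}")
        # Steps 5 and 6: load the mirrored state, then unpause gameplay.
        self.log.append("load state")
        self.log.append("resume")
        return True
```

The point of the sketch is mostly the ordering: clients are only told to reconnect after a new primary exists, and gameplay only resumes after the state has been loaded.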
I do want to repeat that it’ll be easier and more effective to just create an ECS Save/Load system, periodically saving the complete server state to a DB instance. Advantages:
The CPU cost of saving periodically (e.g. every 1/5/30/60/300 seconds) is tiny compared to the cost of replicating ghosts to another client + the cost of running said client. ECS data layout is ideal for this.
You only need a DB (which you likely already have), rather than double the instances.
Save/Load would allow you to persist gameplay over significantly longer durations (via DB).
And therefore provide actual backups and persistence in case the instance itself dies.
Save/Load is a nice gameplay feature in and of itself, assuming your game has that need. E.g. Minecraft realms, shared campaigns.
First of all, thank you so much for your invaluable assistance! It’s been a great help. (You’ve saved me many times!)
We’re weighing the pros and cons of both the Passive Server and the DB options, and here’s what we’ve considered:
Pros of the Passive Server Over the DB Option:
Data Precision: The passive server can capture data up to the exact frame of a crash, ensuring no loss of player actions, even seconds or frames prior. This is a significant benefit for user experience.
Bandwidth Efficiency: Including just one additional client for ghost data seems more bandwidth-efficient, especially with delta compression, compared to constantly sending serialized world data for maximum precision.
Performance: The server would only need to manage one additional connection, with all ghosts set as relevant, which seems more manageable.
Quick Unpause: Resuming gameplay would be faster and potentially less error-prone as there’s no need to download the latest world state from an external database.
Cons of the Passive Server Over the DB Option:
Complexity in Ghost Data Transfer: Transferring and reconstructing ghosts from the passive client to a new server world could be challenging. This includes issues like:
Reconstructing ghost entities with different structures between client and server.
Managing ghost entity references within components.
Dealing with the absence of predicted components in the passive server’s interpolated ghosts.
Dependency on an Additional Client: The system requires a constantly present, trusted additional client.
My Questions:
For the Client/Server Time Systems:
Does the system automatically handle the tick adjustments in the client world whenever a new connection is established with a server, or is it necessary to create a new client world each time the server changes?
If we choose the Passive Server Option:
What’s the most effective method for transferring and reconstructing ghosts from the client to a newly created server world?
If we opt for the DB Option:
Does saving and loading the server world to/from a database automatically handle complex scenarios like remapping ghost entity references within components?
Is it feasible to save/load specific entities, such as ghost entities only, instead of the entire server world?
Thank you again for your guidance and insights. Your feedback will be crucial in determining our next steps.
Any ghosts which have not yet been replicated to your Passive Client.
The deltas for all ghosts which have not been added to a snapshot (which arrived) on the Passive Client (assuming the snapshot even arrives… which admittedly is likely as it won’t ever leave the cloud).
Any ghosts which are not “relevant” to the Passive Client. You’ll need to ensure that the Passive Client has special rules, removing any relevancy limitations.
It depends, but yeah, this is probably true in the simplest case. Note: You only need to save the server state to another instance (like a DB) if you expect the actual cloud instance to randomly spontaneously die. As far as I’m aware, this is astronomically rare for any major cloud provider. If you’re only concerned about game server executable crashes (significantly more likely), then disk IO should be extremely reliable. You always save to 2 or more save-file copies (round-robin), and thus always have a valid save file (even if you crash during a save operation, corrupting the save).
Netcode handles this correctly yeah. You’ll be able to observe this by simply disconnecting a client and reconnecting it to the same world (via the PlayMode tool buttons).
I’ve honestly never tried, but presumably you’ll just need a system that runs once that performs the “fix-up”. What is required to fix-up is another question. It seems feasible, but there are unknown unknowns. Best thing I can recommend is to try it.
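One plausible shape for such a fix-up pass, sketched in plain Python rather than ECS code: recreate every ghost first, then remap the references between them in a second pass. `GhostRecord`, `ServerEntity` and `rebuild` are hypothetical names, and this says nothing about what Netcode for Entities itself would require; it only shows the two-pass remapping idea.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GhostRecord:
    """Data captured per ghost on the passive side (hypothetical layout)."""
    ghost_id: int
    health: int
    target_ghost_id: Optional[int]  # reference to another ghost, if any


@dataclass
class ServerEntity:
    """Entity recreated in the new server world (hypothetical layout)."""
    entity_id: int
    health: int
    target_entity_id: Optional[int] = None


def rebuild(records: List[GhostRecord],
            first_entity_id: int = 1000) -> Dict[int, ServerEntity]:
    # Pass 1: create every entity and remember old ghost id -> new entity id.
    ghost_to_entity: Dict[int, int] = {}
    entities: Dict[int, ServerEntity] = {}
    for offset, rec in enumerate(records):
        entity_id = first_entity_id + offset
        ghost_to_entity[rec.ghost_id] = entity_id
        entities[entity_id] = ServerEntity(entity_id, rec.health)
    # Pass 2: now that all entities exist, remap references between them.
    for rec in records:
        if rec.target_ghost_id is not None:
            entity = entities[ghost_to_entity[rec.ghost_id]]
            # A reference to a ghost that was never captured stays None.
            entity.target_entity_id = ghost_to_entity.get(rec.target_ghost_id)
    return entities
```

Doing it in two passes means a ghost can reference another ghost that appears later in the list, and dangling references end up as `None` instead of pointing at the wrong entity.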
It’s certainly possible (e.g. Minecraft reproducing the same world voxel state via seed + generator function, only storing and replicating chunk changes), but it’s non-trivial. I.e.
You either need a deterministic approach to re-spawn these kinds of entities.
OR you explicitly say that they’re not replicated, and the players & gameplay just deal with that. E.g. A racing game may reset the position of all traffic cones… Although IMHO those should be low-priority static ghosts for game feel anyway, because persistent map destruction in racing games is fun.
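For the deterministic option, the "seed + generator function, only store the changes" idea can be sketched like this in plain Python. The grid world, `generate_world` and `restore_world` are made up for the example and only stand in for whatever your game actually generates.

```python
import random
from typing import Dict, Tuple

Cell = Tuple[int, int]


def generate_world(seed: int, size: int) -> Dict[Cell, int]:
    """Deterministically build the baseline world from a seed."""
    rng = random.Random(seed)  # private RNG, so global state can't interfere
    return {(x, y): rng.randint(0, 3) for x in range(size) for y in range(size)}


def restore_world(seed: int, size: int,
                  changes: Dict[Cell, int]) -> Dict[Cell, int]:
    """Regenerate the baseline, then re-apply only the stored changes."""
    world = generate_world(seed, size)
    world.update(changes)
    return world
```

With this approach the save only needs the seed, the world size and the `changes` dictionary, rather than every cell.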
The other thing to say is: Are you certain that this backup + restore feature is even necessary? Depending on the game, a tiny % of server crashes is tolerable. E.g. I’ve had matches fail (mid-game) in all major AAA games I’ve played this year.
Don’t take what I say as gospel! You know your use-case better than I do.