Stuart's "Law of Quadratic Reliability"

There are many reasons that networking code is very hard to write, but one of the less well known is a reason I call the "Law of Quadratic Reliability".

Let me illustrate using a computer game as an example. Say your computer game has bugs in it (as most software does) that cause it to crash occasionally. Say that on average the game runs for ten hours before crashing. This will certainly be annoying to your customers, but not devastating. Most of the games they play will complete without problems, so they probably won't be too angry about your occasional hard-to-find bug.

Now imagine that your game is running on ten networked computers, with ten players. Because there are ten copies of your program in use, the overall probability that one of the ten machines will suffer a crash during the game due to that bug is roughly ten times higher. The ten-player game now runs on average for only one hour before one of the machines suffers a crash. If that crash stops the game, then you've ruined the game-play experience for not just one player, but for all ten. Your bug is not just ten times more serious in a ten-player game -- it's a hundred times more serious. It is ten times more likely to happen, and when it does happen it annoys ten times as many people.

That's why it's the law of Quadratic Reliability -- the reliability requirement of a multi-player game goes up proportionally to the square of the number of players.
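
The arithmetic above can be sketched in a few lines of Python. This is only a rough model, not part of the original argument: it assumes every machine crashes independently at a constant rate, so the crash rates of the individual machines simply add up. The function names are made up for illustration.

```python
def mean_hours_to_first_crash(mtbf_hours, players):
    """Mean time until any one of `players` machines crashes.

    Assumes independent machines, each crashing on average once every
    `mtbf_hours` hours, so the per-machine crash rates add up.
    """
    return mtbf_hours / players


def ruined_games_per_hour(mtbf_hours, players):
    """Expected number of players whose game is ruined per hour of play.

    Crashes happen `players` times as often, and each crash stops the
    game for all `players` people, giving players * players / mtbf_hours.
    """
    crashes_per_hour = players / mtbf_hours
    return crashes_per_hour * players


# The example from the text: a ten-hour bug, one player versus ten.
print(mean_hours_to_first_crash(10, 1), mean_hours_to_first_crash(10, 10))
print(ruined_games_per_hour(10, 10) / ruined_games_per_hour(10, 1))
```

With a ten-hour bug and ten players, the first function gives one hour between crashes, and the ratio printed on the last line comes out at about a hundred, which is the square of the number of players.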

In some ways the problem is even worse than quadratic:

Multi-player games tend to last a lot longer than single-player games, so the likelihood of your bug biting goes up for that reason too. In single-player games you can limit the damage the bug causes by allowing the player to save the game state at key points and resume it later. That is very hard to do in a multi-player game. (Who saves the game? One player? Every player? Who resumes the game? How do they resume the game? What if not all of the original players are present when they resume the game? What if new players want to join? etc.)
Multi-player games tend to be more effort to set up than single-player games. You can play a single-player game any time you choose, but a multi-player game needs other players. Say you invite a bunch of friends around to your house, or persuade colleagues to stay late after work to play, and then half-way through the game croaks, ruining everyone's enjoyment. You're not going to be very popular, and they're probably not going to agree to play again next time.
Networked games have more ways to go wrong than single-player games. There are additional failure modes that a single-player game just doesn't have to deal with. What if the network loses a packet? What if duplicate packets arrive? What if packets arrive out of order? What if the network completely flakes out and loses all packets for 15 seconds (like when someone changes an Ethernet terminator)? What if some network hub goes down completely and your players get partitioned into little groups that can still see each other, but no other players? What if someone knocks out the network cable from their computer, disconnecting themselves from the network? These are all extra ways for your program to go wrong that just don't happen in single-computer single-player games.

So this is the double-jeopardy of networked games compared to single-player games -- it is both much harder to make a networked game that is bug free, and much more important to do so.

The problem of Quadratic Reliability can be somewhat mitigated if you can make your game networking code robust enough that the game can continue even if one or more of the players' machines crash. This is also very hard to do, and most multi-player games do not attempt it.
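
To put rough numbers on "somewhat mitigated": under the same simple assumption as the ten-hour example (each machine crashes independently, on average once every so many hours of play), a game that survives a crash still sees crashes N times as often, but each crash now spoils the game for one player rather than all N. The damage then grows linearly with the number of players instead of quadratically. The sketch below is illustrative only; the function name and the `survives_crash` flag are invented for this example.

```python
def spoiled_games_per_hour(mtbf_hours, players, survives_crash):
    """Expected number of players whose game is spoiled per hour of play.

    Assumption: machines crash independently, each on average once
    every `mtbf_hours` hours.
    """
    crashes_per_hour = players / mtbf_hours
    # A fragile game stops for everyone; a robust one loses only the
    # player whose machine crashed.
    spoiled_per_crash = 1 if survives_crash else players
    return crashes_per_hour * spoiled_per_crash


# Compare the two designs for a bug that bites once every ten hours.
for n in (1, 2, 5, 10):
    fragile = spoiled_games_per_hour(10, n, survives_crash=False)
    robust = spoiled_games_per_hour(10, n, survives_crash=True)
    print(f"{n:2d} players: fragile {fragile:5.2f}, robust {robust:5.2f}")
```

Even in the robust case the bug still hurts ten times as often with ten players, which is why the mitigation is only partial.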

Page maintained by Stuart Cheshire