Lately, I've been working on a geographically distributed clustered web system and got several ideas I would like to share. All of them are written by blood, sweat and pieces of lost dollars.
Don't rely on stability
"Yes, this is obvious" - you may say. But I was knowing this from start the project also. However during deployment and preproduction testing the life has shown many new tricks to me. Building network system all conscious developers will introduce some retries/recovery routines. However, absence of components stability has many implicit consequences.
For example, let's say we have a server on 100Mbit/s channel connection which is enough to serve 50 concurrent jobs. So you are setting up the system to run 50 jobs and go to relax a little. A day after data center company narrows your channel down to 10Mbit/s due the maintenance. Viola! Your servers start dying.
Or another example (this design problem incorporated into network libraries I use):
Let's say you have a network service registry which is used by cluster members to discover addresses of needed services. Everything works well except that all nodes participating in the cluster check connectivity and availability of each other and authorized to delete record about dead servers. But we are in the real world, you know.
Let's say we have 4 servers: A, B, C, D. Now node D tries to ping node A and fails. What will it do? It will kill A's record in the node registry. BUT ping failure does not mean A is dead. It means D can not connect with A. Other nodes (B, C) may work with A without a problem still. Now, when D has deleted A's service record ALL nodes will fail to talk with it! And the only error was to suppose A-D link has the same properties as A-B and A-C links.
So be aware. Don't make any assumptions on stability and properties. It would be better to make the system adaptive.
Don't rely on DNS system
Actually this is a derivative from the previous advice. DNS system was designed a long ago and its problems are well know. In most cases dns servers work over UDP that is not reliable. Even if nobody will DOS attack your dns, it may become source of failures due looses in UDP packets and so on.
But the most important source of risk is distributed nature of DNS. In most cases it does not under your full control. There are TLD server, DataCenter DNS server, servers of domain registrator and so on. When problems occur you may t spent days talking with system administrators and trying to find what is going on.