Thursday, October 25, 2007

A perfect wiki

We are using a wiki as a knowledge store on one of our projects. While it has many good collaborative features, it also has some inconveniences.

First of all, it is hard to write richly formatted documents or documents with a complex structure. So my requirements for a wiki are all about that.

1. The wiki should provide a view representing the hierarchy of pages.
2. Import/export to ODF or another word processor format is a must-have.
3. Easy table editing.
4. The wiki must send a daily summary of changes by email.
5. All the other usual wiki stuff, such as styles and so on.

Friday, September 7, 2007

Things you'd like to avoid during system design

Lately, I've been working on a geographically distributed, clustered web system and have several ideas I would like to share. All of them were written in blood, sweat, and lost dollars.

Don't rely on stability
"Yes, this is obvious," you may say. I knew this from the start of the project too. However, during deployment and preproduction testing, life showed me many new tricks. When building a networked system, every conscientious developer introduces some retry and recovery routines. Still, a lack of stability in the components has many implicit consequences.

For example, let's say we have a server on a 100 Mbit/s connection, which is enough to serve 50 concurrent jobs. So you set up the system to run 50 jobs and go off to relax a little. A day later, the data center company narrows your channel down to 10 Mbit/s for maintenance. Voilà! Your servers start dying.
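One way to avoid hard-coding the job count is to derive it from the bandwidth you actually measure. The sketch below is only an illustration: the function name, the 2 Mbit/s per-job figure (taken from 100 Mbit/s serving 50 jobs), and the idea that a recent throughput measurement is available are all assumptions, not part of the system described here.

```python
def max_concurrent_jobs(measured_mbit_per_s, per_job_mbit_per_s=2.0, hard_cap=50):
    """Return how many jobs the current link can carry.

    per_job_mbit_per_s is an assumed average: 100 Mbit/s / 50 jobs = 2 Mbit/s.
    hard_cap keeps the original limit of 50 jobs even on a faster link.
    """
    if measured_mbit_per_s <= 0:
        return 0  # no usable bandwidth, so accept no new jobs
    affordable = int(measured_mbit_per_s // per_job_mbit_per_s)
    return min(hard_cap, affordable)


# Hypothetical usage: re-evaluate the limit whenever a new measurement arrives.
if __name__ == "__main__":
    for bandwidth in (100.0, 10.0):
        print(bandwidth, "Mbit/s ->", max_concurrent_jobs(bandwidth), "jobs")
```

With the numbers from the example, the limit drops from 50 jobs to 5 when the channel is narrowed, instead of 50 jobs fighting over a tenth of the bandwidth.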

Or another example (this design problem is built into the network libraries I use):
Let's say you have a network service registry that cluster members use to discover the addresses of the services they need. Everything works well, except that all nodes participating in the cluster check each other's connectivity and availability and are authorized to delete the records of dead servers. But we are in the real world, you know.

Let's say we have 4 servers: A, B, C, D. Now node D tries to ping node A and fails. What will it do? It will kill A's record in the node registry. BUT a ping failure does not mean A is dead. It means D cannot connect to A. Other nodes (B, C) may still work with A without a problem. Now that D has deleted A's service record, ALL nodes will fail to talk to it! And the only error was to assume that the A-D link has the same properties as the A-B and A-C links.
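One possible remedy, sketched here under my own assumptions rather than taken from any real library, is to treat a failed ping as a single vote and remove a record only when a majority of the other nodes agree. The class and method names below are hypothetical.

```python
class ServiceRegistry:
    """Toy registry that deletes a record only on a majority of failure reports."""

    def __init__(self, members):
        self.members = set(members)
        self.alive = set(members)      # nodes that still have a service record
        self.failure_reports = {}      # target -> set of nodes that could not reach it

    def report_failure(self, reporter, target):
        """Record that reporter could not reach target. Return True if target was removed."""
        if target not in self.alive or reporter not in self.members or reporter == target:
            return False
        reports = self.failure_reports.setdefault(target, set())
        reports.add(reporter)
        other_nodes = len(self.members) - 1
        if len(reports) > other_nodes / 2:   # strict majority of the other nodes
            self.alive.discard(target)
            return True
        return False

    def report_success(self, reporter, target):
        """A successful ping withdraws the reporter's earlier failure vote."""
        self.failure_reports.get(target, set()).discard(reporter)


if __name__ == "__main__":
    registry = ServiceRegistry(["A", "B", "C", "D"])
    print(registry.report_failure("D", "A"))  # one broken link is not enough
    print("A" in registry.alive)
```

In the four-server example, D's failed ping alone leaves A's record in place; A is removed only if a second node, say C, also fails to reach it.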

So be aware. Don't make any assumptions about stability or about the properties of links and components. It is better to make the system adaptive.

Don't rely on the DNS system
Actually, this follows from the previous advice. The DNS system was designed long ago and its problems are well known. In most cases DNS servers work over UDP, which is not reliable. Even if nobody launches a DoS attack on your DNS, it may become a source of failures due to lost UDP packets and so on.

But the most important source of risk is the distributed nature of DNS. In most cases it is not under your full control. There are TLD servers, the data center's DNS servers, the domain registrar's servers, and so on. When problems occur, you may spend days talking with system administrators and trying to find out what is going on.
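A modest way to soften a DNS outage is to remember the last address that resolved successfully and fall back to it when a lookup fails. This is only a sketch: the CachingResolver class is hypothetical, it uses the standard socket.gethostbyname call by default, and it assumes a stale address is better than none, which will not hold for every service.

```python
import socket


class CachingResolver:
    """Resolve hostnames, falling back to the last known good address on failure."""

    def __init__(self, lookup=socket.gethostbyname):
        self._lookup = lookup      # injectable so the resolver can be tested offline
        self._last_good = {}       # hostname -> last address that resolved

    def resolve(self, hostname):
        try:
            address = self._lookup(hostname)
        except OSError:            # socket.gaierror is a subclass of OSError
            if hostname in self._last_good:
                return self._last_good[hostname]
            raise                  # nothing cached, so let the caller handle it
        self._last_good[hostname] = address
        return address
```

The cache here lives only in memory; persisting it, or keeping a static list of critical addresses in the cluster configuration, would be a further step in the same direction.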