Friday, December 14, 2007

Mines in the Field of GWT Development Planning

Recently, the Google Web Toolkit has attracted the attention of web developers from all over the world. GWT really is a great technology for AJAX development: it removes many of the headaches associated with cross-browser development, user interaction and the development cycle.

The library makes it possible to build Web 2.0 applications with a high level of interactivity. And there is a trap.

Often, the user interface of a web application is simpler to build than a desktop UI, for two reasons. First, the HTML/CSS ecosystem provides a great number of tools for easily expressing the designer's vision, and there is no build/run/debug cycle for web pages.

Second, HTML offers a significantly lower level of interactivity, most of which is encapsulated by the browser, so developers don't need to debug each interaction themselves.

As a revolutionary AJAX development tool, GWT changes both of these factors. GWT apps can be highly interactive and they are component based, so the old expectation that a web application's look'n'feel is easy to change is no longer entirely true. Changing the visual design may now require significant effort. Moreover, since we now use components, the visual design is limited by the available set of controls. A slightly fancy text edit field invented by a web designer may create a real mess for the dev team. Isn't it just like the good old times of desktop application UI?

So GWT development requires very tight cooperation between the design and development teams. Keep these guys in one place, preferably in a room with closed doors and a small window to serve food through.

Another feature of a GWT application is interactivity. Yes, this is a great feature and a great headache.

As I said before, Web 1.0 applications rarely provide an interactive response to the user's actions, so the first move of web designers is to provide static mockups of the UI. This does not work here. Developers need the application's behavior to be specified, and behavior is part of the code, which is not an easy thing to change. So you need to reserve enough resources to debug and tweak the UI.

So let me say this again: keep your developers and interface designers as close as possible. Maybe this is why Google apps (Gmail, Reader and others) are so well thought out: the usability team works very closely with the developers.

Wednesday, December 12, 2007

Production system maintenance. Part 2

In the previous post I described several organizational aspects of a production system update. Now it is time for technical tips and tricks.

Name code identically on all system nodes
It is usually a good idea to name things identically across the systems you manage. It can be a real problem to have jboss4, jboss4.0.5 and jboss as names for the same piece of code on your cluster nodes. Use a single, simple convention for code naming and placement.

Name resources differently
While a single naming schema for code lowers the amount of time spent on unproductive things, resource naming is quite different. In a mature environment code can easily be restored from several places: developer machines, continuous integration, staging servers, the source repository and so on. Resources (i.e. data) are different. The contents and schema of your database may be unique at any given moment. Moreover, the data may represent great value for your company.

So damage to data should be avoided by any means. As the first line of defence, name your databases differently depending on the type of environment (production, testing, development, etc.), contents and schema version. I usually use the following db identifier format:

contents-timestamp_of_last_schema_update-database_type

For example: chirp-20071210-prod. While it looks quite complex, this notation may protect you from actions performed on the wrong data by mistake. Unfortunately, the "Oh God, I've dropped the wrong database" problem is not as rare as I first expected. ;)
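As an illustration, here is a minimal Python sketch (the function names and the check are my own, not part of any real tooling) of how such an identifier can be built and verified before running a destructive command:

```python
from datetime import date

def db_identifier(contents, schema_date, db_type):
    """Build a database name following the contents-YYYYMMDD-type convention."""
    return f"{contents}-{schema_date:%Y%m%d}-{db_type}"

def assert_safe(db_name, expected_type):
    """Refuse to run a destructive command against the wrong kind of database."""
    if not db_name.endswith("-" + expected_type):
        raise RuntimeError(f"Refusing to touch {db_name}: expected a '{expected_type}' database")

name = db_identifier("chirp", date(2007, 12, 10), "prod")  # -> 'chirp-20071210-prod'
assert_safe(name, "dev")  # raises, protecting the production database from a dev script
```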

Use consistent hostnames
A correctly set hostname is not an absolute requirement for a server to function, but good names can help you a lot during operation. It is funny to have a server called 'snoopy' when you have three in total. When you have more, their names should be a little more transparent and linked to their properties, such as the hosting data center and IP address.

Also, server software needs to know the name of the server it runs on. That is why the hostname and the DNS name must be the same.

Highlight your current context
Most maintenance errors I have seen were "the right command in the wrong context". For example, I once dropped a production db while thinking it was the staging environment. (This is why I now back up all data on the production server before an update.)

So put information about your current host, database, directory everywhere.
  • Put the hostname, username and current directory in the command-line prompt.
  • Put the hostname into the xterm title.
  • Put the hostname and database name into the mysql client prompt.
And be attentive.
Don't work as root
Actually, this is impossible to avoid completely. Just try to work as the superuser as little as possible. root or Administrator can do many dangerous things; you know, one error and your root filesystem is empty. ;-)

Have a remotely controlled power switch
This is not actually required for a VPS, but it may be essential for dedicated servers. Configuring the network remotely is not such a rare task for a system administrator, and it is very, very inconvenient to have a node with an improperly configured network that can only be managed over ssh.

Yes, the data center support team may press "Power" for you, but only if you pay for 24x7 support.

Use cluster management software
Not long ago, Debian Package of the Day published an article about ClusterSSH, which I found very useful. There are several similar programs as well. While relatively simple, these programs can ease your life and decrease the number of errors.

Production system maintenance

Lately, I spent around 5 hours recovering a production system from a failure caused by improper actions of the engineering staff during an update procedure. I was lucky: it was not me who pressed the red button. But after that emergency I tried to analyse the causes of the problem and establish a better deployment process for software updates. Here are some thoughts somebody may find interesting:

Keep people without relevant experience out during important updates
Yes, some people are pretty good at development and may be very attentive. But a production system update requires an engineer to be a little paranoid. If you don't have bad experience in this area, you will probably be unable to predict the side effects and the bad things that might happen. Ideally, the update should be done by a pair of engineers; that suppresses the human factor considerably.

Have a detailed update plan
Well, if the first point is not an option... Prepare a detailed update plan with everything (shell commands, file and database operations, etc.) included. It would be wonderful if you tested it in your staging environment first. Ideally you should have an auto-install package (deb, rpm or whatever).

Have a detailed fallback plan
This is actually a continuation of the previous thought. Your update plan should include a description of a graceful fallback procedure. Some errors appear on the production system only, or your development team might miss something. So you must have a way to return to the previous version. Always.

Have a Plan B
In some cases the fallback scenario cannot help, for example after an improper execution of the update script: somebody deleted important files or dropped the wrong database. Yes, your update script contains a backup step. But what if that step failed silently?
The general rule is "always expect trouble". So a regular backup of production is a must-have. Some data centers (like ours) even provide daily server backups in the default service package.
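To catch a silently failing backup step, it helps to verify the dump before touching anything. A minimal sketch, assuming a mysqldump-style backup; the paths, size threshold and database name are illustrative only:

```python
import os
import subprocess

def backup_database(db_name, dump_path):
    """Dump the database and fail loudly if the backup looks broken."""
    with open(dump_path, "wb") as dump_file:
        # mysqldump is an assumption here; use whatever backup tool you trust.
        result = subprocess.run(["mysqldump", db_name], stdout=dump_file)
    if result.returncode != 0:
        raise RuntimeError(f"Backup of {db_name} failed with exit code {result.returncode}")
    # A silent failure often shows up as an empty or suspiciously small dump.
    if os.path.getsize(dump_path) < 1024:
        raise RuntimeError(f"Backup file {dump_path} is suspiciously small, aborting the update")

backup_database("chirp-20071210-prod", "/var/backups/chirp-20071210-prod.sql")
# ...run the destructive update steps only if we got this far.
```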

Wednesday, November 14, 2007

Are synchronous systems doomed?

Asynchronous communication is a basic principle of many distributed systems. Did you ever wonder why? Why do developers do so much work to handle things asynchronously when synchronous communication is much simpler?

If you look around you will find many examples of asynchronism. Most organizations exploit this principle, engineers exploit it, and even nature has built neural systems asynchronously.

So why? The point is the implicit parallelism and reliability introduced by such systems.

Let's imagine a chain of components calling each other, connected by relatively long "links". By "relatively long" I mean the information propagation time is comparable to the processing time or longer. In a synchronous system the first component in the chain will wait at least until the next component has received the message; in the worst case it will wait for the results of processing that message. So the transmission delays add up and increase the first component's idle time.

A link failure in that case is a disaster. The only general way to detect it is a timeout, which increases the request handling time (read: the first component's idle time). Moreover, there is no way to learn whether the request was processed or not.

Obviously, synchronous requests are hard to persist, since recovery of distributed state (which in a synchronous system usually lives in execution stacks or in the components themselves) is a formidable technical and administrative challenge.

Asynchronous systems are different. Components work in 'fire-and-forget' mode, so no time (or resources) is wasted on waiting. Components do not idle and may issue additional requests, process data and so on. The system is much more parallel.

Moreover, it is easy to design the system so that all state is encapsulated inside the messages. By persisting them, the system can achieve significant resilience to link and component failures.
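Here is a toy Python sketch of the fire-and-forget style (my own illustration, using an in-memory queue where a real system would use a persistent message broker). The caller enqueues a message that carries all the state it needs and immediately moves on:

```python
import queue
import threading

inbox = queue.Queue()

def worker():
    while True:
        message = inbox.get()          # state travels inside the message
        if message is None:
            break
        print("processed", message["request_id"], message["payload"])
        inbox.task_done()

threading.Thread(target=worker, daemon=True).start()

# A synchronous caller would block here until the remote side answered.
# The asynchronous caller hands the message over and keeps working.
inbox.put({"request_id": 1, "payload": "charge $10"})
inbox.put({"request_id": 2, "payload": "send receipt"})

inbox.join()  # in a real system the queue would be persisted, surviving restarts
```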

Now take a look at modern hardware design. In a common computer (actually in supercomputers too, though there can be exceptions) CPUs are quicker than communication channels. At first glance this seems to have little in common with multicomponent architectures. But the trend is now to increase the number of CPUs linked by _communication channels_, both at the single-computer level and at the distributed-cluster level.

eBay is building its systems in an asynchronous way; most payment transactions are processed asynchronously.

Don't you think it is time to learn message passing libraries and designs?

Thursday, October 25, 2007

A perfect wiki

We are using a wiki as knowledge storage on one of our projects. While it has many good collaborative features, it also has some inconveniences.

First of all, it is hard to write richly formatted documents or documents with a complex structure. So my requirements for a wiki are mostly about that:

1. The wiki should provide a view representing the hierarchy of pages.
2. Import/export to ODF or another word processor format is a must-have.
3. Easy table editing.
4. The wiki must send a daily summary of changes by email.
5. All the usual wiki stuff like styles and so on.

Friday, September 7, 2007

Things you'd like to avoid during system design

Lately I've been working on a geographically distributed, clustered web system and have gathered several ideas I would like to share. All of them are written in blood, sweat and lost dollars.

Don't rely on stability
"Yes, this is obvious" - you may say. But I was knowing this from start the project also. However during deployment and preproduction testing the life has shown many new tricks to me. Building network system all conscious developers will introduce some retries/recovery routines. However, absence of components stability has many implicit consequences.

For example, let's say we have a server on a 100 Mbit/s channel, which is enough to serve 50 concurrent jobs. So you set up the system to run 50 jobs and go relax a little. A day later the data center narrows your channel down to 10 Mbit/s for maintenance, and each job suddenly has only a tenth of the bandwidth it was sized for. Voila! Your servers start dying.

Or another example (this design problem is baked into the network libraries I use):
Let's say you have a network service registry which cluster members use to discover the addresses of the services they need. Everything works well, except that all nodes participating in the cluster check each other's connectivity and availability and are authorized to delete the records of dead servers. But we are in the real world, you know.

Let's say we have four servers: A, B, C and D. Now node D tries to ping node A and fails. What will it do? It will kill A's record in the node registry. BUT a ping failure does not mean A is dead; it means D cannot connect to A. Other nodes (B, C) may still work with A without a problem. Yet once D has deleted A's service record, ALL nodes will fail to talk to it! And the only error was to assume the A-D link has the same properties as the A-B and A-C links.

So be aware. Don't make assumptions about stability and link properties. It is better to make the system adaptive.
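One way to avoid this particular failure mode (my own sketch, not taken from any specific library) is to evict a record only when several nodes agree it is unreachable, instead of letting a single failed ping decide:

```python
# Hypothetical registry sketch: a record is evicted only when a majority of
# nodes report the target as unreachable, so one broken D-A link cannot
# remove A for everybody.

class ServiceRegistry:
    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.records = {}          # service name -> address
        self.failure_reports = {}  # service name -> set of reporting nodes

    def register(self, name, address):
        self.records[name] = address
        self.failure_reports[name] = set()

    def report_failure(self, name, reporting_node):
        """Record that reporting_node could not reach the service."""
        reports = self.failure_reports.setdefault(name, set())
        reports.add(reporting_node)
        # Evict only when more than half of the cluster agrees the node is dead.
        if len(reports) > self.cluster_size // 2:
            self.records.pop(name, None)

registry = ServiceRegistry(cluster_size=4)
registry.register("A", "10.0.0.1")
registry.report_failure("A", "D")   # only D complains: A stays registered
print("A" in registry.records)      # True
```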

Don't rely on DNS system
Actually, this is a corollary of the previous advice. The DNS system was designed long ago and its problems are well known. In most cases DNS servers work over UDP, which is not reliable. Even if nobody DoS-attacks your DNS, it may become a source of failures due to lost UDP packets and so on.

But the most important source of risk is the distributed nature of DNS. In most cases it is not under your full control: there are TLD servers, data center DNS servers, the domain registrar's servers and so on. When problems occur you may spend days talking to system administrators and trying to find out what is going on.
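A simple mitigation (again my own sketch; the hostname and address below are made up) is to keep a local cache of last-known-good addresses so a DNS hiccup does not take your cluster down with it:

```python
import socket

# Last-known-good addresses, refreshed whenever resolution succeeds.
ADDRESS_CACHE = {"db1.example.com": "10.0.0.12"}

def resolve(hostname):
    """Resolve via DNS, falling back to the cached address on failure."""
    try:
        address = socket.gethostbyname(hostname)
        ADDRESS_CACHE[hostname] = address
        return address
    except socket.gaierror:
        cached = ADDRESS_CACHE.get(hostname)
        if cached is None:
            raise
        return cached

print(resolve("db1.example.com"))
```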