It looks like "unmetered", or "unlimited", or 99.999% uptime is no longer the buzz word in web hosting. The buzzword now is clusters.
If you look for details, they are no where to be seen. Clusters, what are they? Web hosting companies would have you believe clusters
are so much more reliable than a single shared server, or even your own dedicated server. Is this so? "Clusters" by itself really have
no meaning, it describes nothing or it means so many things that it really needs an explanation to make any sense.
Questions: What exactly is being clustered? How does it work, what does it do?
Chances are, the web hosting company has no idea other than that it's a buzzword to use in their advertising. To choose a good hosting
provider, you need to separate the wheat from the chaff, and to do that, you need to understand the basics.. clustering is the end
solution to a problem. The problem is HA (High Availability), or how to achieve HA. Let's start with a single server. How do you
reduce downtime, i.e. have it available more, or have higher availability? There are many ways for a server to fail. Each component
whose failure can take the server down is called a Single Point of Failure, or SPOF. There are some very obvious SPOFs in a single
server. They are as follows:
- Power supply
- Hard disk
- RAM
- Human (System administrator, programmers, users)
For power supplies, the chance of failure can be reduced by having redundant power supplies. "Redundant" is a misnomer, of course: the
redundant power supply is not really redundant, it is there so that if one power supply fails, the spare continues to supply power
to the server. Better still, feed the power supplies from separate sources, i.e. one utility feed powers one supply, another feed
powers the other, and add UPSes and generators on top; that covers most of the problems the power might face. The internal power
train can still fail, but that is less likely, and a leakage to earth might still trip a circuit breaker, etc. In any case, an
enterprise server must have redundant power supplies.
Hard disks. Easily the most crucial part of the server: any other part can be replaced without losing data, but lose a hard disk and
we might lose data. This is where RAID comes in, usually RAID 1, which is mirroring. With RAID, higher numbers do not mean better. In
particular, RAID 5 is largely useless for hosting busy websites: its parity updates turn small random writes into read-modify-write
cycles, so most implementations perform very badly under mixed reads and writes. For servers spec'ed for the enterprise, you want
multiple storage cards connecting to mirrored storage, so that the storage card, i.e. the host bus adapter (HBA), does not become
the SPOF.
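To see why mirroring helps, here is some back-of-envelope arithmetic (a minimal sketch; the failure rate and rebuild window below are made-up illustrative numbers, not vendor figures):

    # Rough availability math for RAID 1 (mirroring).
    # Both numbers are illustrative assumptions, not vendor data.
    p_disk = 0.03        # assumed chance a given disk dies within a year
    rebuild_days = 3.0   # assumed time to replace and resync a failed disk

    # A RAID 1 pair loses data only if the surviving disk also dies
    # during the rebuild window. This assumes independent failures,
    # which is optimistic (disks from the same batch often die together).
    p_pair = 2 * p_disk * (p_disk * rebuild_days / 365.0)

    print(f"single disk, chance of data loss per year: {p_disk:.2%}")
    print(f"RAID 1 pair, chance of data loss per year: {p_pair:.6%}")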
RAM for servers needs ECC at a minimum, then memory mirroring (RAID 1 for RAM), and hot-swappable DIMMs for the truly high end.
The human, otherwise known as the system operator, or sysadmin, or programmer, etc., is unfortunately one of the weakest links.
This is the person who types rm -rf / .. ohhh, oops. Or trips over a power cable. Or pulls the wrong hard disk out of a RAID array.
Training and experience are important here.
After all of the above is covered, and if the availability is still insufficient, we can start thinking about clustering in the sense
of having a duplicate server, i.e. an active-passive cluster configuration. Duplicate most of the above and have the second set of
equipment sitting around (passive) doing nothing except taking over when your active server dies. How does the passive server
(otherwise known as a node) know when to become active? This is where the heartbeat comes in. It can be in the form of a crossover
cable between the two nodes: software on both nodes tries to communicate with the other, and if the communication gets cut off, the
passive node tries to become active.. However, the heartbeat link can itself be a SPOF.. so duplicate that, and we have 2 heartbeat
links.. oh, and if possible, please get a separate NIC (network interface card) for each link. Then we run into the problem of
storage. How do we ensure the passive node has up-to-date data? Shared storage comes into the picture, where each node has access
to the same storage. Then we run into the problem of ensuring the active node is not still writing to that storage, i.e. that it is
actually dead, before the passive node takes the storage over...
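To make the heartbeat idea concrete, here is a minimal sketch of the passive node's side, assuming two heartbeat links. The addresses, timings and the become_active() helper are made-up placeholders for illustration, not any real cluster software's API:

    import socket
    import time

    # Hypothetical addresses of the two dedicated heartbeat links,
    # each assumed to sit on its own NIC (addresses are made up).
    HEARTBEAT_LINKS = [("10.0.0.1", 694), ("10.0.1.1", 694)]
    MISSES_BEFORE_FAILOVER = 3   # consecutive silent checks before acting
    INTERVAL = 2.0               # seconds between heartbeat checks

    def link_alive(addr):
        """True if the active node answers on this heartbeat link."""
        try:
            with socket.create_connection(addr, timeout=1.0):
                return True
        except OSError:
            return False

    def become_active():
        # Placeholder: a real takeover must first fence the old active
        # node (e.g. cut its power) so it cannot keep writing to the
        # shared storage, then mount the storage, take over the service
        # IP address, and start the daemons.
        print("fencing old node, taking over storage and services")

    misses = 0
    while True:
        # Declare the peer dead only if BOTH links are silent; one dead
        # link is more likely a failed cable or NIC than a dead node.
        if any(link_alive(addr) for addr in HEARTBEAT_LINKS):
            misses = 0
        else:
            misses += 1
            if misses >= MISSES_BEFORE_FAILOVER:
                become_active()
                break
        time.sleep(INTERVAL)

Note the "both links silent" rule: that is exactly what the second heartbeat link on its own NIC buys you.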
It goes on and on, and along the way a SAN (storage area network) gets introduced, or perhaps DRBD for the DIY poor man's version of
clustering.
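DRBD, roughly speaking, mirrors a block device over the network between the two nodes, like RAID 1 across machines. Here is a toy sketch of the synchronous idea, with a made-up peer address and wire format; it illustrates the concept only and is not DRBD's actual code or protocol:

    import socket

    # Toy synchronous replication: a write counts as done only after
    # both the local disk and the peer acknowledge it. The peer
    # address and the ACK wire format here are invented placeholders.
    PEER = ("10.0.0.2", 7788)

    def replicated_write(local_file, offset, data):
        # 1. write locally and flush to disk
        local_file.seek(offset)
        local_file.write(data)
        local_file.flush()
        # 2. ship the same write to the peer and wait for its ack
        with socket.create_connection(PEER, timeout=5.0) as s:
            s.sendall(offset.to_bytes(8, "big") + data)
            if s.recv(3) != b"ACK":
                raise IOError("peer did not acknowledge the write")
        # only now is the write considered durable on both nodes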
When do you fail over a node? If you have mail, DNS, web, POP, IMAP etc. on one server, does your software try restarting the
individual daemons? How many times does it try before failing over the node? Does 1 failed service justify failing over the whole
node? After failing over, how do you fail back? Or do you have 1 mail cluster, 1 web cluster, 1 POP cluster etc.? How do you share
storage?
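As one possible shape of an answer, here is a sketch of a restart-before-failover policy. The service names, commands and limits are arbitrary assumptions for illustration, not a recommendation:

    import subprocess
    import time

    # Hypothetical per-service policy; names and limits are made up.
    SERVICES = ["mail", "web", "pop", "imap"]
    MAX_RESTARTS = 3     # restart attempts per service before escalating
    RESTART_WAIT = 5.0   # seconds to wait after each restart attempt

    def service_healthy(name):
        """Placeholder check; a real one would probe the daemon itself."""
        return subprocess.call(["service", name, "status"]) == 0

    def node_needs_failover():
        for name in SERVICES:
            if service_healthy(name):
                continue
            # Restart the single daemon first; failing over the whole
            # node because of one sick service is usually an overreaction.
            for _ in range(MAX_RESTARTS):
                subprocess.call(["service", name, "restart"])
                time.sleep(RESTART_WAIT)
                if service_healthy(name):
                    break
            else:
                return True   # restarts exhausted: escalate to failover
        return False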
The point is, most web hosting companies that do not explain what their clustering is doing are just doing a poor man's version
without a good background, with cheap servers, without reducing the SPOFs that should be fixed within the server, and are a worse
risk than if they provided you with shared hosting or a single server. Imagine: use NFS for the mail spool? That's what some of
these guys will try.
Clustering.. is a double-edged sword. Do it well, and it just might work.. do it less than perfectly, and you are at the brink of
disaster.. everything looks great, and one day it just goes poof. Where possible, keep things simple.
07 Feb 2008