Xssist Blog

Buzzwords - Clusters, Clustering

It looks like "unmetered", "unlimited", or "99.999% uptime" is no longer the buzzword in web hosting. The buzzword now is clusters. If you look for details, they are nowhere to be seen. Clusters, what are they? Web hosting companies would have you believe clusters are so much more reliable than a single shared server, or even your own dedicated server. Is this so? "Clusters" by itself really has no meaning: it describes nothing, or it means so many things that it needs an explanation to make any sense.

Questions: What exactly is being clustered? How does it work, what does it do?

Chances are, the web hosting company has no idea, other than that it's a buzzword to use in their advertising. To choose a good hosting provider, you need to separate the wheat from the chaff. To do that, you need to understand the basics. Clustering is the end solution to a problem. The problem is HA (High Availability), or rather how to achieve it. Let's start with a single server. How do you reduce downtime, i.e. make the server available more of the time? There are many ways for a server to fail. Each component whose failure can take the server down is called a Single Point of Failure, or SPOF. There are some very obvious SPOFs in a single server:

  • Power supply
  • Harddisk
  • RAM
  • Human (System administrator, programmers, users)

For power supplies, the chance of failure can be reduced by having redundant power supplies. The name is a misnomer, of course. The redundant power supply is not really redundant; it is there so that if one power supply fails, the spare continues to supply power to the server. If you can feed the power supplies from dual sources, i.e. one power utility feeds one power supply and another utility feeds the other, and add UPSes and generators on top, that will cover most of the problems the power might face. There's still the rest of the power train, but that is less likely to fail. A leakage to earth might also trip a circuit breaker. In any case, an enterprise server must have redundant power supplies.
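To put rough numbers on what a spare power supply buys, here is a minimal sketch in Python. It assumes the two supplies fail independently, which the shared power train and the tripped circuit breaker mentioned above can violate, and the 99% figure is a made-up number for illustration, not a measurement.

```python
def series_availability(*parts):
    """Availability of components that must ALL work (each one is a SPOF)."""
    result = 1.0
    for availability in parts:
        result *= availability
    return result


def redundant_availability(availability, copies=2):
    """Availability of independent, identical components where one working copy is enough."""
    return 1.0 - (1.0 - availability) ** copies


# Hypothetical figure for illustration only: a power supply that is up 99% of the time.
single_psu = 0.99
dual_psu = redundant_availability(single_psu, copies=2)

HOURS_PER_YEAR = 24 * 365
print(f"single PSU: {(1 - single_psu) * HOURS_PER_YEAR:.1f} hours down per year")
print(f"dual PSU:   {(1 - dual_psu) * HOURS_PER_YEAR:.1f} hours down per year")
```

The same `series_availability` function shows why every remaining SPOF matters: the availabilities of parts that must all work multiply together, so the server is never better than its worst single part.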

Harddisk. Easily the most crucial part of the server. Any other part can be replaced without losing data. Lose the harddisk, and we might lose data. RAID comes in here, usually RAID 1, which is mirroring. For RAID, a higher number does not mean better. In particular, RAID 5 is largely useless for hosting busy websites, as RAID 5 implementations usually perform very badly when there are many concurrent reads and writes. For servers spec-ed for enterprise use, you need multiple storage cards connected to multiple, mirrored storage units, so that the storage card, i.e. the host bus adaptor (HBA), does not become the SPOF.
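One common way to see why RAID 5 struggles with mixed reads and writes is the small-write penalty: a small random write costs 2 disk I/Os on RAID 1 and 4 on RAID 5. The sketch below applies that rule of thumb. It ignores controller caches and full-stripe writes, and the disk count and per-disk IOPS are hypothetical numbers chosen only for illustration.

```python
# Back-end disk I/Os needed per small random write issued by the host.
WRITE_PENALTY = {
    "raid1": 2,  # write the block to both mirrors
    "raid5": 4,  # read old data, read old parity, write new data, write new parity
}


def host_iops(raw_iops, write_fraction, level):
    """Host-visible random IOPS for an array whose disks deliver raw_iops in total."""
    penalty = WRITE_PENALTY[level]
    read_fraction = 1.0 - write_fraction
    return raw_iops / (read_fraction + write_fraction * penalty)


# Hypothetical array: 4 disks at 150 IOPS each, workload with 50% writes.
raw = 4 * 150
for level in ("raid1", "raid5"):
    print(level, round(host_iops(raw, write_fraction=0.5, level=level)))
```

With a read-only workload the two levels come out the same in this model; the gap opens up as the write fraction grows.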

RAM for servers needs ECC at a minimum. Beyond that comes mirrored memory (the RAM equivalent of RAID 1), and further on, hot-swappable memory for the truly high end.

The human, otherwise known as the system operator, sysadmin, programmer etc., is unfortunately one of the weakest links. This is the person who goes rm -rf / ohhh.. oops. OR trips over a power cable OR pulls the wrong harddisk out of a RAID array. Training and experience are important here.

After all of the above is covered, and if the availability is still insufficient, we can start thinking about clustering in the sense of having a duplicate server, i.e. an active-passive cluster configuration. Duplicate most of everything you have above, and have the duplicate equipment sit around (passive) doing nothing except take over when your active server dies. How does the passive server (otherwise known as a node) know when to become active? Heartbeat comes in here. This can take the form of a crossover cable between the two nodes. Software on both nodes tries to communicate over it, and if the communication gets cut off, the passive node tries to become active. However, the heartbeat link can itself be a SPOF, so duplicate that, and we have 2 heartbeat links. Oh, and if possible, please use a separate NIC (network interface card) for each link. Then we run into the problem of storage. How do we ensure the passive node has up-to-date information? Shared storage comes into the picture, where each node has access to the same storage. Then we run into the problem of ensuring the active node is not still accessing the storage, i.e. that it is actually dead, before the passive node takes over the storage...
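The takeover decision described above can be sketched as a small Python model. This is an illustration under assumptions, not how any particular heartbeat package works: the class name, the link names `hb0` and `hb1`, the 5 second timeout and the `fence_peer` callback are all hypothetical. `fence_peer` stands in for whatever confirms that the old active node is really cut off from the shared storage, a step often called fencing.

```python
from dataclasses import dataclass, field


@dataclass
class PassiveNode:
    """Hypothetical model of the passive node's takeover decision."""
    links: tuple = ("hb0", "hb1")   # two heartbeat links, ideally on separate NICs
    timeout: float = 5.0            # seconds of silence before a link counts as dead
    last_seen: dict = field(default_factory=dict)
    active: bool = False

    def heartbeat(self, link, now):
        """Record a heartbeat received from the active node on one link."""
        self.last_seen[link] = now

    def peer_silent(self, now):
        """True only if EVERY heartbeat link has been quiet longer than the timeout."""
        return all(now - self.last_seen.get(link, 0.0) > self.timeout
                   for link in self.links)

    def tick(self, now, fence_peer):
        """Take over only if all links are silent AND the old active node is confirmed cut off."""
        if not self.active and self.peer_silent(now) and fence_peer():
            self.active = True
        return self.active


node = PassiveNode()
node.heartbeat("hb0", now=10.0)
node.heartbeat("hb1", now=10.0)
node.heartbeat("hb1", now=14.0)                        # hb0 cable pulled, hb1 still alive
print(node.tick(now=18.0, fence_peer=lambda: True))    # one link still fresh: stay passive
print(node.tick(now=30.0, fence_peer=lambda: False))   # both silent but peer not fenced: stay passive
print(node.tick(now=30.0, fence_peer=lambda: True))    # both silent and peer fenced: take over
```

Requiring silence on every link is what makes the second heartbeat link useful: a single pulled cable no longer triggers a takeover while the active node is still writing to the storage.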

It goes on and on, and along the way, SAN (storage area network) gets introduced, or perhaps DRBD for the DIY poor man's version of clustering.

When do you fail over a node? If you have mail, DNS, web, POP, IMAP etc. on a server, does your software try restarting the individual daemons first? How many times does it try before failing over the node? Does 1 failed service justify failing over the whole node? After failing over, how do you fail back? Or do you have 1 mail cluster, 1 web cluster, 1 POP cluster etc.? And then how do you share storage?
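Each of these questions has to end up as an explicit rule somewhere. Below is a minimal sketch of one possible policy in Python, assuming a made-up limit of 3 restart attempts and a made-up set of "critical" services. The function name and the return values are hypothetical and only show how many decisions there are to make.

```python
def decide(failed, critical, max_restarts=3):
    """Pick the next action for a node.

    failed maps each currently failed service to the number of restarts
    already attempted; services not in the dict are considered healthy.
    Returns (action, service) where action is "failover", "restart",
    "alert" or "ok".
    """
    exhausted = [name for name, tries in failed.items() if tries >= max_restarts]

    # A critical service that will not come back justifies failing over the node.
    for name in exhausted:
        if name in critical:
            return ("failover", name)

    # Otherwise keep trying to restart anything that still has attempts left.
    for name, tries in failed.items():
        if tries < max_restarts:
            return ("restart", name)

    # A non-critical service is stuck: call a human, do not move the whole node.
    if exhausted:
        return ("alert", exhausted[0])
    return ("ok", None)


critical = {"web", "mail"}                 # hypothetical choice of critical services
print(decide({"pop": 1}, critical))        # restart the daemon first
print(decide({"web": 3}, critical))        # restarts used up on a critical service
print(decide({"pop": 3}, critical))        # stuck, but not worth a failover
```

Note that this says nothing yet about failing back, or about what happens to the other services on the node when it fails over because of one of them.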

The point is, most web hosting companies that do not explain what their clustering does are just doing a poor man's version without a good background, with cheap servers, without reducing the SPOFs that should be fixed within the server, and are a worse risk than if they provided you with a shared or a single server. Imagine: using NFS for the mail spool? That's what some of these guys will try.

Clustering is a double-edged sword. Do it well, and it just might work. Do it less than perfectly, and you are on the brink of disaster: everything looks great, and one day it just goes poof. Where possible, keep things simple.

07 Feb 2008
