A chain is only as strong as its weakest link. On 2 separate occasions in the past 2 months, the sysadmin proved to be the weakest link.
Similar problem in each case: faulty RAM in the server, which had to be replaced. Unfortunately, tracing one faulty RAM module out of
several (e.g. 16 modules in a 32GB server populated with 2GB modules) can be more difficult than it seems. The OS might indicate which CPU and which
slot has the ECC error. The server might have a fault panel that points out the DIMM slot. However, sometimes we just have to swap modules
in and out.. and each time that means unplugging the power cord, KVM cables, network cable, SCSI cables, etc., pulling the server out on its
rails, opening the casing, swapping the RAM, putting everything back and powering on the server. If it works, good; if not, repeat all of the above.
On one occasion, during booting, error messages started coming up from the SCSI card.. oops, SCSI cable not connected. The sysadmin starts
panicking.. what if the configuration gets corrupted? Quick, pull out the power cable.... hmmm... the screen continues scrolling.. oops..
pulled out the power cable of another server.
On a separate occasion, we needed to replace RAM in a server and install an APC fan, the ACF002, in the rack to improve air flow. We pushed the fan
module too far back into the rack.. and managed to dislodge an external SCSI cable of another server, whose screws were not tightened
(lesson learnt here: tighten all screws for cables, especially critical ones like the SCSI cable). We did not realise the SCSI cable
was dislodged. The first sign of trouble was the fault light turning on for 1 drive out of 6 in a SCSI array configured as RAID 10.
Okay, that's not too bad.. just a drive failure, right? Okay, let's just replace the faulty SCSI hard disk.. pull out the faulty disk,
insert a new disk. The fault light changed status, and the disk started syncing, or so I thought. Then.. the fault lights for 4 out of the
6 drives turned on. Ouch. 4 failed drives out of 6 is enough for data loss even on RAID 10. Restarted the server, and accessed the
RAID card menu.. pulled out 3 of the drives marked faulty and replaced them, then tried marking the remaining faulty drive as good, i.e. 1 drive
out of each RAID 1 pair in the RAID 10 would be marked good, and the RAID array would be OK. The RAID card could not detect some of the drives..
getting worse and worse. Checked the SCSI cable.. aha.. it's dislodged. Pushed the cable in, tightened the screws.. marked 3 of the 6
drives as good.. it worked! Resynced the other 3 drives as well.
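The failure arithmetic above is worth spelling out: a RAID 10 array survives only as long as every mirror pair keeps at least one working member, so with 3 mirror pairs, any 4 failures must wipe out both halves of some pair. A small sketch (the pairing of drives into mirrors is my assumption for illustration, not the actual layout of our array):

```python
from itertools import combinations

def raid10_survives(failed, mirror_pairs):
    """A RAID 10 array survives as long as every mirror pair still has
    at least one working member; losing both halves of any pair loses data."""
    failed = set(failed)
    return all(not failed.issuperset(pair) for pair in mirror_pairs)

pairs = [(0, 1), (2, 3), (4, 5)]   # 6 drives striped across 3 mirrors

# Losing one drive per mirror (3 of 6) is survivable:
print(raid10_survives({0, 2, 4}, pairs))   # True

# But any 4 failures out of 6 take out both halves of some pair:
print(all(not raid10_survives(set(c), pairs)
          for c in combinations(range(6), 4)))  # True
```

That is also why marking one drive in each mirror pair as good was enough to bring the array back: with one valid half per mirror, the other 3 drives could simply resync.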