RAIDers of the lost sleep

Early in my career, I rebooted a server and noticed an error saying that one of the RAID drives had failed. The server was able to keep running, but the drive needed to be replaced, so one of my colleagues came over with a new one. The drive was hot-swappable, so he was quite cheerful about the fact that we wouldn’t need to shut the server down first. However, we disagreed about which drive had failed: the error message referred to drive 2, and there were 5 drives in total, but I thought the numbering would start at 0 while he thought it would start at 1. He outranked me, so he pulled out the second drive. Unfortunately, this turned out to be the wrong one (i.e. one of the working drives), so the entire server crashed, and we had to spend our Friday night re-installing Windows from scratch.

So, be aware that computers start counting at zero, and it’s prudent to know which RAID drive is which! (I put numbered stickers on the drives after that.) Nowadays, servers normally have an LED on each drive, so you can identify the faulty one by its colour (amber rather than green), which makes life easier. On the whole, I think that RAID works pretty well, and I have been able to hot-swap drives since then without any downtime, so the end users never even realised that there was a problem.
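The off-by-one trap above can be sketched in a few lines. This is a hypothetical illustration (the function name and the zero-based assumption are mine, not from any real RAID controller’s API): it maps a reported drive number to a physical slot under each convention, showing why “drive 2” meant different things to the two of us.

```python
# Hypothetical sketch: mapping a reported drive number to a physical slot.
# Assumes the controller reports zero-based indices, as in the story above.

def physical_slot(reported: int, zero_based: bool) -> int:
    """Return the 1-based physical slot (counting from the left) for a
    reported drive number, under the given indexing convention."""
    return reported + 1 if zero_based else reported

# The error message said "drive 2"; the enclosure held 5 drives.
print(physical_slot(2, zero_based=True))   # -> 3: the third physical slot
print(physical_slot(2, zero_based=False))  # -> 2: the one that got pulled
```

With zero-based numbering, “drive 2” is the third drive from the left; pulling the second one takes out a healthy member of the array.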

Yesterday, I noticed a disk failure in another server: one drive from a mirrored pair (RAID 1). That server is under a support contract, so I got Dell to send out a replacement by courier. I installed the new drive, and it started to rebuild; it’s a 36 GB drive, and it seems to copy data at about 1 GB/minute. When it was about halfway through, the other hard drive also failed (i.e. the one I was copying data from). So, I had two hard drives, but neither was operational, and the server went splat. At that point, I had to re-install the operating system from scratch, which took a lot longer. The frustrating thing is that if the second failure had happened 20 minutes later, I would have been fine, because the rebuild would have finished by then. Instead, it took me another 11 hours to fix the problem, which is why I’m typing this at 3am.
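The back-of-envelope arithmetic behind that “20 minutes later” remark looks like this. The figures (36 GB capacity, roughly 1 GB/minute rebuild rate) are the ones observed above; real rebuild rates vary with controller and load.

```python
# Rough rebuild estimate for the mirrored pair described above.
# Assumed figures, taken from the story: 36 GB drive, ~1 GB/minute.

capacity_gb = 36
rate_gb_per_min = 1.0

total_minutes = capacity_gb / rate_gb_per_min
remaining_at_halfway = total_minutes / 2

print(f"Full rebuild: ~{total_minutes:.0f} minutes")          # ~36 minutes
print(f"Remaining when halfway: ~{remaining_at_halfway:.0f} minutes")  # ~18 minutes
```

So the rebuild was roughly 18 minutes from finishing when the second drive died, which is why another 20 minutes of luck would have saved the night.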

Quick tip for anyone else in this situation: when you receive a replacement drive, install it immediately. If you’re on your way out to lunch when the courier arrives, install the drive and eat later. If you’re on your way to the toilet, cross your legs while you install the drive. If you need to get your boss to sign some paperwork before he leaves for the day, install the drive first and sort the paperwork out tomorrow. You get the idea…

In a case like this, it may also be prudent to keep a cold spare around, i.e. an identical (or bigger) hard drive sitting on a shelf in the server room. That way, you can replace the failed drive immediately, and when the courier turns up to swap your failed drive for a new one, that replacement drive becomes your new cold spare. You still risk the same problem, i.e. if both drives in the pair fail before you’ve finished the first rebuild then the server will die, but you can shorten the window during which you’re running on a single drive by eliminating the 4 hour wait for a courier. In that case, the same principle applies as above: as soon as you know about the failure, installing the cold spare should become your immediate Top Priority, taking precedence over everything else.
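Putting rough numbers on that exposure window makes the case for the cold spare. These figures are assumptions based on the story above (a 4 hour courier wait and a ~36 minute rebuild at 1 GB/minute), not general guarantees:

```python
# Rough comparison of single-drive exposure time, per the reasoning above.
# Assumed numbers: ~4 hour courier wait, ~36 minute rebuild (36 GB at 1 GB/min).

rebuild_min = 36
courier_wait_min = 4 * 60

with_cold_spare = rebuild_min                     # start rebuilding straight away
without_cold_spare = courier_wait_min + rebuild_min

print(f"Exposure with a cold spare:   ~{with_cold_spare} minutes")    # ~36
print(f"Exposure waiting for courier: ~{without_cold_spare} minutes") # ~276
```

Roughly 36 minutes of exposure versus about four and a half hours: the cold spare cuts the risky single-drive period by an order of magnitude.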

The snag is that you will probably need to buy your cold spare from the same company (e.g. Dell); you won’t void the warranty by installing a different drive, but they won’t replace a drive that they didn’t supply. That’s understandable, since they aren’t responsible for another company’s manufacturing defects, but buying from them will almost certainly be more expensive than buying a similar drive from a “normal” supplier (e.g. Dabs or Misco), maybe even twice the price. You will need to weigh that markup against the impact of extended downtime if the server fails, and decide whether it’s worth the money.
