Oh, the joys of computer hardware
Around 10am this morning, pell went poof and stopped responding to anything except for pings. This has happened before, so I suspect I know what the problem is. Alas, back when I built the last incarnation of pell (in 1998) I decided to buy the most reliable drives I could find -- a couple of 4.5gb IBM wide SCSI drives -- and, yes, they've been pretty reliable. I replaced one of them (a 2gb model) with an IBM LVD SCSI drive sometime in 2000, but other than that they've been chirping happily along for the past eight years. (About the only thing that I've had die for certain has been various power supplies; most PC power supplies don't deal well with being stuffed in a machine room, turned on, and kept running until time_t rolls over.)
Well, it's been eight years for the root disk, and it's developed a few bad blocks. And occasionally the system attempts to read one of these blocks and bobbles a disk lock, which means that the disk turns into a rotating anvil and everything that wants to talk to it (or any other disk on the machine) comes to a screeching halt. Thus the crash. The fun is that now that these ridiculously reliable hard drives have failed I need to replace them, and you can't buy 4gb drives anymore unless you want to go to high-reliability server repair houses and pay ridiculous multiples of US$100 for them.
I could get IDE disks, but remember that pell is old and it runs a version of Linux (2.0.28+orc) that, um, doesn't take full advantage of IDE disks. It does support dma (I backported dma from a 2.1 kernel when I started moving servers en masse over from SCSI to IDE) but it doesn't support ultra dma(tm) or, I suspect, disks that are as large as the smallest of modern udma IDE disks.
It would have been much easier if I had used less reliable hardware; if I had flaky disks, I'd have to rev the disks every year instead of every decade, and that would force me to backport modern Linux drivers to the 2.0.28 kernel. (Don't laugh at this idea. The two ancient kernel trees I've got are 1.2.13, which is 10mb, and 2.0.28, which is 31mb. The 2.4.21(+r*dh*t hacks) kernel I've been getting paid to maintain at work staggers in at 213mb after I've deleted the objects, and the 2.6.9(+r*dh*t hacks) kernel I'm replacing the 2.4.21 kernel with comes in at 304mb. It takes a lot less time to recompile 2.0.28 than it does to recompile 2.6.9, even without compiling the driver modules for 2.6.9.) To add insult to injury, the newer versions of gcc that are recommended for compiling the newer kernels have "fixed" asm() support (and by fixed, I mean broken; the poorly documented changes in asm() syntax thoroughly break every bit of assembly that can be found in libc 4.8.x).
So it's become a big problem, and all because I chose reliable hardware. Sure, it's nice to look at six years of operation and realize that, aside from transport time to get to and from the colo, the downtime can be measured in a handful of hours, and even when I count travel and lockout time, it's still less than three days (pell has been very good at dying just after I get on the bus to go home, and I'm not going to turn around and come back downtown to fix it that night), but when things start to go wrong and you look at spending an order of magnitude more money to replace the reliable parts? Well, let's just say that five nines is overrated, and you may make up for the savings by paying through the nose when the warranty finally expires.
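For the curious, here is a rough sketch of the arithmetic behind that "five nines" crack. It treats the worst case above (three full days of downtime over six years) as the input and uses a 365.25-day year. The function names are made up for illustration, and the three-day figure is an upper bound, not a measurement. By this reckoning pell lands at roughly 99.86%, which is closer to two nines than five; five nines would allow only a bit over five minutes of downtime a year.

```python
# Back-of-the-envelope availability math. The inputs are the rough
# figures from the post (under three days down in six years), used
# here as a worst case; the helper names are hypothetical.

HOURS_PER_YEAR = 365.25 * 24  # assumption: average year length


def availability(downtime_hours, years):
    """Fraction of the period during which the machine was up."""
    return 1.0 - downtime_hours / (years * HOURS_PER_YEAR)


def allowed_downtime_minutes_per_year(nines):
    """Downtime budget per year for an N-nines availability target."""
    return 10 ** (-nines) * HOURS_PER_YEAR * 60


if __name__ == "__main__":
    worst_case = availability(3 * 24, 6)
    print(f"worst case availability: {worst_case:.4%}")
    print(f"five nines budget: {allowed_downtime_minutes_per_year(5):.2f} minutes/year")
```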