This Space for Rent

Supertrivial kernel hack of the day

Modern Linux kernels have the almost completely desirable feature that if they load and then can't find a root filesystem, they immediately panic() with a fairly detailed error message. (Kernels as old as the 2.0.28 one I use with Mastodon Linux may do this too, but I wouldn't know, because I don't load off initrds; if you try to install that OS onto a machine that doesn't have a supported hard disk controller, it won't even get off the ground.) This is a good thing, with only one teeny problem: the kernel calls panic() to stop the machine.

When the init process calls panic(), it pretty much locks up the machine. You can't do the 3 finger salute to reboot the machine, scrollback dies, and, most importantly, there's a kernel panic message on the screen, which tends to concentrate the mind in a distressing manner. In my career as a Linux programmer (since 1993) I've played through a little skit involving the kernel panic message more times than I care to think about:

"Something went wrong with the system! When we tried to boot up, it kernel panicked; you need to drop everything and find out what's wrong with the kernel!"
"What else did it say?"
"I don't know; it just kernel panicked!"
"Can I take a look at the machine?"
"Oh, no, we're too busy for that; we had to reboot the machine into other-OS to do our work."

And almost invariably, the reason for this kernel panic is that the machine is trying to load off a nonfunctioning initrd, and when it can't mount a root filesystem, it panics as you'd expect it would.

The messages on the console read something like:

scsi0:A:0:0: Tagged Queuing enabled.  Depth 32
(scsi0:A:6): 80.000MB/s transfers (40.000MHz, 16bit)
  Vendor: IBM       Model: DDS Gen5          Rev: A060
  Type:   Sequential-Access                  ANSI SCSI revision: 03
Module aic79xx loaded, with warnings
LVM version 1.0.8-2(26/05/2004) module loaded
Loading lvm-mod.o module
Warning: kernel-module version mismatch
        /lib/lvm-mod.o was compiled for kernel version 2.4.21-37.EL-2
        while this kernel is version 2.4.21-37.EL-3
Warning: loading /lib/lvm-mod.o will taint the kernel: forced load
  See http://www.tux.org/lkml/#export-tainted for information about tainted modules
Module lvm-mod loaded, with warnings
Mounting /proc filesystem
Creating block devices
VFS: Cannot open root device "%s" or %s
Please append a correct "root=" boot option
Kernel panic: VFS: Unable to mount root fs on %s

But what the users see is something more like:

Blah! Blah! Blah! Blah! Blah! Blah! 
Blah! Blah! Blah! Blah! Blah! Blah! 
Blah! Blah! Blah! Blah! Blah! Blah! 
Blah! Blah! Blah! Blah! Blah! Blah! 
Blah! Blah! Blah! Blah! Blah! Blah! 
Blah! Blah! Blah! Blah! Blah! Blah! 
Blah! Blah! Blah! Blah! Blah! Blah! 
Blah! Blah! Blah! Blah! Blah! Blah! 
KERNEL PANIC!!!!
Blah! Blah! Blah! Blah! Blah! Blah! 
Blah! Blah! Blah! Blah! Blah! Blah! 

And thus it gets reported as just a kernel panic (because kernel programmers are apparently coding gods™ who can diagnose bugs with a swish-swish), which we are expected to be able to faithfully reproduce on our entirely different hardware. At least half the time, this has led to bugtracking database entries that escalate up to being very high priority issues (and boy, lemme tell you I'm happy I don't have that job anymore), which means we have to drop everything to get them fixed.

We've been testing out some new hardware at my current job, and have been going through the kernel panic skit at regular (infrequent, but regular) intervals, and I finally snapped (patch vs. Linux 2.4.21):

--- linux/init/do_mounts.c~     2006-01-27 11:45:22.000000000 -0800
+++ linux/init/do_mounts.c      2006-01-27 12:27:28.000000000 -0800
@@ -376,12 +376,23 @@
                  Allow the user to distinguish between failed open
                  and bad superblock on root device.
                */
-               printk ("VFS: Cannot open root device \"%s\" or %s\n",
-                       root_device_name, kdevname (ROOT_DEV));
-               printk ("Please append a correct \"root=\" boot option\n");
-               panic("VFS: Unable to mount root fs on %s",
-                       kdevname(ROOT_DEV));
+               break;
        }
+
+       printk ("\n"
+               "********************************************\n"
+               "*              SYSTEM HALTED               *\n"
+               "********************************************\n");
+       if (root_device_name)
+           printk (" Cannot open root device \"%s\" or %s\n",
+                   root_device_name, kdevname (ROOT_DEV));
+       else
+           printk (" Cannot open root device on %s\n", kdevname(ROOT_DEV));
+       printk (" Please append a correct \"root=\" boot option\n");
+       printk ("\n********************************************\n");
+
+       while (1) schedule_timeout(MAX_SCHEDULE_TIMEOUT);
+       /* should _never_ happen */
        panic("VFS: Unable to mount root fs on %s", kdevname(ROOT_DEV));
 out:
        putname(fs_names);

This will hopefully kill multiple birds with one stone. Not having the panic means that at least the users will report a different mysterious system error to us. And because the kernel is spinning on schedule_timeout() instead of sitting in panic(), the three-finger salute and the built-in Linux scrollback keep working, so on the rare occasions when I can actually get to the crashed machine before the users have gone away to do something else, I'll be able to scroll back to the top of the kernel messages and figure out what went wrong.
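If you want to poke at the wording of the new message without rebuilding a kernel, here is a rough userspace sketch of just the message-building part of the patch. It is an illustration, not kernel code: format_halt_banner() and mentions_panic() are made-up names, snprintf() stands in for printk(), the dev_name argument stands in for whatever kdevname(ROOT_DEV) would return, and the schedule_timeout() loop is left out entirely.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define RULE "********************************************\n"

/*
 * Hypothetical userspace stand-in for the patched failure path.
 * root_name mirrors root_device_name (may be NULL); dev_name stands in
 * for the string kdevname(ROOT_DEV) would produce.
 * Returns 0 if the whole banner fit in buf, -1 if anything was truncated.
 */
static int format_halt_banner(char *buf, size_t size,
                              const char *root_name, const char *dev_name)
{
    char detail[160];
    int d, n;

    assert(buf != NULL && size > 0 && dev_name != NULL);

    /* Same two-way branch as the patch: name the root device if we have it. */
    if (root_name)
        d = snprintf(detail, sizeof detail,
                     " Cannot open root device \"%s\" or %s\n",
                     root_name, dev_name);
    else
        d = snprintf(detail, sizeof detail,
                     " Cannot open root device on %s\n", dev_name);
    if (d < 0 || (size_t)d >= sizeof detail)
        return -1;

    n = snprintf(buf, size,
                 "\n" RULE
                 "*              SYSTEM HALTED               *\n"
                 RULE
                 "%s"
                 " Please append a correct \"root=\" boot option\n"
                 "\n" RULE,
                 detail);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}

/* The whole point of the patch: the word "panic" should not be on screen. */
static int mentions_panic(const char *text)
{
    return strstr(text, "panic") != NULL;
}
```

Calling format_halt_banner() with a 512-byte buffer, a root name such as "sda1", and a device string such as "08:01" (both invented for the example) fills the buffer with the boxed message, and mentions_panic() on the result comes back false, which is the behavior the patch is after.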

Perhaps the novelty of the new error message will startle the users into paying attention to it.