This Space for Rent

kernel programming fun

At work, we're, um, stressing the poor Linux kernel. If you have a lot of users running Pick, even a really big system goes completely to hell. One of the symptoms of the machine going to hell is that it gets itself into a state where 9000 or so processes are beating away at the machine trying to get pages, and eventually a process like lpd tries to fork, can't reserve memory for the half a dozen picky little things that fork() wants to copy, and bombs out even though we've still got 15GB free in swap. If I'm reading the 2.4 kernel properly, when kmalloc() can't get a block because there's no free core [or not enough free core; the 2.4 kernel never completely exhausts core, but gets to within 10MB of exhaustion and then starts pretending that there's no more], it kicks kswapd and puts the forking process at the end of the task queue, hoping that by the time it comes back around the 4 or 5 pages it wants will be free.

Well, with 8999 other processes in the system, kswapd may have pushed out some old pages (at this point the system has driven itself 20-30 seconds into swap), but some of the other 8999 processes appear to be grabbing the pages away for their own nefarious purposes (because the system would rather give them these brand new pages instead of pushing other pages into swap), and when our hero, the forking process, comes back around, the pages that kswapd freed up for it have been taken away by one of the other 2000 copies of Pick. So fork() fails, lpd complains, waits a second, then tries to fork() again.
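If you want to watch it from userland, a dumb little fork-hammer along these lines (an illustration, not the actual test programs) is enough to see fork() handing back -1 while swap sits there mostly empty:

    /* Dumb fork-hammer: keep forking until fork() fails, then report
     * why.  An illustration, not the actual test program. */
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <unistd.h>
    #include <sys/types.h>

    int main(void)
    {
        long kids = 0;

        for (;;) {
            pid_t pid = fork();

            if (pid == 0) {         /* child: sit there eating a process slot */
                pause();
                _exit(0);
            }
            if (pid < 0) {          /* the interesting case */
                printf("fork #%ld failed: %s\n", kids + 1, strerror(errno));
                sleep(1);
                continue;           /* then keep beating on it, like lpd does */
            }
            kids++;
        }
    }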

Where it starts to get really fun is when this happens to a process that was written with the assumption that the machine has infinite memory and thus fork() will never fail. So when it does fail, the EOF (-1) that fork() returns gets stashed away as a perfectly good pid, and the program eventually sends a signal to pid EOF. kill() loves to send signals to EOF (pid -1 means everybody), particularly when you're root and nobody can resist your evil intentions.
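The failure mode looks more or less like this (a made-up example, not anybody's actual print spooler code):

    /* The classic footgun: assume fork() can't fail.  A made-up
     * example, not actual lpd source. */
    #include <stdio.h>      /* EOF is -1 */
    #include <signal.h>
    #include <unistd.h>
    #include <sys/types.h>

    void spawn_and_reap(void)
    {
        pid_t kid = fork();         /* returns -1 (aka EOF) on failure */

        if (kid == 0) {
            /* ... child does its thing ... */
            _exit(0);
        }

        /* no error check, so later, when it's time to clean up: */
        kill(kid, SIGTERM);         /* kid == -1 here means "signal every
                                       process you're allowed to", which
                                       for root is pretty much everybody */
    }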

So I've been trying to harden fork() so that it will be a bit more stubborn about failing, with a complete lack of success. First, I can't completely harden it, because the Linux boxes we've got won't allow more than ~15000 processes at a time (I suspect that's all that will fit into a 32-bit kernel's address space); and second, the whole idea of Unix is that system calls can fail and you need to retry the call when it fails.
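For a userland program, the retry is about this hard. A sketch of the sort of wrapper I mean, not anything that actually ships in lpd:

    /* Retry fork() on transient failures instead of falling over.
     * A sketch, not anything that ships in lpd. */
    #include <errno.h>
    #include <unistd.h>
    #include <sys/types.h>

    pid_t stubborn_fork(int tries)
    {
        pid_t pid;

        while (tries-- > 0) {
            pid = fork();
            if (pid >= 0)
                return pid;             /* parent or child; all is well */
            if (errno != EAGAIN && errno != ENOMEM)
                break;                  /* not a transient failure; give up */
            sleep(1);                   /* let kswapd catch its breath */
        }
        return -1;                      /* the caller still has to check */
    }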

The first pass was to simply retry each kmalloc() inside do_fork(). That didn't seem to break anything, but it didn't make any difference to the failure cases in dumb synthetic tests. The second pass was to have do_fork() yield the timeslice up to 12 times to give the system additional chances to drive pages out; again, nothing broke, but nothing got better either.
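In outline, the second pass boiled down to something like this; the first pass was the same loop without the schedule(). A sketch of the idea with a made-up helper, not the actual patch:

    /* Retry the small allocations in do_fork() instead of giving up on
     * the first NULL, yielding the processor between attempts so kswapd
     * can free something up.  A sketch, not the actual patch. */
    #include <linux/slab.h>
    #include <linux/sched.h>

    static void *kmalloc_stubborn(size_t size, int flags, int tries)
    {
        void *p;

        while (tries-- > 0) {
            p = kmalloc(size, flags);
            if (p)
                return p;
            schedule();         /* yield the timeslice and come back later */
        }
        return NULL;            /* still nothing; do_fork() fails as before */
    }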

The third attempt was far more amusing; instead of just yielding the processor, I'd put the offending process to sleep for a second, then retry the allocation. At least that was the plan. In reality, what ended up happening was that the machine filled up with test programs, stopped forking completely, and appears to have converted itself into Shelley Winters ca. fall 2002.
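The plan, in outline (again a sketch with a made-up helper, not the code that actually ran):

    /* Third attempt: if the allocation fails, put the forking process to
     * sleep for a second and retry, hoping kswapd has pushed something
     * out by the time we wake up.  This is the approach that went all
     * Shelley Winters. */
    #include <linux/slab.h>
    #include <linux/sched.h>
    #include <linux/param.h>

    static void *kmalloc_patient(size_t size, int flags, int tries)
    {
        void *p;

        while (tries-- > 0) {
            p = kmalloc(size, flags);
            if (p)
                return p;
            set_current_state(TASK_INTERRUPTIBLE);
            schedule_timeout(HZ);       /* nap for roughly one second */
        }
        return NULL;
    }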

I don't think I'll be putting that version of the fork hardening onto any customer sites.