Responding to Disk Failure

By Paul Murphy, author of The Unix Guide to Defenestration

Bad things happen to the nicest disks. Specifically, it's not that unusual for a large disk tottering toward failure to signal its intent first by developing some bad blocks that show up in your console window, or system log, as read or write failures.

When a disk develops bad blocks in normal operation, you usually have anywhere from minutes to months to replace it, and no real information on how urgent that fix is. That's no big deal if the disk is mirrored or otherwise part of a RAID group, but of course the disk that goes bad first is usually the one you decided to bet on by leaving it unprotected.

The first step you should take in response to sector read/write errors is to order a replacement drive and decide when to shut down for installation. Meanwhile, you need to keep the system going until the new drive is in.

Unless the bad sectors are in a truly unfortunate place on the disk, the addbadsec utility will let you lock those sectors out and thus prevent the system from hanging on an attempt to read or write them. That works, but it also locks out access to the data stored there and thereby, of course, damages file system integrity.
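The lock-out step above might be wrapped along the following lines. This is only a sketch: it assumes the Solaris x86 form of the command, `addbadsec -a blkno ... raw_device`, so check the man page on your own system before relying on it. The function name, block numbers, and device path are hypothetical placeholders, and the DRY_RUN switch is an addition for illustration so that nothing touches the disk until you ask it to.

```shell
#!/bin/sh
# Sketch only. map_out_blocks, the block numbers, and the device path
# used with it are hypothetical; substitute values from your own error log.
# DRY_RUN=1 (the default) prints the command instead of running it.
DRY_RUN=${DRY_RUN:-1}

map_out_blocks() {
    # Require a raw device plus at least one block number.
    if [ $# -lt 2 ]; then
        echo "usage: map_out_blocks raw_device blkno [blkno ...]" >&2
        return 2
    fi
    rdev=$1
    shift
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ addbadsec -a $* $rdev"
    else
        # Assumed syntax: addbadsec -a blkno [blkno ...] raw_device
        addbadsec -a "$@" "$rdev"
    fi
}
```

Calling `map_out_blocks /dev/rdsk/c0d0p0 123456 123457` with the default setting only prints the command it would run; setting DRY_RUN=0 first makes it real.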

If it's not your boot drive, you can use fuser to find and kill the processes accessing the affected file system, unmount it, run "% fsck -y |& tee record" (csh syntax) to fix it, remount it, and then recover the affected file or files from backup, using the information in the "record" file to identify them.
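The kill, unmount, check, and remount sequence described above can be sketched as a small Bourne shell function. Treat it as an illustration under stated assumptions rather than a finished tool: the function names, mount point, and raw device are hypothetical, `fuser -ck` is the Solaris spelling for "kill everything using this mount point", and remounting by mount point alone assumes the file system has an /etc/vfstab entry. The DRY_RUN switch is an addition for illustration; by default the function only prints what it would do.

```shell
#!/bin/sh
# Sketch only. repair_fs, run, and the paths passed to them are hypothetical.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

repair_fs() {
    mnt=$1     # mount point of the damaged file system
    rdev=$2    # raw device to hand to fsck
    log=$3     # file that keeps the fsck output for the restore step

    run fuser -ck "$mnt"                   # kill processes using the file system
    run umount "$mnt" || return 1          # stop here if it is still busy
    run fsck -y "$rdev" 2>&1 | tee "$log"  # repair and record what fsck changed
    run mount "$mnt"                       # remount (assumes an /etc/vfstab entry)
}
```

For example, `repair_fs /export/data /dev/rdsk/c0t1d0s6 record` prints the four commands in order. With DRY_RUN=0 it runs them, and the "record" file then lists what fsck reported, which tells you which files to restore from backup. Note that the pipe into tee hides fsck's own exit status, so read the log rather than trusting the return code.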

Depending on what else the system is doing, it usually makes sense to bring the system to single-user state (init 1) before starting, and to return to normal operations (Ctrl-D) when finished.

Just remember: the only thing you know for sure once block read/write failures start to appear is that things will get worse, so any action you take pending disk replacement will be palliative at best.