Articles March 29, 2006 at 12:17 pm

FSCK-ing a Big Storage Disk ‘On-the-Fly’

Verifying large volumes can be a real pain, especially on systems that need to be up as much as possible.

Here are some ideas on how to make this as painless as possible. Let’s assume that your OS X Server runs several common services: NAT, DNS, DHCP, Mail, and Web. Let’s also assume that you have a mailstore (Cyrus IMAP, if you are keeping up with the Joneses) on a second physical disk. Finally, let’s assume that you have reason to believe there’s some data corruption on the second disk, which we will call /Volumes/Storage.

The way we’d fix this on a client would be to boot in single-user mode (apple-s) and run fsck -fy. However, we have to remember the services we have running on our server. Some of these services (such as DNS and especially NAT) really need to be running 24/7.

So we’ll do this ‘on-the-fly’. First, we need to stop any services that may need access to /Volumes/Storage. I prefer to do this with the command-line ‘serveradmin’ tool. I log in to the server over ssh and type

sudo serveradmin list

to get the list of available services. Then I type

sudo serveradmin stop [servicename]

for each of the services that need to access /Volumes/Storage. Fortunately, unless I’ve done some crazy modifications, I won’t need to shut down NAT, DNS, Open Directory, or DHCP since their associated files are all on the boot drive.
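If there are several services to stop, a small loop saves some typing. This is only a sketch: services_do and the RUN variable are names I made up for illustration, and the service names in the example are placeholders, so use whatever serveradmin list reports on your server.

```shell
#!/bin/sh
# Sketch: run one serveradmin action against a list of services.
# RUN is a hypothetical knob: unset, commands run via sudo;
# set RUN=echo to print the commands instead of running them.
services_do() {
    action="$1"   # "stop" or "start"
    shift
    for svc in "$@"; do
        # Give up at the first failure so a problem is noticed right away.
        ${RUN:-sudo} serveradmin "$action" "$svc" || return 1
    done
}

# Example (placeholder service names):
#   services_do stop afp mail web
```

Running it once with RUN=echo is a cheap way to confirm the list before anything is actually stopped.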

WARNING: There’s a good chance that you have network homes on /Volumes/Storage. If you have any users USING their network homes, suddenly stopping the AFP service will wreak havoc. Make sure nobody is using AFP when you turn it off.

The good news is that most users can live with a short lapse in service for Web and Mail services.

Now that no services are accessing /Volumes/Storage, I can proceed to unmount (NOT eject) the volume prior to repairing it.

sudo hdiutil unmount /Volumes/Storage

If the volume refuses to unmount and I am SURE that I can get forceful with it, I can run

sudo hdiutil unmount -force /Volumes/Storage

Ed. Note: You can also use the lsof or fs_usage commands here to determine what processes are keeping the files open. Terminate the process in question and the volume should unmount cleanly.
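As a rough illustration of the lsof approach: when lsof is given a mount point, it lists the files open on that file system. The helper below (summarize_lsof is a name invented here) boils lsof's default output down to one line per process, assuming the command name and PID are the first two columns.

```shell
# Sketch: reduce lsof's default output (read on stdin) to unique
# "COMMAND PID" pairs, skipping the header line.
summarize_lsof() {
    awk 'NR > 1 { print $1, $2 }' | sort -u
}

# Example, run on the server while the volume is still mounted:
#   sudo lsof /Volumes/Storage | summarize_lsof
```

Each line of the result is a process to stop (preferably through its service) before trying the unmount again.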

Then I run

df

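to get the device name for /Volumes/Storage (usually something like /dev/disk1s3). Keep in mind that df only lists mounted volumes, so it is safest to note the device name before unmounting; diskutil list will also show the partitions of a disk whose volume is no longer mounted.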
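To pull the device name out of the df output without reading it by eye, something like the following should work. It is a sketch: device_for_mount is a made-up name, and it assumes the mount point contains no spaces and appears as the last field on its line.

```shell
# Sketch: print the device for a mount point, given df output on stdin.
# Assumes the mount point has no spaces and is the last field on its line.
device_for_mount() {
    awk -v mp="$1" 'NR > 1 && $NF == mp { print $1 }'
}

# Example, run while the volume is still mounted:
#   DEV="$(df | device_for_mount /Volumes/Storage)"
#   echo "$DEV"
```

If the function prints nothing, the volume was not in the df listing, which usually means it is already unmounted.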

Then I run the check:

sudo fsck_hfs -f /dev/[disk id string]

(the -f option forces fsck_hfs to scan the drive even though it’s journaled)

After the scan is over (it took about 30 minutes on my test box, a 450 MHz G4 with a 500 GB RAID), I can remount the volume:

sudo hdiutil mountvol /dev/[disk id string]

Then I repeat the serveradmin commands I ran earlier, replacing stop with start

sudo serveradmin start afp

And so on until all my services are back up.
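For the record, the whole sequence could be strung together roughly like this. Treat it as a sketch under stated assumptions rather than a tested tool: check_volume and RUN are names invented here, the mount point, device, and service names are all supplied by you, and it uses hdiutil's unmount and mountvol verbs. I chose to stop dead if the unmount or the check fails, leaving things down for a human to look at instead of remounting a suspect volume.

```shell
# Sketch of the full sequence.
#   $1 = mount point, $2 = device node, remaining args = services to cycle.
# RUN is a hypothetical knob: unset, commands run via sudo;
# RUN=echo prints the commands instead of running them.
check_volume() {
    mnt="$1"; dev="$2"; shift 2
    run="${RUN:-sudo}"

    for svc in "$@"; do
        $run serveradmin stop "$svc" || return 1
    done

    # Do not touch the disk unless the unmount really succeeded.
    $run hdiutil unmount "$mnt" || return 1

    # If the check fails, leave the volume unmounted and services stopped.
    if ! $run fsck_hfs -f "$dev"; then
        echo "fsck_hfs reported a problem; leaving $mnt unmounted" >&2
        return 1
    fi

    $run hdiutil mountvol "$dev" || return 1

    for svc in "$@"; do
        $run serveradmin start "$svc" || return 1
    done
}

# Example (placeholder values), previewing first:
#   RUN=echo check_volume /Volumes/Storage /dev/disk1s3 afp mail web
```

The preview run prints the seven or so commands in order, which makes it easy to compare against the manual steps above before letting it loose with sudo.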

You could do this with Disk Utility, but I feel this method is a bit more reliable (since, for example, the admin account on your server may have a home folder on the volume you are trying to unmount). Also, Disk Utility doesn’t have the -force option that hdiutil has.

(Ed. Note: We find this an interesting way to check a disk, and it is nice to see someone use fsck on a volume other than the boot volume. In practice, though, I’ve always performed all of this with serveradmin and diskutil: diskutil unmount (which has a force option) or unmountDisk for a whole disk, repairVolume (which calls the appropriate fsck utility), and diskutil mount or mountDisk. I do find the use of hdiutil rather clever here, as I would never have thought to use it to touch physical disks! Different strokes…)
