Hard Drive Story

So my MacBook was just sitting on a desk, idle. It was showing a screen saver or something. I woke it up and was greeted with a spinning beachball of death. This seemed to be an unusually persistent hang, so I killed the power and rebooted. I get a flashing system folder: no OS found. I reboot from a Leopard DVD and run Disk Utility. My HD does not even appear on the SATA bus. Around this time I notice a very quiet repeated clicking coming from the HD and I get a bad feeling about the dreaded “click of death”. I power off and reboot from an external drive, but still no sign of the internal one. It seems the drive is dead.
Now is the time to contemplate my backups. I have some recent enough Time Machine backups on an external drive, so I restore one of those onto the other external. That all works, but when I reboot into it and try to log in, I get an odd error about not being able to log in due to a problem with FileVault, which I was indeed using. A bit of googling reveals the sad truth: Time Machine and FileVault do not play nicely together. Or more to the point, Time Machine just ignores everything protected by ‘FailVault’, that is to say it backs up everything except the bits that you actually want backed up. It turns out that it will back stuff up, but only if you log out. Hm. Logging out is for people that want to live without the instant-on we’ve come to love from Apple, and don’t mind waiting 5 minutes for their 10 apps to launch on login, i.e. those that don’t have laptops.
So I extract the internal HD and ship it off to a data recovery company. After a day or so I get the bad news:

We regret to inform you that we have tried our very best to recover data from the faulty Hard Disk Drive you sent to us, unfortunately we were unable to recover any data.

Please accept our apologies for not bringing our services to your satisfaction, despite the fact that from a technical point of view, the damaged inflicted on HDD is beyond repair and it is absolutely impossible to recover any data as the HDD had suffered from severe internal mechanical failure accompanied by media damage, therefore the extreme nature of the damage made it impossible to recover any data.

Despite the usage of different components to get the HDD to spin, the internals were too damaged to read any data from the HDD. The effect of the media damage is immediate on the magnetic information stored on the drive, jeopardising the stored data files and the logical structures.

You can view the scratched area of the hard drive platter in the attached picture, it is the thick dark ring you see running around the inner part of the platter.

I’ve attached that very picture and you can see there is some fairly obvious mechanical damage to the drive: a great (for want of a better word) example of a head crash. The chances of a 50GB encrypted volume file surviving that intact are pretty slim, as they say.
I’m pretty surprised that a modern, state-of-the-art hard drive can suffer spontaneous catastrophic damage like this without having experienced any physical shock. For those that want to know (so you know what to avoid), the drive was a 160GB Seagate Momentus 5400.3, a bit under 2 years old. It’s still under guarantee, but a fat lot of use that is now.
I need to reconsider my backup and encryption options…

3Ware RAID rebuilding

I’ve had the dubious honour of seeing some RAID failures and rebuilds lately. It’s the kind of thing that doesn’t get written about very well in the manuals, in particular what your RAID will report when it’s having trouble. So, here are a couple of examples from a 3Ware RAID controller, using the tw_cli tool. This is what tw_cli /c4 show displays when we have a dead drive:

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    DEGRADED       -       -       -       149.05    ON     -      

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     149.05 GB   312581808     G2109NHG            
p1     DEGRADED         u0     149.05 GB   312581808     G20X1BWG            

So, we swap the drive, and it looks like this while rebuilding:

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    REBUILDING     89      -       -       149.05    ON     -

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     149.05 GB   312581808     G2109NHG
p1     DEGRADED         u0     149.05 GB   312581808     G209Y0HG

and after a little while…

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    OK             -       -       -       149.05    ON     -

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     149.05 GB   312581808     G2109NHG
p1     OK               u0     149.05 GB   312581808     G209Y0HG

There are plenty of obvious strings to match in this output (though there are many other reports available), so it’s a reasonable thing to base a monitoring script on.
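
As a rough idea of what such a script might look like, here is a minimal Python sketch that pulls the status column out of a report like the ones above and lists anything that isn't OK. The function names (parse_status, problems) and the SAMPLE_DEGRADED constant are made up for illustration, and the sketch assumes the column layout shown in this post (unit rows: name, type, status; port rows: name, status). Other controllers or tw_cli versions may well format things differently.

```python
import re

# Sample text copied from the degraded report earlier in this post.
SAMPLE_DEGRADED = """\
Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    DEGRADED       -       -       -       149.05    ON     -

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     149.05 GB   312581808     G2109NHG
p1     DEGRADED         u0     149.05 GB   312581808     G20X1BWG
"""

# Rows of interest start with a unit name (u0, u1, ...) or a port name (p0, p1, ...).
NAME = re.compile(r"[up]\d+")


def parse_status(report):
    """Return {name: status} for every unit and port row in the report."""
    statuses = {}
    for line in report.splitlines():
        fields = line.split()
        if not fields or not NAME.fullmatch(fields[0]):
            continue  # header, separator, or blank line
        # Assumed layout: unit rows carry the status in the third column,
        # port rows carry it in the second.
        index = 2 if fields[0].startswith("u") else 1
        if len(fields) > index:
            statuses[fields[0]] = fields[index]
    return statuses


def problems(report):
    """Return only the units and ports whose status is not OK."""
    return {name: status
            for name, status in parse_status(report).items()
            if status != "OK"}
```

In a real cron job you would feed it the live output of tw_cli /c4 show (for example via subprocess.run, assuming tw_cli is on the PATH) and send yourself a mail whenever problems() comes back non-empty. Note that a rebuild in progress shows up as a problem too, which is probably what you want.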

It’s nice to see it actually work, and it makes me extremely grateful that I bothered getting RAID in the first place. This would be a much unhappier post if I hadn’t.