I’ve had the dubious honour of seeing some RAID failures and rebuilds lately. It’s the kind of thing the manuals don’t cover very well, in particular what your RAID will report when it’s having trouble. So, here are a couple of examples from a 3Ware RAID controller, using the tw_cli software. This is what tw_cli /c4 show displays when we have a dead drive:
Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    DEGRADED       -       -       -       149.05    ON     -

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     149.05 GB   312581808     G2109NHG
p1     DEGRADED         u0     149.05 GB   312581808     G20X1BWG
So, we swap the drive, and it looks like this while rebuilding:
Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    REBUILDING     89      -       -       149.05    ON     -

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     149.05 GB   312581808     G2109NHG
p1     DEGRADED         u0     149.05 GB   312581808     G209Y0HG
and after a little while…
Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    OK             -       -       -       149.05    ON     -

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     149.05 GB   312581808     G2109NHG
p1     OK               u0     149.05 GB   312581808     G209Y0HG
There are plenty of obvious strings to match in this output (though there are many other reports available), so it’s a reasonable thing to base a monitoring script on.
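For what it’s worth, here is a rough sketch of what such a monitoring check might look like in Python. It assumes the output has the same layout as the reports above (unit rows starting with u0, u1, … and port rows starting with p0, p1, …) and that anything other than OK is worth flagging. The function names are my own invention, and other firmware or tw_cli versions may well format things differently, so treat it as a starting point rather than a finished tool.

```python
import subprocess


def parse_statuses(output):
    """Map unit and port names (u0, p1, ...) to their Status column.

    Assumes the layout shown in the reports above: on unit rows the
    status is the third field, on port rows it is the second.
    """
    statuses = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank lines, separator rows
        name = fields[0]
        if not name[1:].isdigit():
            continue  # header rows such as "Unit ..." or "Port ..."
        if name.startswith("u"):
            statuses[name] = fields[2]
        elif name.startswith("p"):
            statuses[name] = fields[1]
    return statuses


def problems(output):
    """Return a sorted list like ['p1: DEGRADED'] for anything not OK."""
    return [
        f"{name}: {status}"
        for name, status in sorted(parse_statuses(output).items())
        if status != "OK"
    ]


def check_controller(controller="/c4"):
    """Run 'tw_cli <controller> show' and return the list of problems.

    The controller ID is taken from the example above; yours may differ.
    """
    result = subprocess.run(
        ["tw_cli", controller, "show"],
        capture_output=True, text=True, check=True,
    )
    return problems(result.stdout)
```

Calling check_controller() from cron and mailing yourself whenever the list is non-empty would be one simple way to use it. Note that a rebuilding array is reported as a problem too, which seems like the right default while the array is still running without redundancy.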
It’s nice to see it actually work, and it makes me extremely grateful that I bothered getting RAID in the first place. This would be a much unhappier post if I hadn’t.