[ale] disk drive diagnostics nirvana - NOT - I have questions

Phil Turmel philip at turmel.org
Tue Oct 23 12:11:08 EDT 2012


Hi Ron,

[trim /]

> If you were so inclined, SpinRite can be used on any HDD that the
> computer's bios can see and properly control.  It boots as a stand alone 
> executable with the OS not running.  I use it on my Linux partitions the 
> same as I do on my Windows partitions.  SR works at the sector level.  
> It doesn't care what's on the sectors.  One of my computers has BIOS so 
> old (2002) it cannot see all of the 320 GB drive I have in it.  I use 
> badblocks on that.  Once Linux is booted, it doesn't care about the 
> limits of BIOS.  By the way, in recent podcasts, Steve Gibson, inventor 
> of SpinRite, has indicated that using it on SSD drives, in read only 
> mode, can help the SSD be more reliable and sometimes recover finicky 
> data.  Apparently, the scrubbing has benefits there too.

My point was that I never take those systems offline.  Reboots for new
kernels are the only deliberate downtime.  The last four+ year old drive
I replaced had a power-on count == 7.  Spending more than five or ten
minutes offline is the problem, not SpinRite's support for various formats.

> I like the scrubbing idea.  Essentially what I'm doing manually.

It's really important to find and fix UREs--Unrecoverable Read
Errors--which most manufacturers specify as happening once per 1x10^14
bits read.  That's about 12TB.  Read your 3TB drive four times, and
you're there.  You really need raid to supply the missing sector in a
live setup, or offline Reed-Solomon (like par2).
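
The arithmetic behind that figure can be sketched in a few lines of
Python.  This is only an illustration: it assumes decimal terabytes, and
the per-pass probability further assumes that bit errors are independent
at the spec-sheet rate, which is a simplification and not something
stated above.  The function names are made up for this example.

```python
import math

URE_INTERVAL_BITS = 1e14   # spec-sheet figure: one URE per 1x10^14 bits read
DRIVE_BYTES = 3e12         # a "3TB" drive, in decimal terabytes


def bytes_per_ure(interval_bits=URE_INTERVAL_BITS):
    """Bytes read, on average, per unrecoverable read error."""
    return interval_bits / 8


def full_reads_per_ure(drive_bytes=DRIVE_BYTES, interval_bits=URE_INTERVAL_BITS):
    """How many complete passes over the drive add up to one URE interval."""
    return bytes_per_ure(interval_bits) / drive_bytes


def p_ure_in_one_pass(drive_bytes=DRIVE_BYTES, interval_bits=URE_INTERVAL_BITS):
    """Chance of at least one URE in a single full read.

    Assumption: independent bit errors at exactly the spec rate.
    """
    bits = drive_bytes * 8
    # 1 - (1 - 1/interval)^bits, computed without losing precision
    return -math.expm1(bits * math.log1p(-1.0 / interval_bits))


if __name__ == "__main__":
    print(f"bytes per URE:       {bytes_per_ure():.3e}")
    print(f"full reads per URE:  {full_reads_per_ure():.2f}")
    print(f"P(URE in one pass):  {p_ure_in_one_pass():.3f}")
```

Under those assumptions the interval works out to 12.5 decimal TB, or a
little over four full reads of a 3TB drive.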

[trim /]

> I've always wondered why the surface of a drive that hadn't crashed 
> would deteriorate, and even why the servo mechanism would go wonky.  
> But, what you're saying about the spindle bearing makes some sense.  I 
> can see how that could cause errors.  These drives have been very 
> lightly used, except for running 24 / 7 when the weather is good.  As I 
> said in my original post, as far as I can tell, I can read and write to 
> all the remaining sectors.  However, with something like a bearing 
> spinning at 7200 RPM, even a slight intermittent problem could quickly 
> degenerate into a catastrophic problem.

As with many other products built on microscopic structures, the
manufacturing processes are mostly chemical, and are very susceptible to
particulate contamination.  Some of these flaws are identified in the
factory and mapped out.  Others are just weak spots that lose their
magnetic orientation faster than their neighbors.

> This idea of mechanical failure reminds me of a contact I had years ago 
> with someone in the field of industrial equipment reliability.  They had 
> this really cool test system where they could measure the ultrasonic 
> signature of a motor in a factory and predict failure months in advance, 
> allowing preemptive replacement.  I wonder if you could do such a thing 
> with hard drives reliably.

You certainly can.  But most industrial motors run at 1800 rpm, not
7200rpm, and a drive's time-to-failure after displaying vibration
problems is at least an order of magnitude shorter.  I believe error
rates in the SMART data will show up first.

(My day job is designing and upgrading industrial control systems, so
I've seen a great deal of this kind of stuff.)

> By the way, how could I do background read only scrubbing in Windows and 
> Linux such that each sector is read at least every 1-3 months while the 
> OS is in use.  None of my drives are RAID.

Trigger long self-tests with smartctl under Linux.  I don't know what to
use in Windows.
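
A minimal sketch of what triggering those self-tests could look like
from a script, assuming smartmontools is installed and the script runs
as root.  The helper names and the device paths passed on the command
line are hypothetical; the only real interface used is
`smartctl -t long <device>`.

```python
import subprocess
import sys


def long_test_command(device):
    """Build the smartctl invocation that starts a long (extended) self-test."""
    return ["smartctl", "-t", "long", device]


def start_long_test(device):
    """Ask the drive to begin a long self-test and return smartctl's output.

    The test runs inside the drive's firmware, so the OS stays up and the
    drive remains usable while it proceeds.  Raises CalledProcessError if
    smartctl exits non-zero.
    """
    result = subprocess.run(
        long_test_command(device),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Example: python3 longtest.py /dev/sda /dev/sdb
    for dev in sys.argv[1:]:
        print(start_long_test(dev))
```

Run from cron every month or so, this would cover the 1-3 month
interval asked about.  Results can be read back afterwards with
`smartctl -l selftest <device>`.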

As for using raid or not, just remember that drives sometimes die
without any warning.  If some of the contents are nowhere else, kiss
that material goodbye.  Raid is not a backup, of course, as it provides
no protection from deletion/operator error, but it is better than no
backup at all.

Phil

