[ale] disk drive diagnostics nirvana - NOT - I have questions

Phil Turmel philip at turmel.org
Mon Oct 22 21:26:10 EDT 2012


Hi Ron,

On 10/22/2012 06:12 PM, Ron Frazier (ALE) wrote:
> Hi all,
> 
> I've spent the last couple of days doing disk diagnostics on all my hard 
> drives, which I do periodically, and learning more than I really wanted 
> to know about sector errors.  I'll try to share more details later, but 
> for now, I'm just going to post the minimum.  As you may know, an HDD 
> that works perfectly from the factory may develop problems over time and 
> show either bad (reallocated) sectors or bad blocks.  Since the HDD 
> controller can usually only discover read/write problems when you 
> actually access the sector, I've developed a practice over the years to 
> read and write every sector on the hard drive a few times per year.  I 
> usually use Spinrite, which can operate on Windows or Linux drives.  It 
> boots as a free standing executable.  In the mode I use, it reads, 
> inverts and rewrites every sector on the disk, then does it again.  This 
> forces the drive's controller to find and remap any weak sectors to 
> somewhere else while they can still be read properly.  If the sector 
> doesn't read, Spinrite uses advanced statistical algorithms to try up to 
> 2000 times to recover the data.  You can also do something similar with 
> badblocks -nsv in Linux, except for the bad sector recovery, although I 
> don't know exactly what algorithm it uses.  On a large drive, these 
> tests take days to complete.  Once they're done, I know the drive can 
> reliably read and write every sector, or, if it couldn't, that those 
> questionable sectors have been reallocated to other areas by the 
> controller.  The first thing I do when I get a new drive is write 
> it with random data then Spinrite it about 6 times to thoroughly burn it 
> in.  I then follow up with one such procedure every 4 - 6 months.

I only run Windows in VMs nowadays, and my critical servers (home media
server and my small business server) are 24/7 Linux, so Spinrite isn't
for me.
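
The badblocks route Ron mentions works fine for a periodic read-write
pass, though.  A sketch of what that looks like, assuming the device is
unmounted and /dev/sdX is replaced with the real device name:

  # SMART attributes before the pass, so the counts can be compared later
  smartctl -A /dev/sdX

  # Non-destructive read-write test: badblocks reads each block, writes
  # test patterns over it, verifies them, and restores the original data
  # (-n), showing progress (-s) and reporting errors verbosely (-v)
  badblocks -nsv /dev/sdX

  # SMART attributes afterwards, to see whether anything got reallocated
  smartctl -A /dev/sdX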

My critical servers all use Linux software RAID in various combinations,
and all of the RAID arrays are scrubbed weekly.  By scrubbed, I mean a
cron job instructs the kernel to read every sector on every member
device in the background, compute parity as appropriate, and report any
inconsistencies.  Any read errors trigger the corresponding recovery and
rewrite functions that would normally occur if an application
encountered the sectors.  Any unsuccessful write kicks that device out
of the array as usual.
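
For concreteness, a minimal version of that cron job, with md0 standing
in for whatever array you actually have, could look something like the
following (Debian-based distros ship a fancier checkarray script that
does the same thing for every array):

  # /etc/cron.d/md-scrub -- start a background scrub every Sunday at 1 AM.
  # The md driver reads every sector of every member device, verifies the
  # redundancy, and rewrites any unreadable sectors from the good copies.
  0 1 * * 0  root  echo check > /sys/block/md0/md/sync_action

  # Afterwards, the number of inconsistencies found is reported here:
  #   cat /sys/block/md0/md/mismatch_cnt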

I have been doing this for about ten years now, with about seven or
eight drive failures in that time.  Never lost any data, though I've
been nervous a few times when waiting for a replacement disk for a RAID5
array.  Everything is now RAID6 or triple-mirrored, so I sleep well.

> What usually happens is:
> 
> * Run file system check.  No problems, or minor problems fixed.
> * Run Spinrite or badblocks.  No read write errors.
> * Follow up by checking SMART data using Disk Utility or GSmartControl.  
> (PS, Disk Utility will not show SMART data on a USB drive due to a bug, 
> but GSmartControl can.)  No bad sectors and no pending reallocations.
> 
> I have two 1 TB drives that I use for backup.  I back up to one, then 
> mirror it to the other.  I recently had occasion to completely read one 
> and write the other in a mirroring process.  As far as I know, there are 
> no read or write errors.  When I ran the SMART check, I found that one 
> of these has 12 bad or reallocated sectors and the other has 120.  This 
> prompted me to start the Spinrite process (read, invert, write, then 
> repeat) on one, which I haven't finished yet.  I could 
> have used badblocks as well.  I've finished 72% of one drive, and, thus 
> far, have had no read or write failures or bad blocks reported.
> 
> So, the $600,000 question is this.  Assuming every active sector on the 
> drives can be successfully read and written, should I be concerned about 
> 12 or 120 bad reallocated sectors?  I find a wide range of opinions on 
> the net, from "not a problem" all the way to "replace the drives 
> immediately."  Note that these are my backup drives for this PC, so I 
> REALLY don't want them to fail.  The drives may be more than 5 years 
> old.  I'd have to dig through receipts.  However, they're showing a 
> powered-on time of 2.1 years.
> 
> Let me know what you think.

All of the drives that failed on me had fewer than 100 relocated
sectors.  None of them had fewer than 20 relocated sectors.  Most had
30,000+ hours of operation.  This seems to correspond well to the
reports I read on the linux-raid mailing list.  I tolerate drives with
single-digit relocation counts, but I recheck them every week.  Once
the count climbs past single digits, they're outta there.
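
The weekly recheck doesn't need to be anything elaborate; something like
this is enough to keep an eye on a suspect drive (again, /dev/sdX is a
placeholder):

  # The attributes worth watching on an aging drive
  smartctl -A /dev/sdX | \
    egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'

  # Optionally have the drive run its own extended surface scan in the
  # background, then read the result out of the self-test log later
  smartctl -t long /dev/sdX
  smartctl -l selftest /dev/sdX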

Some of the research on the topic suggests that a climbing relocation
count is most often a symptom of impending spindle bearing failure,
where the wobble causes head-tracking errors.  Whatever the underlying
reason, that's my red line.

HTH,

Phil

