[ale] read after write verify?, data scrubbing procedures

mike at trausch.us
Fri Oct 26 18:13:34 EDT 2012


On 10/26/2012 04:21 PM, Phil Turmel wrote:
> On 10/26/2012 02:26 PM, mike at trausch.us wrote:
>> On 10/26/2012 09:16 AM, Phil Turmel wrote:
>>> The catch that some people encounter is that some of the metadata space
>>> is wasted, and never read or written.  If a URE develops in that area,
>>> no amount of raid scrubbing will fix it, leaving the sysadmin scratching
>>> their head.
>>
>> Eh, yeah, but I pull the member first and ask questions later.  The way
>> that I see it, if a drive in a RAID has failed, I don't have time to
>> scratch my head and find out why it failed, I have only the time to
>> replace it.  The questions come later, when I dig around logs (both the
>> system and the drive) and usually the answer is clear from the drive
>> logs alone...
>
> UREs by themselves are *not* signs the drive is failed.  On modern
> drives, spec'ed to 1x10^14, they happen all too often.  (Four complete
> read passes through a 3T drive ~= 1x10^14 bits.)

Well, maybe I'm just lucky then.  When scrubbing sends me cries for 
help, I swap the drive out.  I have yet to swap a "good" one out of the 
array, actually.
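
Just to sanity-check that figure, here's the back-of-the-envelope math 
(a rough sketch; the 3 TB size and 1-in-10^14 spec are the numbers from 
the paragraph above):

    # Rough URE arithmetic for a 3 TB drive spec'ed at 1 URE per 1e14 bits.
    drive_bytes = 3e12               # 3 TB as marketed (decimal)
    bits_per_pass = drive_bytes * 8  # ~2.4e13 bits read in one full pass
    passes_per_ure = 1e14 / bits_per_pass
    print(round(passes_per_ure, 1))  # ~4.2, i.e. roughly four full passes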

> I scrub my drives every week, and I'm not replacing the 3T drives every
> month.  Nor will the manufacturer take them on warranty for UREs.

Yeah, that'd be silly.  Since this array got a full set of new drives, 
we've had only one failure.  I expect that this array will actually 
outlive its utility, as we're considering upgrading because we're 
running low on available storage space.

>> Those 30 seconds are 30 seconds I will not forget.
>>
>> Fortunately, the three drives were the last ones out of the original
>> set.  We knew they were going to fail.  But the swap-out
>> schedule got held up for some reason I no longer recall and the
>> drives---which were supposed to all be replaced within one year of
>> deployment---had lasted about 19 months.  (They were horrible choices
>> for a RAID, but they were cheap.  "Re-manufactured", "green" drives.)
>
> This scenario is extremely common with cheap drives due to a mismatch
> between controller timeouts and internal drive error recovery timeouts.
> Standard desktop drives have extremely long error recovery algorithms,
> on the order of two or three minutes.  Linux controllers have a default
> timeout of 30 seconds.  The following scenario happens when creating an
> array from "green" drives:
>
> 1) Drive A experiences a URE and tries to recover it,
> 2) Controller for A times out after 30 seconds and reports the error,
> 3) MD Raid reads another mirror for that data and
> 4a) Supplies it to the caller,
> 4b) Tries to write the data back to A,
> 5) Drive A is still busy recovering and fails to respond to the write,
> 6) Drive A is kicked out of the array as "failed", array is degraded
> 7) Spare drive Z is added to the array and a rebuild started,
> 8) Drive B experiences a URE and tries to recover it,
> 9a) On raid5 or single mirrors, rebuild stops, data is lost.
> 9b) On raid6 or multi mirrors, #2-#6 repeat
> 10) (raid6 or mirrors+) Drive C experiences a URE...
>
> Mind you, this happens with *good* drives that just happen to have UREs
> within the span of a rebuild.  With modern drive sizes, this is very likely.

Perhaps this is another reason for the "general" recommendation to use 
drives that aren't the biggest ones you can get your hands on.
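
For what it's worth, the usual mitigation for that timeout mismatch is 
to cap the drive's internal error recovery with SCT ERC where the drive 
supports it, and to raise the kernel's command timeout where it 
doesn't.  A rough sketch (device names are just examples; assumes 
smartmontools is installed and the script runs as root):

    #!/usr/bin/env python3
    # Sketch: keep drive error recovery shorter than the kernel timeout.
    import subprocess

    DEVICES = ["sda", "sdb"]  # example array members

    for dev in DEVICES:
        # For drives that support SCT ERC: ask the drive to give up on a
        # bad sector after 7 seconds (smartctl takes tenths of a second),
        # well under the kernel's 30-second default.
        subprocess.run(["smartctl", "-l", "scterc,70,70", "/dev/" + dev],
                       check=False)

        # For drives that don't: raise the kernel's per-command timeout
        # instead, so a long internal recovery doesn't get the drive
        # kicked from the array.  (Harmless on drives that do.)
        with open("/sys/block/%s/device/timeout" % dev, "w") as f:
            f.write("180\n")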

> Users of green drives in raid who never knew to scrub their arrays are
> often burned by this, as after months of operation most drives have at
> least one weak spot.  Then they scrub, or have a real failure on one
> drive, and all hell breaks loose.

I cannot imagine why someone would have an array of any sort and _not_ 
scrub the data.  How else are you supposed to catch failure early, 
without losing the customer's data to recklessness?
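
For anyone reading the archives who hasn't set it up: an md scrub is 
just a write to sysfs (most distros ship a periodic cron job for it, 
e.g. Debian's checkarray).  A minimal sketch, assuming the array is 
/dev/md0 and you're running as root:

    # Kick off a "check" scrub on md0 and peek at the mismatch counter.
    ARRAY = "md0"

    with open("/sys/block/%s/md/sync_action" % ARRAY, "w") as f:
        f.write("check\n")      # read-and-compare pass over the array

    # mismatch_cnt is meaningful once the check finishes; a persistent
    # nonzero value is the early warning to start asking questions.
    with open("/sys/block/%s/md/mismatch_cnt" % ARRAY) as f:
        print("mismatch_cnt:", f.read().strip())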

>> In case anyone's curious, the original plan was to swap out 1 drive per
>> quarter, except for the last two drives which were to be swapped out a
>> month apart.  12 months was supposed to be the longest any of them were
>> there...
>
> Except for one early death, all of my non-green drives have lasted 3-5
> years of 24/7 duty with weekly scrubs.
>
> I researched the above scenario after I replaced a couple old Seagate
> drives with newer, larger ones, that happened to not offer the timeout
> adjustments the old ones had (SCTERC).  A few months later they both
> dropped out of a raid6 during a scrub.
>
> The only manufacturer of consumer/desktop drives that still supports
> SCTERC is Hitachi, FWIW.

Interesting.  Though I stopped looking at the brand names of hard disk 
drives a long time ago, when they became true commodity items.

>>>> ... are inversely proportional to just how much you actually attempt to
>>>> protect your data from failure.  :-)  And being that I have backups in
>>>> place, I'm not terribly worried about that.  Drive fails?  Replace it.
>>>> Two drives fail?  Replace them.  Three or more drives fail?  Recover it.
>>>>     I get a much larger paycheck that week, then.
>>>
>>> :-)  I'm self-employed.  I get a much *smaller* paycheck when I spend
>>> too much time on this.
>>
>> Hrm.  Bill hourly!
>
> ?  Bill myself hourly?  I'm an engineer, not an IT contractor.  I do IT
> for myself and my own small business.  Time spent on IT is time *not*
> spent on engineering.

Ahh, makes sense.  Call me!  :-)

>> Flat-rate is high-risk, and I'll only do it for insane values of "flat
>> rate".  Pay me $25,000 per month, and I'll become your dedicated support
>> dude, no questions asked, and assign all my other work to someone else.
>>    That's about the smallest flat rate I'd take.  :-)
>
> We're getting a bit OT here, but I arrange for most of my engineering
> work to be paid a fixed fee per project.

Yes, a lot of people work that way.

I've been burned by the flat-fee style, so it's easier (for me) to just 
go with "I charge for my services by the hour" instead of "I'll do X 
for $Y."  With the latter, something goes horribly wrong and I'm still 
committed to doing X for $Y, even if the fault lies with the client and 
I've now spent three times the time required... well, you get the 
picture.

>>>>> par2 is much better than md5sums, as it can reconstruct the bad spots
>>>>> from the Reed-Solomon recovery files.
>>>>
>>>> Interesting.  Though it looks like it wouldn't work for my applications
>>>> at the moment.  Something that can scale to, oh, something on the order
>>>> of two to four terabytes would be useful, though.  :-)
>>>
>>> I find it works very well keeping archives of ISOs intact.  The larger
>>> the files involved, the more convenient par2 becomes.
>>
>> My reading of the Wikipedia article implied that wasn't really
>> possible.  I'm guessing it's subtly inaccurate somehow---my
>> understanding was that Par2 is limited to 32,768 blocks of recovery
>> data.  That doesn't sound like it'd scale to 1 TB or so unless the
>> block size is 32 MB or larger.
>
> Works fine with 8+ GB isos.  I didn't read the wiki article.

Yeah, I'll have to look at it more closely.  If it allows you to set the 
block size, then it can protect 1 TB of data using a 32MB block size (if 
I understand it correctly).
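
The arithmetic does work out that way (a sketch of my reading of the 
limit, not of the par2 spec itself):

    # If par2 tops out around 32,768 blocks, the minimum block size
    # scales with the amount of data being protected.
    data_bytes = 1 * 2**40            # 1 TiB to protect
    max_blocks = 32768                # assumed per-set block limit (2**15)
    min_block_mib = data_bytes / max_blocks / 2**20
    print(min_block_mib, "MiB")       # 32.0 MiB, matching the guess above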

	--- Mike

-- 
A man who reasons deliberately, manages it better after studying Logic
than he could before, if he is sincere about it and has common sense.
                                    --- Carveth Read, “Logic”

