[ale] They say drives fail in pairs...

Michael Trausch mike at trausch.us
Tue Jan 3 15:39:03 EST 2012


On 01/03/2012 12:41 PM, Jim Kinney wrote:
> Jim's RULE of RAID: NEVER, EVER INSTALL RAID DRIVES WITH SEQUENTIAL
> SERIAL NUMBERS!!!
> 
> If possible, avoid RAID drive purchases from the same supplier and
> manufacturer.

Indeed.  I already handled that.  When the RAID was deployed, it was
done with all drives from a single batch because it was rushed.
However, the plan was to change one drive every three months until none
of the originals was left.  All five drives now in the array come from
different suppliers and batches.

> I was bit once by a RAID5 failure. The lesson I learned is to image ALL
> remaining drives before replacing the failed drive.

What I find interesting is that two drives croaked within days of each
other.  Fortunately, I wasn't lazy enough to let the first one sit: the
second drive died literally 10 seconds before the rebuild onto the first
one's replacement completed.  Being RAID 6, that was tolerable, if a bit
too close for comfort.

Fingers crossed that no more failures come (soon, that is).

> Google did research on drive failure rates and recover times and found
> that once a drive is larger than 500GB, a RAID5 system is a bad idea.

I have always been allergic to the notion of RAID 5.  The idea that you
can withstand only one lost disk has seemed like a bad bet for as long as
I can remember, going back to 5 and 10 MB MFM hard disk drives.  Back
then, if you cared about your data, you kept copies of it on floppies,
because you never trusted the HDD for robustness, only for speedy
access.  (LOL, I just called 10 MB MFM HDDs speedy...)

> The failure rate of similar age drives (even mixed version, model, etc.)
> was such that the array was likely to fail a second drive before the
> first failure had been recovered from. RAID6 only buys a bit more time
> as it requires a 3 drive failure for data loss. Based on that and my own
> experience, RAID 10 is the only stuff I use now. When the data
> ABSOLUTELY must be available, I use a 3 drive mirror spread across 3
> different controllers (so, yes, software raid) so typically, the mobo
> controller and 2 additional cards. 12 drives in 4 stripes with 3 copies
> of each stripe makes me happy. It's an especially good setup for write
> seldom, read often servers. The write speed is not sufficient for
> high-write-throughput database stuff so I'd incorporate some SSD
> hardware for WAL stuff.
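
As an aside, I believe md's RAID10 can express that three-copies-per-
stripe layout directly (the "near 3" layout).  A rough sketch, with
hypothetical device names, ordered so that each set of three copies
lands on a different controller:

    # 12 devices, 3 near copies per stripe = 4 stripes of capacity
    mdadm --create /dev/md0 --level=10 --layout=n3 \
        --raid-devices=12 /dev/sd[b-m]1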

I am trying to figure out what I am going to do next with this setup.  I
want to move the bulk of the data out of the office and onto a setup
that can sustain roughly a 50/50 read/write mix.  (It is more or less
even: the humans mostly read, and the automated processes write, and
write a lot.)

Anyway, on to the really interesting part.

The way the kernel handles a drive failure in a RAID array during a
scrub leaves a lot to be desired.  I'm not sure whether this is fixed in
a newer kernel (I don't have the resources to really test, and I haven't
dug into the source code just yet).  The whole point of scrubbing the
array is to verify that it really is clean and healthy.
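
(For anyone who hasn't run one: the "scrub" here is md's check action,
which on this box amounts to roughly the following; the array name is
whatever yours happens to be.)

    # kick off a scrub ("check") of /dev/md0
    echo check > /sys/block/md0/md/sync_action

    # watch progress
    cat /proc/mdstat

    # mismatches found, once it finishes
    cat /sys/block/md0/md/mismatch_cnt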

It was during January's scrub operation, however, that it found one
disk dying.  Here is what happened:

  - The scrub was about 90% complete, when it found errors on
    /dev/sdb.

  - The kernel logged to its ring buffer that I/O errors were
    encountered on /dev/sdb.

  - The kernel logged to its ring buffer that the RAID driver corrected
    read errors from /dev/sdb across 40 sectors.

  - The mdadm monitor sent a notification to root, which was routed
    to my cell phone and email (monitor setup sketched below).
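
(That notification comes from mdadm's monitor mode; the relevant bits of
the setup are roughly the following, wherever your mdadm.conf happens to
live.)

    # /etc/mdadm/mdadm.conf -- where alerts get mailed
    MAILADDR root

    # the monitor itself; most distros start this for you at boot
    mdadm --monitor --scan --daemonise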

This all seems fine and dandy.  However, I had no Internet access at
home at the time, so I couldn't use my computer to SSH in.  I tried from the
phone, but it wasn't letting me in, either.  I figured the cell network
was being crappy, so I went to the office.  When I arrived, I found that
the system was utterly unresponsive, like I'd expect to find a Windows
box with its head shoved up its ass.  The access light for the bad drive
was on, and nothing else.  The only thing the system would respond to
was the Magic SysRq key.  So I did an emergency unmount, let it sync for
several seconds, and then rebooted.  When it came back up, the kernel
said "Oh, I was in the middle of scrubbing, lemme get right back to
that", with all five members of the array active.

Wait a minute.  It should have remembered the state from before, which
said that the disk at /dev/sdb had failed.  It didn't.

Fine, so I tried to take /dev/sdb out, which succeeded, and then I
rebooted again.  Nope, it didn't remember, and it started scrubbing again.
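
(For reference, taking a member out is the usual two-step, more or
less:)

    mdadm /dev/md0 --fail /dev/sdb
    mdadm /dev/md0 --remove /dev/sdb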

Fine, I said.  So the next morning (nothing was open at that point) I
went and got a 1 TB disk (ouch; NOT what I wanted, but nobody seems to
have any 750 GB disks anywhere).  Fine.  I went in, and the server was
again unresponsive; it would only listen to the Magic SysRq.

I rebooted, and dropped to a shell.  At this point, the kernel wouldn't
do anything with the array.  It said "failed to start dirty degraded
array".  I thought I was surely screwed.

At this point it said that /dev/sd{c,d1,e1,f1} were fine and in sync.
It said that /dev/sdb was a "spare" (what?  It failed!  No sane person,
not even an idiot, would use that as a spare for anything!).
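
(If you ever want to watch the same circus yourself, keep in mind that a
member's own superblock and the array's view of that member are two
different things.)

    # what the superblock on the member thinks its role is
    mdadm --examine /dev/sdb

    # what the (half-)assembled array thinks of its members
    mdadm --detail /dev/md0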

Anyway, so I tried to remove /dev/sdb.  It said the device was busy.

I said, "fine, I'll swap it out."  So I did.  Then it said "unable to
add disk to array."  The old drive was finally gone, but now I couldn't
add a disk.  Alright.

So I put /dev/sdb back and was back in the dirty, degraded state I was
in before.  Finally, I thought to boot a Live CD in the hope that it
wouldn't autostart the array.  Then I realized I'd forgotten about
mdadm --stop /dev/md0, so I tried that, and it worked (though not on the
first try; the first time it said the device was busy).  Alright, cool.
I then reassembled the array with /dev/sd{c,d1,e1,f1} and it came up.  I
exited, the system resumed booting, and it didn't attempt to scrub.
Awesome.
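
(In other words, roughly this, with the device names as above; they may
of course shuffle around under a live environment.)

    mdadm --stop /dev/md0       # took a couple of tries ("device busy")
    mdadm --assemble /dev/md0 /dev/sd{c,d1,e1,f1}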

At this point I powered down, replaced the drive, partitioned the new
drive and made it a member of the array.  Fantastic.  It's working
again, rebuilding the new drive (partition).
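
(The mechanics of that are roughly the following; the partition-table
copy is the only fiddly bit, and the member names here are
hypothetical.)

    # clone the partition layout from a known-good member to the new disk
    sfdisk -d /dev/sdd | sfdisk /dev/sdb

    # add the new partition and let md rebuild onto it
    mdadm /dev/md0 --add /dev/sdb1
    cat /proc/mdstat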

And then this morning, as it finished rebuilding that drive, it found
another bad one.  At least this time swapping out the disk was a simple,
quick matter (10, maybe 15 minutes).  It seems that if the kernel finds
a bad disk while scrubbing, it doesn't handle it all that well.  But
this failure turned up outside of a scrub, and the kernel did what I
asked without putting up a fight, which made things much easier.
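
(Meanwhile, I'm keeping half an eye on the survivors, something like the
following per drive, with /dev/sdX standing in for each member.)

    smartctl -H /dev/sdX            # overall health verdict
    smartctl -t short /dev/sdX      # queue a short self-test
    smartctl -l selftest /dev/sdX   # read the results once it's done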

Sigh.

Happy new year!

-- 
A man who reasons deliberately, manages it better after studying Logic
than he could before, if he is sincere about it and has common sense.
                                   --- Carveth Read, “Logic”
