[ale] They say drives fail in pairs...

Jim Kinney jim.kinney at gmail.com
Tue Jan 3 17:32:32 EST 2012


Good writeup!

<snippage>



> This all seems fine and dandy.  However, I had no Internet access at
> home at the time so I couldn't use my computer to SSH in.  I tried from the
> phone, but it wasn't letting me in, either.  I figured the cell network
> was being crappy, so I went to the office.  When I arrived, I found that
> the system was utterly unresponsive, like I'd expect to find a Windows
> box with its head shoved up its ass.  The access light for the bad drive
> was on, and nothing else.  The only thing the system would respond to
> was the Magic SysRq key.  So, I did an emergency unmount, sync for
> several seconds, and then rebooted.  When it came back up the kernel
> said "Oh, I was in the middle of scrubbing, lemmie get right back to
> that", with all five members in the array active.
>
> Waitaminute.  It should have kept the data from before that said that
> the disk at /dev/sdb failed.  It didn't.
>

Actually, it did. It had already marked the sdb drive as dirty and flagged
the scrub to continue for as long as needed.


>
> Fine, so I tried to take /dev/sdb out, which succeeded, and then I
> rebooted again.  Nope, it didn't remember, and it started scrubbing again.
>
> Fine, I said.  So the next morning (nothing was open at this point) I
> went and got a 1 TB disk (ouch---NOT what I wanted, but nobody seems to
> have any 750 GB disks anywhere).  Fine.  I went in, and the server was
> again unresponsive; it would only listen to the Magic SysRq key.
>
> I rebooted, and dropped to a shell.  At this point, the kernel wouldn't
> do anything with the array.  It said "failed to start dirty degraded
> array".  I thought I was surely screwed.
>

The linguistic connotations here are hilarious!

>
> At this point it said that /dev/sd{c,d1,e1,f1} were fine and in-sync.
> It said that /dev/sdb was a "spare" (what?  It failed!  No sane person,
> no idiot even, would use that as a spare for anything!).
>

Yep. sdb is totally hosed. Most likely the damage is in the tiny 1-2 block
reserved section that RAID systems use to store metadata about the drive.
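
For anyone who wants to check that guess, the md metadata can be dumped per
device. A minimal sketch, assuming the array is /dev/md0 and the suspect disk
is /dev/sdb as in the writeup. The run helper and its DRY_RUN switch are made
up for illustration; by default they only print the commands:

```shell
#!/bin/sh
# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

inspect_member() {
    # $1 = array (assumed /dev/md0), $2 = member device (assumed /dev/sdb)
    run mdadm --detail "$1"     # array-level view: state and role of each member
    run mdadm --examine "$2"    # per-device view: md superblock stored on that disk
    run cat /proc/mdstat        # kernel's current view, including resync progress
}
```

Call it as "inspect_member /dev/md0 /dev/sdb"; set DRY_RUN=0 (as root) to
actually run the commands instead of printing them.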

>
> Anyway, so I tried to remove /dev/sdb.  It said that it was busy.
>
> I said, "fine, I'll swap it out."  So I did.  Then it said "unable to
> add disk to array."  The other one was gone finally, but now I couldn't
> add a disk.  Alright.
>
> So I put /dev/sdb back and was back at the dirty degraded place I was
> before.  Finally, I thought to boot with a Live CD in the hopes that it
> wouldn't autostart the array.  Then I realized I'd forgotten about mdadm
> --stop /dev/md0 --- so I tried that, and it worked (though not the first
> time, the first time it said it was busy).  Alright, cool.  So I then
> reassembled the array with /dev/sd{c,d1,e1,f1} and it came up.  I
> exited, the system resumed booting, and it didn't attempt to scrub.
> Awesome.
>
Yes. You found the magic incantation. Stop the array, slaughter the dead
drive, do admin things, start the array.
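
Spelled out, that incantation looks roughly like this. A sketch only, using
the device names from the writeup. /dev/sdb1 as the replacement partition is
an assumption (the writeup doesn't name it), --run is there because the array
has to start with a member missing, and the run helper with its DRY_RUN
switch is hypothetical and only prints the commands by default:

```shell
#!/bin/sh
ARRAY=/dev/md0
GOOD_MEMBERS="/dev/sdc /dev/sdd1 /dev/sde1 /dev/sdf1"   # from the writeup
NEW_MEMBER=/dev/sdb1                                    # assumed name

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

restart_without_dead_drive() {
    # Stop the array so nothing is holding the dead member busy.
    run mdadm --stop "$ARRAY"
    # Reassemble from the good members only; word splitting is intended.
    run mdadm --assemble --run "$ARRAY" $GOOD_MEMBERS
}

add_replacement() {
    # After swapping and partitioning the new disk, add it so the rebuild starts.
    run mdadm --manage "$ARRAY" --add "$NEW_MEMBER"
}
```

Run restart_without_dead_drive first, swap and partition the disk, then run
add_replacement. As in the writeup, the first --stop may report the array as
busy and need a second try.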

>
> At this point I powered down, replaced the drive, partitioned the new
> drive and made it a member of the array.  Fantastic.  It's working
> again, rebuilding the new drive (partition).
>
> And then this morning, as it finished rebuilding that drive it found
> another bad one.  At least this time swapping out the disk was a simple,
> quick matter (10, maybe 15 minutes).  It seems that if the kernel finds
> a bad disk while scrubbing, it doesn't handle it all that well.  But
> having found the new dead drive without it trying to scrub the array was
> much easier, as it did what I asked without putting up a fight.
>

Computers suck sometimes. Alcohol helps. If you drink enough, you don't
care about pouring a fifth of grain into a drive array and setting it on
fire!

>
> Sigh.
>
> Happy new year!
>
> --
> A man who reasons deliberately, manages it better after studying Logic
> than he could before, if he is sincere about it and has common sense.
>                                   --- Carveth Read, “Logic”



-- 
James P. Kinney III

As long as the general population is passive, apathetic, diverted to
consumerism or hatred of the vulnerable, then the powerful can do as they
please, and those who survive will be left to contemplate the outcome.
- 2011 Noam Chomsky

http://heretothereideas.blogspot.com/