[ale] new (to me) raid 4/5 failure mode

Greg Freemyer greg.freemyer at gmail.com
Mon Aug 24 15:18:09 EDT 2009


On Mon, Aug 24, 2009 at 10:54 AM, Pat Regan <thehead at patshead.com> wrote:
> Greg Freemyer wrote:
>> If you are using raid 4 or 5 or considering it in unreliable
>> environments, you may want to think about this.  By unreliable I mean
>> your system fails at unpredictable times due to power, bad hardware,
>> kernel crashes, etc.
>
> RAID helps protect your data from disk failures and not much else.  You
> can increase your reliability with a UPS, a battery backed up RAID
> controller, and multiple disk controllers.
>
>> But in general d2 and p updates are non-atomic in relation to each
>> other, so there is a short period of time where either:
>>
>> d2' ^ p ==> garbage or
>> d2 ^ p' ==> garbage
>
> If your application is calling an fsync the disks in the array should
> never be in this state.  If they are, something is probably wrong.  An
> fsync call isn't supposed to return until the data is actually on the
> platter.
>
> If your application isn't calling fsync then the data must not be that
> important :).

fsync does not address this issue; the window of vulnerability is a
small number of milliseconds for most disk writes.

>> So if a system or power fail occurs, d1 becomes garbage, even though
>> it was never written to by an application!
>
> d1 was garbage the moment it dropped out of the array :).

But it is recreatable from d2 and p, or from d2' and p'.


It is not recreatable from d2 and p', or from d2' and p.  And if you
have a system shutdown during that window of vulnerability, that is all
you will have to work with.
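
To make that window concrete, here is a rough Python sketch (a toy
3-member stripe with byte strings standing in for chunks; the names are
purely illustrative, not anything mdadm actually does):

    # Toy RAID-5 stripe: d1, d2 are data chunks, p is their XOR parity.
    d1     = b"\x11" * 4
    d2_old = b"\x22" * 4
    d2_new = b"\x33" * 4

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    p_old = xor(d1, d2_old)          # parity consistent with d2
    p_new = xor(d1, d2_new)          # parity consistent with d2'

    # With matching generations, the missing d1 reconstructs fine:
    assert xor(d2_old, p_old) == d1
    assert xor(d2_new, p_new) == d1

    # Crash between the d2 and p writes, then lose d1's drive, and the
    # surviving members are from mixed generations -- reconstruction
    # returns garbage even though d1 was never written:
    assert xor(d2_new, p_old) != d1  # d2' ^ p
    assert xor(d2_old, p_new) != d1  # d2  ^ p'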

>> Obviously, I don't mean the entire disk.  Just those d1 chunks that
>> are part of a partially updated stripe.
>
> Once a disk drops out of an array I would expect all data on the drive
> to be bad.

But you expect it to be recreatable from the other drives in the
raidset.  Or at least I do.

>> I had never considered that stable data sitting on a raid 5 might
>> change randomly, even if they were never written to.  I have never
>> been a fan of raid 5.  In fact, I only considered it a good choice for
>> low-end situations.  I think the above rules it out for most of those.
>
> There's nothing special about RAID 5 in this regard.  Any time you have
> multiple disks they can get out of sync during loss of power.

A fully operational RAID 5 will not lose data on one drive just
because a write is going on to another drive when you have an
unexpected shutdown.

> Silent data corruption is the reason ZFS and other new filesystems like
> btrfs write checksums for every block to the disk.  Read up on Sun's
> testing regarding single bit errors per TB of data...  It's pretty scary.
>
>> I think raid 6 would only have a similar issue in a dual disk failure
>> mode, but I'm not positive.
>
> RAID 5 and 6 are both really only suitable for situations that are
> mostly read intensive.  They both suffer from the same "write hole"
> problem.  A cache miss on a write requires a read of 1 stripe from every
> drive, a parity computation, and a write of one stripe to every drive.
> All writes require a hit to every disk.
>
> Sequential writes aren't so bad, random writes are a killer :).

Yeah, like I said, I'm not a raid 5 fan at all.  Raid 6 I think has
its place, but probably only in arrays with lots of drives so that you
can do more work in parallel.  I've pretty much given up on raid 5,
and this failure mode is just one more nail in the coffin for me.
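
One way to see the random-write cost Pat mentions is the read-modify-write
parity update for a single-chunk write (md can also reconstruct the whole
stripe instead; this is just the cheap case).  A rough Python sketch, with
the in-memory "array" and helper names made up for illustration:

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def small_write(read_chunk, write_chunk, data_idx, parity_idx, new_data):
        d_old = read_chunk(data_idx)              # read old data chunk
        p_old = read_chunk(parity_idx)            # read old parity chunk
        p_new = xor(xor(p_old, d_old), new_data)  # new parity
        write_chunk(data_idx, new_data)           # write data
        write_chunk(parity_idx, p_new)            # write parity
        # Four I/Os for one logical write, and the last two writes are
        # exactly the non-atomic d2/p pair discussed above.

    # Tiny in-memory stand-in for a 3-member stripe (index 2 is parity):
    chunks = {0: b"\x11" * 4, 1: b"\x22" * 4, 2: b"\x33" * 4}
    small_write(chunks.get, chunks.__setitem__, 1, 2, b"\x44" * 4)
    assert xor(chunks[0], chunks[1]) == chunks[2]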

> Pat

Greg


