[ale] Seagate 1.5TB drives, bad blocks, md raid, lvm, and hard lock-ups

Matty matty91 at gmail.com
Thu Jan 7 12:15:03 EST 2010


On Wed, Jan 6, 2010 at 3:09 PM, Brian W. Neu <ale at advancedopen.com> wrote:
> I have a graphic design client with a 2U server running Fedora 11 and now 12
> which is at a colo handling their backups.  The server has 8 drives with
> Linux md raids & LVM on top of them.  The primary filesystems are ext4 and
> there is/was an LVM swap space.
>
> I've had an absolutely awful experience with these Seagate 1.5 TB drives,
> returning 10 of the original 14 because of ever-increasing SMART
> "Reallocated_Sector_Ct" values from bad blocks.  The server that the client
> has at their office has a 3ware 9650 (I think) that has done a great job of
> handling the bad blocks from this same batch of drives, sending email
> notifications about one drive that grew more and more bad blocks.  This
> 2U, though, is pure software raid, and it has started locking up.
>
> As a stabilizing measure, I've disabled the swap space, hoping the lockups
> were caused by failures to read from or write to swap.  I have yet to let
> the server run long enough to assess whether this was successful.
>
> However, I'm doing a lot of reading today on how md & LVM handle bad blocks
> and I'm really shocked.  I found this article (which may be outdated) which
> claimed that md relies heavily on the disk's firmware to handle these
> problems, and that when rebuilding an array there are no "common sense"
> integrity checks to assure that the right data is reincorporated into the
> healthy array.  Then I've read more and more articles about drives that were
> silently corrupting data.  It's turned my stomach.  Btrfs isn't ready for
> this, even though RAID5 was very recently incorporated, and I don't see
> btrfs becoming a production-stable file system until 2011 at the earliest.
>
> Am I totally wrong about suspecting bad blocks for causing the lock-ups?
> (syslog records nothing)
> Can md RAID be trusted with flaky drives?
> If it's the drives, then other than installing OpenSolaris and ZFS, how do I
> make this server reliable?
> Any experiences with defeating mysterious lock-ups?
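One note on the md integrity question: md does have a consistency check (a
"scrub") you can trigger from sysfs, which reads every member and compares
mirrors/parity, so latent bad blocks turn up before a rebuild trips over
them.  A rough sketch, assuming the array is md0 (substitute your own):

```shell
# Kick off an md consistency check ("scrub") on md0; the array name is
# just an example.  Progress appears in /proc/mdstat.
echo check > /sys/block/md0/md/sync_action

# After it finishes, a non-zero mismatch count means the mirrors or
# parity disagreed somewhere:
cat /sys/block/md0/md/mismatch_cnt
```

Running that from cron weekly at least surfaces inconsistencies instead of
discovering them mid-rebuild.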

It's possible these could be caused by TLER:

http://en.wikipedia.org/wiki/TLER
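Without TLER a bad sector can send desktop-class firmware into minutes of
internal retries, which from the host's side looks a lot like a lock-up.  On
drives that honor SCT ERC you can check and cap that timeout with smartctl;
/dev/sda below is just a placeholder for one of the array members:

```shell
# Query the drive's SCT Error Recovery Control (TLER) setting.
smartctl -l scterc /dev/sda

# If the drive supports it, cap read/write error recovery at 7 seconds
# (values are in tenths of a second) so a bad sector fails fast and md
# can repair from redundancy, instead of the drive retrying for minutes
# and appearing to hang the box:
smartctl -l scterc,70,70 /dev/sda
```

Note that plenty of desktop drives simply reject the command, and the
setting doesn't persist across power cycles, so it has to be reapplied at
boot.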

I am using a bunch of Seagate 1.5TB drives with ZFS (it has built-in
data checksumming, which silently repairs corruption), and have seen
higher than normal failure rates from these drives. The reviews on
Newegg seem to indicate that these drives have TONS of issues, and I
won't be purchasing any additional Seagate disks in the future.
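Since the 3ware was catching the growing Reallocated_Sector_Ct on its own,
something similar can be cobbled together for the software-raid box by
scraping `smartctl -A` from cron.  A rough sketch (the sample line stands in
for real smartctl output, and the threshold is arbitrary):

```shell
# Stand-in for one line of `smartctl -A /dev/sda` output; attribute 5
# is the reallocated-sector count, with the raw value in the last column.
sample='  5 Reallocated_Sector_Ct   0x0033   095   095   036    Pre-fail  Always       -       212'

# Pull out the raw value and complain if it has grown past a threshold.
count=$(echo "$sample" | awk '$2 == "Reallocated_Sector_Ct" {print $10}')
if [ "$count" -gt 100 ]; then
    echo "WARNING: $count reallocated sectors"
fi
```

The cleaner route is smartd itself, which can watch attributes and mail you
from smartd.conf, but a cron scrape like this is easy to audit.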

- Ryan
--
http://prefetch.net
