[ale] Failed drives on SuperMicro server backplane

Jim Kinney jim.kinney at gmail.com
Fri Oct 23 12:13:08 EDT 2009


On Fri, Oct 23, 2009 at 9:45 AM, Jeff Hubbs <jhubbslist at att.net> wrote:
> Looks like I picked a bad time to re-join the list and ask a
> Linux-related question, but I just wanted to follow up...

Hey! This is the Atlanta Linux
(freeforalllistofanythingwefindinteresting)Enthusiasts. :-)
>
> I've been getting led through the use of tw_cli (3ware userspace config
> utility) via IRC and it looks like it's the 3ware SATA card that may
> have failed and not the two drives.  However, I am told that I should
> not proceed debugging/replacing/etc. without first updating the firmware
> on the Seagate drives themselves (ST31000340AS).  To their substantial
> credit, Seagate has a downloadable ISO to make a bootable CD that will
> flash the firmware from any working PC with SATA.  So, now that the
> array is unmounted and the md devices stopped, I'm going to pull the
> remaining six drives and the two spares in the server as well as the
> "failed" drives and update the firmware on all 10 of them.

Before you jump on the firmware update, double-check the drive models and
the firmware versions they are running now. Seagate had a bad firmware
update not too long ago that would render the drives useless.
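Reading the model and current firmware off each drive before flashing is cheap. A minimal sketch follows. It assumes smartmontools is installed, the 3ware card shows up as /dev/twa0, and drives sit on ports 0 through 7; `smartctl -i -d 3ware,N` is how smartmontools addresses a drive behind a 3ware 9000-series card. The function names are made up for illustration.

```bash
#!/bin/bash
# Hypothetical helpers: report model + firmware for each drive.
# Assumes smartmontools is installed and the 3ware 9xxx card is /dev/twa0.

# Extract "model firmware" from `smartctl -i` output read on stdin.
parse_model_fw() {
    awk -F': *' '
        /^Device Model:/     { model = $2 }
        /^Firmware Version:/ { fw = $2 }
        END { if (model != "") print model, fw }
    '
}

# Walk ports 0..PORTS-1 on the controller (defaults: /dev/twa0, 8 ports).
list_3ware_firmware() {
    local dev=${1:-/dev/twa0} ports=${2:-8} p info
    for ((p = 0; p < ports; p++)); do
        info=$(smartctl -i -d "3ware,$p" "$dev" | parse_model_fw)
        if [ -n "$info" ]; then
            echo "port $p: $info"
        fi
    done
}
```

On the PC used for flashing, where the drive is a plain SATA disk, `smartctl -i /dev/sdX | parse_model_fw` does the same for one drive.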
>
> I am really starting to question my decision to use kernel RAID for
> these arrays.  It makes for faster arrays, but no one is really going to
> appreciate having to "man mdadm" in order to figure out how to fail-mark
> and remove drives from arrays before pulling them and again to add the
> replacements and rebuild the arrays.  Even *I* would trade some I/O rate
> for being able to "slam sleds and forget".

bash scripting is your friend. Bang out a few scripts like
show_array_health and pull_bad_array_drives.
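A minimal sketch of those two helpers, assuming the standard /proc/mdstat layout: the function names come from the suggestion above, but everything inside them, plus the DRY_RUN and PULLED_LIST variables, is hypothetical. It only handles members the kernel has already flagged (F). A drive that is flaky but not yet failed would need `mdadm /dev/mdX --fail /dev/sdY` first.

```bash
#!/bin/bash
# Hypothetical md helpers. Pass a file instead of /proc/mdstat to test.
# DRY_RUN=1 prints mdadm commands instead of running them.

run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$@"; else "$@"; fi
}

# One line per redundant array: name, member map ([UU] / [_U]), verdict.
# RAID0 arrays have no member map line, so they are not listed.
show_array_health() {
    awk '
        /^md[0-9]+ :/ { name = $1; next }
        name != "" && /\[[U_]+\]/ {
            printf "%s %s %s\n", name, $NF, ($NF ~ /_/ ? "DEGRADED" : "ok")
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}

# Print "array member" for every member the kernel has flagged (F).
list_failed_members() {
    awk '
        /^md[0-9]+ :/ {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\[[0-9]+\]\(F\)/) {
                    dev = $i
                    sub(/\[.*/, "", dev)   # "sde[0](F)" -> "sde"
                    print $1, dev
                }
            }
        }
    ' "${1:-/proc/mdstat}"
}

# Remove each (F) member from its array so its sled can be pulled.
# If PULLED_LIST is set, append "array member" lines to it for later.
pull_bad_array_drives() {
    local md dev
    list_failed_members "${1:-}" | while read -r md dev; do
        run mdadm --manage "/dev/$md" --remove "/dev/$dev" || continue
        if [ -n "${PULLED_LIST:-}" ]; then
            echo "$md $dev" >> "$PULLED_LIST"
        fi
    done
}
```

Sourcing the file (any name works) and running `show_array_health` followed by `DRY_RUN=1 pull_bad_array_drives` lets you eyeball the mdadm commands before running them for real.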

Replacing bad drives will need to be a one-drive-at-a-time job so you
can enter the slot/device parameters, but a separate replace_failed_drive
script can present a list of previously marked bad drives for processing.
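One way replace_failed_drive could look, as a sketch: it assumes a plain list file of `mdX sdY` lines (that format is an assumption, not anything mdadm writes). Called with only the list, it shows the pending drives. Called with an entry number, it re-adds that slot with `mdadm --manage --add`, optionally under a new device name, since a replacement does not always come back with its old name. Whole-disk members are assumed, as in this setup. DRY_RUN is a made-up switch.

```bash
#!/bin/bash
# Hypothetical replace_failed_drive: one drive per invocation.
#   replace_failed_drive LIST             -> list pending drives
#   replace_failed_drive LIST N [NEWDEV]  -> add NEWDEV (default: old name)
# LIST holds one "mdX sdY" pair per line (an assumed format).

run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$@"; else "$@"; fi
}

replace_failed_drive() {
    local list=$1 n=${2:-} newdev=${3:-} md olddev dev
    if [ -z "$n" ]; then
        # No entry chosen: just show what is waiting to be replaced.
        awk '{ printf "%d) /dev/%s (was in /dev/%s)\n", NR, $2, $1 }' "$list"
        return 0
    fi
    md=$(awk -v n="$n" 'NR == n { print $1 }' "$list")
    olddev=$(awk -v n="$n" 'NR == n { print $2 }' "$list")
    if [ -z "$md" ]; then
        echo "replace_failed_drive: no entry $n in $list" >&2
        return 1
    fi
    dev=${newdev:-$olddev}
    if [ -z "${DRY_RUN:-}" ] && [ ! -b "/dev/$dev" ]; then
        echo "replace_failed_drive: /dev/$dev is not a block device yet" >&2
        return 1
    fi
    # Whole disks are added directly; partitioned members would need
    # the partition table copied onto the new drive first.
    run mdadm --manage "/dev/$md" --add "/dev/$dev" || return 1
    # Drop the processed entry so the list only holds pending drives.
    awk -v n="$n" 'NR != n' "$list" > "$list.tmp" && mv "$list.tmp" "$list"
}
```

Keeping it to one drive per call matches the point above: a human still confirms which sled went into which slot.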

Dumping your history file into a bash script is how I always start the
automation process.

The BIG kernel RAID advantage is that it makes system monitoring much
easier with Zenoss and similar tools.
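Since md state is all in /proc/mdstat, a check can be a few lines. The sketch below follows the Nagios-plugin exit-code convention (0 OK, 2 CRITICAL, 3 UNKNOWN), which many monitoring systems can run as an external command check. How you hook it into a particular tool is up to that tool, and the function name is hypothetical.

```bash
#!/bin/bash
# Hypothetical Nagios-style md check: 0 = OK, 2 = CRITICAL, 3 = UNKNOWN.
# Pass a file to test; defaults to the kernel's /proc/mdstat.
check_md_raid() {
    local mdstat=${1:-/proc/mdstat} degraded
    if [ ! -r "$mdstat" ]; then
        echo "MD RAID UNKNOWN - cannot read $mdstat"
        return 3
    fi
    # Collect "mdN [U_]" for every redundant array missing a member.
    degraded=$(awk '
        /^md[0-9]+ :/ { name = $1; next }
        name != "" && /\[[U_]+\]/ {
            if ($NF ~ /_/) { printf "%s%s %s", sep, name, $NF; sep = ", " }
            name = ""
        }
    ' "$mdstat")
    if [ -n "$degraded" ]; then
        echo "MD RAID CRITICAL - degraded: $degraded"
        return 2
    fi
    echo "MD RAID OK"
    return 0
}
```

A degraded RAID1 pair under a RAID0 stripe keeps the stripe up, which is exactly the case a check like this should catch before the second drive in that pair goes.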
>
> Opinion?
>
> Jeff Hubbs wrote:
>> I've had two of eight SATA drives on a 3ware 9550 card fail due to a
>> protracted overtemp condition (SPOF HVAC).
>>
>> The eight drives are arranged in kernel RAID1 pairs and the four pairs
>> are then kernel RAID0ed (yes, it flies).  The two failed drives are in
>> different pairs (thank goodness) so the array stayed up.  I've used
>> mdadm --fail and mdadm --remove to properly mark and take out the bad
>> drives and I've replaced them with on-hand spares.
>>
>> The problem is that even with the new drives in, I don't have a usable
>> sde or sdk anymore.  For instance:
>>
>>    # fdisk /dev/sde
>>    Unable to read /dev/sde
>>
>> [note:  I've plugged spare drives into another machine and they fdisk
>> there just fine]
>>
>> In my critical log I've got "raid1: Disk failure on sde, disabling
>> device" and another such message for sdk...is there a way I can
>> re-enable them w/o a reboot?
>>
>> Two related questions:
>> This array is in a SuperMicro server with a 24-drive backplane in the
>> front.  When the two SATA drives failed, there was no LED indication
>> anywhere.  In looking at the backplane manual, there are six I2C
>> connectors that are unused, and I only have the defaults for I2C support
>> in the kernel.  The manual also says that the backplane can use I2C or
>> SGPIO.  Is there a way I can get red-LED-on-drive-failure function (red
>> LEDs come on briefly on the whole backplane at power-on)?
>>
>> I've set up this array and one other 14-drive array on this machine
>> using whole disks - i.e., /dev/sde instead of /dev/sde1 of type fd.  How
>> good/bad is that idea?  One consideration is that I'm wanting to be able
>> to move the arrays to another similar machine in case of a whole-system
>> failure and have the arrays just come up; so far, that has worked fine
>> in tests.
>>
>>
>> _______________________________________________
>> Ale mailing list
>> Ale at ale.org
>> http://mail.ale.org/mailman/listinfo/ale
>> See JOBS, ANNOUNCE and SCHOOLS lists at
>> http://mail.ale.org/mailman/listinfo
>>
>>
>
>
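On the original question of getting sde and sdk back without a reboot: one generic Linux approach, not verified here on a 3ware 9550, is to tell the kernel to forget the dead device through sysfs and then rescan its SCSI host. The 3ware card may first need its own rescan (tw_cli has a `/c0 rescan` command; check the tw_cli documentation for your firmware). If the card itself has failed, none of this will help. The host number appears in the path printed by `readlink -f /sys/block/sde`. The function name and the SYSFS override are hypothetical, the latter only there so it can be tested without touching /sys.

```bash
#!/bin/bash
# Hypothetical helper: drop a dead disk and rescan its SCSI host so a
# hot-swapped replacement can appear without a reboot. Run as root.
# SYSFS can be pointed at a fake tree for testing; defaults to /sys.
SYSFS=${SYSFS:-/sys}

drop_and_rescan() {
    local disk=$1 host=$2    # e.g. "sde" "host0"
    local del="$SYSFS/block/$disk/device/delete"
    local scan="$SYSFS/class/scsi_host/$host/scan"
    # Check the host first so we never drop a disk we cannot rescan for.
    if [ ! -e "$scan" ]; then
        echo "drop_and_rescan: no such SCSI host: $host" >&2
        return 1
    fi
    # Drop the stale device if the kernel still has it.
    if [ -e "$del" ]; then
        echo 1 > "$del" || return 1
    fi
    # "- - -" means wildcard channel, target and LUN on that host.
    echo "- - -" > "$scan"
}
```

After the rescan, the replacement should show up in `dmesg`, possibly under a new name rather than sde, and can then be added back to its RAID1 pair with mdadm.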



-- 
James P. Kinney III
Actively in pursuit of Life, Liberty and Happiness


