[ale] One for the archives

James P. Kinney III jkinney at localnetsolutions.com
Sat Mar 3 21:27:03 EST 2007


A server got hosed because of the following series of failures. Since
the final step was a major "GOTCHA", I am sharing it here now so that
others can avoid the pain later.

Background:

Main SOHO server with a SCSI card for tape backup (an old DLT 7000) and
4x 200GB SATA drives in a software RAID setup. The main data storage
area (a large Samba share) was spread across all 4 drives in a RAID 5
array.

System hiccups and reports a failed drive (it won't spin up at all). No
problem. It is not a hot-swap system, so it is taken down, the drive is
replaced, and the system is rebooted to run-level 1. A tail
of /proc/mdstat on the console shows the system doing a recovery/rebuild
onto the new hard drive. Everything looks good.

After some period of time (approximately 10-20 minutes) the system is
seen REBOOTING!

It was assumed that all was OK, since no forced filesystem checks
occurred after the reboot. It was quite odd that the server would shut
down like that. About 2-3 minutes later, it rebooted itself again.

At this time it was determined that the power supply was failing.

It was replaced.

Later, it was determined that almost all of the files in the Samba share
were scrambled, and that the backup application (Bacula) had lost all of
its config files and the backup catalog.

Then the database failed to start.

Panic begins to creep in. The power blink during the drive recovery had
apparently caused massive damage to the storage systems.

A new drive and a fresh OS were installed. The old RAID arrays were
mounted in order to extract what was usable from the Samba shares. Email
files and home directories recovered OK, but the Samba shares were still
scrambled, as were the backup catalog and database.
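
For anyone doing the same salvage job, assembling the old arrays
read-only on the new install went roughly like this (array and mount
point names below are examples, not the exact commands used):

    # scan the old drives for md superblocks and assemble, read-only
    mdadm --assemble --scan --readonly

    # mount the salvaged array read-only so nothing else gets touched
    mkdir -p /mnt/salvage
    mount -o ro /dev/md0 /mnt/salvage

    # copy out whatever looks intact (home dirs, mail spools, etc.)
    rsync -a /mnt/salvage/home/ /srv/recovered/home/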

So the process was begun to extract the backup catalog off the tapes.
Searching for the catalog files is a painfully laborious task on a
poky-slow tape drive when there are 21 tapes to sift through.
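
For the curious, the rough shape of that hunt with Bacula's own tools is
below. This is a sketch only; the volume name, SD config path, and tape
device are all assumptions:

    # list the job/file records on a given tape volume
    # (slow -- it has to read the whole tape)
    bls -j -V Full-0007 -c /etc/bacula/bacula-sd.conf /dev/nst0

    # rebuild catalog entries from the tape into a working database
    bscan -v -s -m -V Full-0007 -c /etc/bacula/bacula-sd.conf /dev/nst0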

While the backups were being hunted down, calendar time marched on and
several weeks went by with no working backups (there was only one tape
drive, and it spent all day "collecting its thoughts" for the recovery).
A file from the Samba share was found to be clearly scrambled and
worthless (an installation disk image for an application that had been
stored along with an md5 checksum). So it was deleted, since the
original disk was available and it would need to be recopied anyway.
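
(Proving the scrambling in that case was trivial because a checksum had
been stored alongside the image; something like the following, with
made-up file names, is all it takes.)

    # compare the stored checksum against the file as it now sits on disk
    md5sum -c appdisk.iso.md5
    # prints "appdisk.iso: FAILED" if the contents no longer match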

The delete took a long time to return from.

The entire filesystem had been deleted.

Everything. All files. 

The file was deleted from within its containing directory using the
command rm <filename> and then answering "yes" to the "are you sure"
prompt.

As far as can be discerned, the filesystem corruption was bad enough
that the delete was redirected to another point in the filesystem, where
the massive deletion occurred.
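
In hindsight, on a filesystem that suspect, the sane move is to look
before touching anything, along these lines (device and path names are
assumed for illustration):

    # see what the path actually resolves to before removing anything
    ls -lid /mnt/salvage/software/appdisk.iso

    # then unmount and let fsck report (without fixing) how bad it is
    umount /mnt/salvage
    fsck -n /dev/md0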

The moral of this story is three-fold:

1. Bare-metal recovery of the backup system is both hard and more
important than air.

2. Any filesystem that becomes corrupted because of a RAID 5 malfunction
should not be trusted at all under any circumstances. It should be
removed from the system and overwritten immediately and the contents
recovered from backups (see the sketch after this list).

3. Any time a drive fails in a RAID system, go ahead and replace the
power supply for safety reasons. Unless it is a redundant power supply
(this was not) it will certainly cost less than the antacid bill on
this.
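
For point 2, "overwritten immediately" means something like the
following once everything salvageable is off the array (device names are
examples only):

    # stop the old array and wipe the md metadata off each member
    mdadm --stop /dev/md0
    mdadm --zero-superblock /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

    # rebuild the array, put a fresh filesystem on it, restore from tape
    mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[b-e]1
    mkfs.ext3 /dev/md0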

-- 
James P. Kinney III          
CEO & Director of Engineering 
Local Net Solutions,LLC        
770-493-8244                    
http://www.localnetsolutions.com

GPG ID: 829C6CA7 James P. Kinney III (M.S. Physics)
<jkinney at localnetsolutions.com>
Fingerprint = 3C9E 6366 54FC A3FE BA4D 0659 6190 ADC3 829C 6CA7