[ale] Linux HA

Fri Nov 2 18:21:28 EDT 2007

Good failover is non-trivial.

Most failover schemes assume that the primary system either is working
100% or is completely dead.  All of the experienced people on this list
know that frequently a system becomes "sick" but is not completely dead.

Reliably detecting that a system is sick is very hard.

The popular failover detection method of pinging the primary (or checking
that it answers ARP requests) is just plain wrong.  The kernel answers an
ICMP echo request at interrupt time, without running any user process.
Thus, a ping will NOT detect any of the following:

     1. The system is out of memory (so that no more processes can
        be forked, preventing normal operation).

     2. The CPU utilization is 100% (so that programs will not get
        any CPU time and thus are hung).

     3. The system's disk has failed.

     4. The system's Ethernet card has failed.

     5. The Backup system has suffered an Ethernet failure (or other
        problems).
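
As an illustration only (this is not the failover technology described
later in this message), here is a minimal Python sketch of probes that
exercise what a ping cannot: creating a process and doing real disk I/O,
each under a timeout so that a hung probe counts as a failure.  The
function names, the 5-second timeout, and the choice of probes are all
assumptions made for the example.  It covers items 1 through 3 only;
Ethernet failures on either machine need a separate check.

```python
import os
import subprocess
import sys
import tempfile
import threading


def check_fork():
    """Start and reap a trivial child process."""
    try:
        return subprocess.run([sys.executable, "-c", "pass"]).returncode == 0
    except OSError:
        # Typically raised when the process cannot be created at all.
        return False


def check_disk(directory):
    """Write a small file, force it to the device, and read it back."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as f:
            f.write(b"probe")
            f.flush()
            os.fsync(f.fileno())  # push the data past the page cache
            f.seek(0)
            return f.read() == b"probe"
    except OSError:
        return False


def run_with_timeout(check, timeout):
    """Run check() in a daemon thread; a hang or exception counts as failure."""
    result = []
    worker = threading.Thread(target=lambda: result.append(check()), daemon=True)
    worker.start()
    worker.join(timeout)
    # An empty list means the probe is still stuck or raised an exception.
    return bool(result) and result[0] is True


def probe_system(directory, timeout=5.0):
    """Return a dict mapping each probe name to True (passed) or False."""
    return {
        "fork": run_with_timeout(check_fork, timeout),
        "disk": run_with_timeout(lambda: check_disk(directory), timeout),
    }
```

Calling probe_system("/var/tmp") on a healthy machine should return a
dict with both values True.  The probes run on the primary itself, so
the result still has to reach the backup somehow, and the backup must
treat a missing or late report as a failure.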

Finally, the only reliable way to shut down a sick system -- once it has
been determined to not be working correctly -- is to either shut off
power or disconnect it from the network.
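
A rough Python sketch of how that rule might fit into the takeover
decision follows.  The three callbacks are hypothetical placeholders
(no real power-switch or network API is implied), and fencing the
primary before claiming its address is an assumption of the sketch,
chosen so that two machines never answer for the same IP.

```python
def fail_over(primary_is_healthy, fence_primary, take_over_ip):
    """Decide what the backup should do; returns the action taken.

    primary_is_healthy, fence_primary and take_over_ip are hypothetical
    callables supplied by the caller.  fence_primary must return True
    only once power is off or the primary is cut off from the network.
    """
    if primary_is_healthy():
        return "no-action"
    if not fence_primary():
        # The primary may still hold the address; do not claim it.
        return "fence-failed"
    take_over_ip()
    return "took-over"
```

For example, fail_over(lambda: False, lambda: True, lambda: None)
returns "took-over", while a fence_primary that returns False leaves
the address alone and returns "fence-failed".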

My failover technology addresses all of the above problems and has been
in production for years.  It is based on research I did at GTE
Laboratories for the cellular phone network.

Bob Toxen
bob at verysecurelinux.com               [Please use for email to me]
http://www.verysecurelinux.com        [Network&Linux/Unix security consulting]
http://www.realworldlinuxsecurity.com [My book:"Real World Linux Security 2/e"]
Quality Linux & UNIX security and SysAdmin & software consulting since 1990.
Quality spam and virus filters.

"Microsoft: Unsafe at any clock speed!"
   -- Bob Toxen 10/03/2002

On Wed, Oct 31, 2007 at 03:21:41PM -0400, Christopher Fowler wrote:
> I've been testing some stuff in regards to Linux HA today.  Normally we
> sell 2 servers.  One is a "master" and the other is a "slave".  I've
> been testing today the capability to use a floating IP address and allow
> the slave to take over for the master.  I have a few issues that do need
> to be resolved before I can roll this out.  In my lab and colo I
> experienced 2 issues that HA could not have saved me from.
> 
> #1.  Kernel not responding.
> 
> In this case I can ping the server.  All connect()'s from clients
> seem to hang until they timeout.  In this scenario my slave will take
> the IP address but the master will still have it and still answer pings.
> Also he will still answer arp requests.  HA can't save me here.
> 
> #2.  Kernel and programs still respond but disks are off
> 
> In this case I/O to drives was hosed.  Apache would serve up pages that
> were in memory but any request in a page on disk would result in that
> connection hanging forever.  No I/O possible.  In this scenario the
> heartbeat agent will probably still see a server that is working but the
> reality would be a DoS condition.  Also upon seeing this issue I'm still
> left with a server who will not relinquish his IP address.
> 
> In both cases it seems my only recourse is to allow my slave to also
> control the power of the master.  If #1 and #2 exist the slave can
> simply take the floating IP and make a determination if he needs to kill
> power.  If so he can kill power and then the master can be repaired.
> 
> Ideas?
> 
> Chris