[ale] Ganglia problem

Dow_Hurst dhurst at mindspring.com
Sat Apr 1 20:40:36 EST 2006


Had a power failure at UNCG on Friday evening.  Today, after a cold boot of the head node and slave nodes, ganglia's gmond doesn't seem to talk to other nodes.  Each node, the head included, sees only itself when gstat -a is issued.  I've beaten on this all day and haven't figured out the problem or even how to diagnose it properly.

We are using the Ammasso interfaces for cluster communications, so I don't have a typical ethX to work through.  My interface for the cluster communications is ccilnet0.

Another symptom is that our computational program, NAMD, is having communication problems as well.  Small jobs started with mpirun (1-8 CPUs) start fine and seem to write to the output files with no problem.  Larger jobs utilizing more CPUs either freeze at the same point in the run (~30 CPUs) or fail erratically in the middle of the run (~12 CPUs).

I was wondering if anyone has any ideas on how to approach diagnosing the problem?  The system runs CentOS 4.0 (Final) on the head node and all slave nodes.  I have a local repository on the head node to use for software installation in case you suggest a tool not already installed.  I am hesitant to point yum outside of the local repository for this server.  Rebooting the SMC8624T switch did not help with the gmond communication.

Thanks,
Dow


No sig.
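
One way to approach the gmond symptom described above is to test whether multicast traffic crosses ccilnet0 at all, independent of Ganglia.  What follows is only a minimal Python 3 sketch of such a probe, not a Ganglia tool.  It rests on assumptions that are not in the original post: that gmond is still on its stock multicast group (239.2.11.71), that Python 3 is available on the nodes, and that each node knows the IPv4 address bound to ccilnet0.  PROBE_PORT and the 10.0.0.x addresses are hypothetical placeholders; the port is deliberately different from gmond's so the probe does not collide with a running gmond.

```python
import socket
import struct

# Assumption: gmond's stock multicast group. Adjust if gmond.conf says otherwise.
GROUP = "239.2.11.71"
# Hypothetical test port, chosen to differ from gmond's own port.
PROBE_PORT = 18649


def membership_request(group, iface_ip):
    """Build the ip_mreq structure: multicast group plus local interface address."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface_ip))


def listen(iface_ip, group=GROUP, port=PROBE_PORT, timeout=30.0):
    """Join the group on the interface owning iface_ip; return the senders heard."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group, iface_ip))
    sock.settimeout(timeout)
    seen = set()
    try:
        while True:
            _, (addr, _) = sock.recvfrom(65535)
            if addr not in seen:
                seen.add(addr)
                print("heard from", addr)
    except socket.timeout:
        pass  # nothing (more) arrived within the timeout
    finally:
        sock.close()
    return seen


def send(iface_ip, group=GROUP, port=PROBE_PORT, message=b"mcast-probe"):
    """Send one datagram to the group, forced out the interface owning iface_ip."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                    socket.inet_aton(iface_ip))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    sock.sendto(message, (group, port))
    sock.close()
```

To use it, call listen("10.0.0.1") on one node and send("10.0.0.2") on another, substituting each node's real ccilnet0 address.  If the listener hears nothing, that would suggest (but not prove) that multicast is not going over ccilnet0 after the reboot, for example because a multicast route is missing; if it does hear the sender, the problem more likely lies in the gmond configuration itself.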


