[ale] How to debug a program that just goes away

Jim Lynch ale_nospam at fayettedigital.com
Fri Feb 26 07:23:07 EST 2010


I have a multi-threaded C++ program that occasionally just stops
running.  When it stops, it is usually idle: every thread is either
waiting on a semaphore or sleeping (Thread::sleep).  It's event driven,
and no events have arrived for some time.  I have lots of print
statements to tell where it is and what it's doing.  No core file is
generated.  There are no strange messages in any log file, either system
or application, and no rogue processes are killing it off.

The program runs successfully on multiple other machines but not this 
one.  It's a newer system than the others.  I recompiled on this system,
thinking it might help, but it didn't.  Access to this system is limited to two
people, myself and one other.  I trust him since he's got more to lose 
than I do if it doesn't work.  I can work around it with a wrapper, 
restarting when it fails, but I'd really like to understand how it's 
happening. 
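
The wrapper workaround could also record how the program ended, which
would at least distinguish a normal exit from death by a signal.  What
follows is only a minimal sketch, assuming a POSIX system; the names
describe_status and run_once are hypothetical and are not taken from the
actual program.

```cpp
#include <cerrno>
#include <sstream>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Turn a waitpid() status into a readable line for the wrapper's log.
// (Hypothetical helper, for illustration only.)
std::string describe_status(int status) {
    std::ostringstream out;
    if (WIFEXITED(status)) {
        out << "exited with code " << WEXITSTATUS(status);
    } else if (WIFSIGNALED(status)) {
        out << "killed by signal " << WTERMSIG(status);
#ifdef WCOREDUMP
        if (WCOREDUMP(status)) out << " (core dumped)";
#endif
    } else {
        out << "unknown status " << status;
    }
    return out.str();
}

// Run the program once and return its wait status,
// or -1 if fork() or waitpid() itself failed.
int run_once(char* const argv[]) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);  // only reached if exec failed
    }
    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return -1;  // retry only when interrupted
    }
    return status;
}
```

A restart loop would call run_once() repeatedly and write
describe_status() of each result to the log before starting the program
again.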

I have ulimit -c 50000 in the script that runs it, so a core file should
be generated if it aborts.  I trap SIGHUP, SIGINT, SIGCHLD and SIGQUIT
and will see something in the log file if one of those signals is
trapped.  It's on a CentOS 4.7 system, the same OS as the other running
systems.  The only difference is that this is a newer, considerably
faster dual-core system.  I've run both a conventional kernel and an
OpenVZ kernel.  I'm compiling with the "-g -O2" flags.
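
For reference, the signal trapping described above might look roughly
like the following.  This is a sketch under assumptions, not the
program's real code: the names log_signal, g_last_signal and
install_logging_handlers are hypothetical, and it writes to stderr
rather than to the application's log file.

```cpp
#include <signal.h>
#include <string.h>
#include <unistd.h>

// Last signal seen by the handler (hypothetical name).
// sig_atomic_t is safe to write from inside a signal handler.
static volatile sig_atomic_t g_last_signal = 0;

// Handler limited to async-signal-safe calls: write(), no printf or malloc.
extern "C" void log_signal(int signo) {
    g_last_signal = signo;
    char msg[] = "caught signal 00\n";
    msg[14] = static_cast<char>('0' + (signo / 10) % 10);
    msg[15] = static_cast<char>('0' + signo % 10);
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
}

// Install the handler for the four signals named in the post.
// Any signal not listed here keeps its default action.
// Returns true if every sigaction() call succeeded.
bool install_logging_handlers() {
    const int signals[] = {SIGHUP, SIGINT, SIGCHLD, SIGQUIT};
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = log_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;  // restart interrupted system calls
    for (size_t i = 0; i < sizeof(signals) / sizeof(signals[0]); ++i) {
        if (sigaction(signals[i], &sa, 0) != 0) return false;
    }
    return true;
}
```

Calling install_logging_handlers() once at startup, before any threads
are created, is enough for the handler to apply to the whole process.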

I have no idea how to proceed from here.  Can anyone suggest something I
could do to find out what the cause is?

Thanks,
Jim.

