Locating a segmentation fault in a multithreaded program running on a cluster
It's quite straightforward to use `gdb` to locate a segmentation fault when running a simple program interactively. But suppose we have a multithreaded program, written with pthreads, that is submitted to a cluster node with the `qsub` command, so there is no interactive session.
How can we locate the segmentation fault? I am looking for a general approach, a program, or a test tool. I cannot provide a reproducible example because the program is very large and crashes on the cluster only in some unknown situations.
I need a way to find the problem in this difficult situation, because the program runs correctly on the local machine with any number of threads.
The "normal" approach is to have the environment produce a core file and get hold of it. If this isn't an option, you might want to try installing a signal handler for `SIGSEGV` which, at the very least, dumps a stack trace somewhere. Of course, this immediately leads to the question "how to get a stack trace", but that is answered elsewhere.
The easiest approach is probably to get hold of a core file. Assuming you have a similar machine where the core file can be read, you can use `gdb program corefile` to debug the program `program` that produced the core file `corefile`. You should be able to look at the different threads, their data (to some extent), etc. If you don't have a suitable machine, it may be necessary to build a cross-`gdb` matching the hardware of the machine where the program ran.
I'm a bit confused about the statement that the core files are empty: you can set the limits for core files using `ulimit` in the shell. If the size limit for cores is set to zero, no core file should be produced at all. Producing an empty one seems odd. However, if you cannot change the limits for your program, you are probably down to installing a signal handler and dumping a stack trace from the offending thread.
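If the shell limits are out of reach because the job is started by the batch system, the program can also try to raise its own soft limit. The following is a sketch using the POSIX `getrlimit`/`setrlimit` calls; the function name `raise_core_limit` is made up for illustration. It only helps when the hard limit is not itself zero, and whether a usable core file appears still depends on how the cluster node is configured:

```c
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
int raise_core_limit(void)
{
    struct rlimit limit;

    if (getrlimit(RLIMIT_CORE, &limit) != 0)
        return -1;

    /* An unprivileged process may raise its soft limit up to the hard limit. */
    limit.rlim_cur = limit.rlim_max;
    return setrlimit(RLIMIT_CORE, &limit);
}
```

Calling this at the very start of `main()` corresponds roughly to running `ulimit -c` with the hard limit as its value in the shell before the program starts.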
Thinking about it, you may be able to put the program to sleep in the signal handler and attach to it with a debugger, assuming you can run a debugger on the corresponding machine. You would determine the process ID (using, e.g., `ps -elf | grep program`) and then attach to it using

    gdb program pid

I'm not sure of the best way to put a program to sleep from within the program, though (possibly by raising `SIGSTOP` from the `SIGSEGV` handler...).
That said, I assume you have tried running your program on your local machine? Some problems are fundamental enough that they show up without a distributed system running many threads on each node. This is, obviously, not a replacement for the approach above.