Locating a segmentation fault in a multithreaded program running on a cluster

It's quite straightforward to use gdb to locate a segmentation fault when running a simple program interactively. But suppose we have a multithreaded program, written with pthreads, that is submitted to a cluster node with the qsub command, so there is no interactive session.

How can we locate the segmentation fault? I am looking for a general approach, a program or a test tool. I cannot provide a reproducible example, as the program is really big and crashes on the cluster only in some unknown situations.

I have to track the problem down on the cluster itself, because the program runs correctly on my local machine with any number of threads.


The "normal" approach is to have the environment produce core files and get hold of them. If this isn't an option, you might want to try installing a signal handler for SIGSEGV which obtains, at least, a stack trace and dumps it somewhere. Of course, this immediately leads to the question "how to get a stack trace", but this is answered elsewhere.
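As a rough illustration of the signal-handler idea, here is a minimal sketch that assumes a Linux system with glibc, whose backtrace() and backtrace_symbols_fd() functions are declared in execinfo.h. The names segv_handler and install_segv_handler are made up for this example. Note that backtrace() is not formally async-signal-safe, so treat this as a debugging aid rather than something to rely on.

```c
#define _GNU_SOURCE
#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define MAX_FRAMES 64

/* Write raw bytes to stderr; write() is async-signal-safe, printf() is not. */
static void put_stderr(const char *s, size_t n)
{
    ssize_t r = write(STDERR_FILENO, s, n);
    (void)r;
}

/* Hypothetical handler: runs in the thread that faulted and dumps its stack. */
static void segv_handler(int sig)
{
    static const char msg[] = "SIGSEGV caught, stack of the faulting thread:\n";
    void *frames[MAX_FRAMES];
    int n;

    put_stderr(msg, sizeof msg - 1);
    n = backtrace(frames, MAX_FRAMES);
    backtrace_symbols_fd(frames, n, STDERR_FILENO); /* does not call malloc */

    /* Restore the default action and re-raise, so the process still
       terminates with SIGSEGV (and dumps core if the limits allow it). */
    signal(sig, SIG_DFL);
    raise(sig);
}

/* Call once early in main(), before the threads are created.
   Returns 0 on success, -1 on failure (like sigaction). */
int install_segv_handler(void)
{
    void *warmup[1];
    struct sigaction sa;

    /* The first backtrace() call may load a support library; doing it
       here keeps that work out of the signal handler. */
    backtrace(warmup, 1);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = segv_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Signal dispositions are shared by all threads of a process, so one call is enough. Building with -g -rdynamic makes function names appear in the output; without it you get raw addresses that can be resolved afterwards with addr2line. On a cluster, stderr usually ends up in the job's error file.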

The easiest approach is probably to get hold of a core file. Assuming you have a similar machine where the core file can be read, you can run gdb program corefile, where program is the executable that crashed and corefile is the core file it produced. You should be able to look at the different threads (info threads, thread apply all bt), their data (to some extent), etc. If you don't have a suitable machine, it may be necessary to build a cross-gdb that targets the hardware of the machine where the program ran.

I'm a bit confused about the statement that the core files are empty: You can set the limits for core files using ulimit on the shell. If the size for cores is set to zero it shouldn't produce any core file. Producing an empty one seems odd. However, if you cannot change the limits on your program you are probably down to installing a signal handler and dumping out a stack trace from the offending thread.
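If the limit cannot be changed in the job script (for example with ulimit -c unlimited before the program starts), the program can try to raise its own soft limit up to the hard limit. The following is a small sketch using the POSIX getrlimit() and setrlimit() calls; the function name enable_core_dumps is an assumption made for this example.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit.
   Returns 0 on success, -1 if the limit could not be read or set. */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        perror("getrlimit(RLIMIT_CORE)");
        return -1;
    }

    /* An unprivileged process may raise its soft limit up to the hard limit. */
    lim.rlim_cur = lim.rlim_max;
    if (setrlimit(RLIMIT_CORE, &lim) != 0) {
        perror("setrlimit(RLIMIT_CORE)");
        return -1;
    }

    if (lim.rlim_max == 0)
        fprintf(stderr, "hard core limit is 0: no core file will be written\n");
    return 0;
}
```

Call it at the start of main(). If the hard limit itself is zero, this cannot help and only the administrator (or the batch system configuration) can change it. Where the core file is written also depends on the node's configuration, for example kernel.core_pattern on Linux.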

Thinking about it, you may be able to put the program to sleep in the signal handler and attach to it with a debugger, assuming you can run a debugger on the corresponding machine. You would determine the process ID (using, e.g., ps -elf | grep program) and then attach to it using

gdb program pid

I'm not sure what the best way is to put a program to sleep from within the program, though. SIGSTOP itself cannot be caught or handled, but the SIGSEGV handler could possibly call raise(SIGSTOP) or simply loop on sleep().
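One possible way to do this, sketched under the assumption of a POSIX system, is a SIGSEGV handler that prints the process ID and then loops on sleep() until a debugger clears a flag. All names here (wait_for_debugger, segv_wait_handler, install_wait_handler, format_decimal) are invented for the example.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Cleared from the debugger ("set var wait_for_debugger = 0") to continue. */
static volatile sig_atomic_t wait_for_debugger = 1;

/* Format a non-negative number as decimal text without calling printf,
   which is not async-signal-safe. Returns the number of characters. */
static size_t format_decimal(long value, char *buf, size_t size)
{
    char tmp[24];
    size_t n = 0, len = 0;

    do {
        tmp[n++] = (char)('0' + value % 10);
        value /= 10;
    } while (value > 0 && n < sizeof tmp);
    while (n > 0 && len + 1 < size)
        buf[len++] = tmp[--n];
    buf[len] = '\0';
    return len;
}

static void put_stderr(const char *s, size_t n)
{
    ssize_t r = write(STDERR_FILENO, s, n);
    (void)r;
}

/* Hypothetical handler: announce the PID, then wait to be attached to. */
static void segv_wait_handler(int sig)
{
    static const char prefix[] = "SIGSEGV in pid ";
    static const char suffix[] = ", waiting for a debugger to attach\n";
    char pid[24];
    size_t len = format_decimal((long)getpid(), pid, sizeof pid);

    (void)sig;
    put_stderr(prefix, sizeof prefix - 1);
    put_stderr(pid, len);
    put_stderr(suffix, sizeof suffix - 1);

    while (wait_for_debugger)
        sleep(1); /* sleep() is async-signal-safe */
}

/* Returns 0 on success, -1 on failure (like sigaction). */
int install_wait_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = segv_wait_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Once the message shows up in the job's error output, log in to that node, attach with gdb program pid, and run thread apply all bt. The thread sitting in segv_wait_handler is the one that faulted, and the frames above the handler show where. Only the faulting thread is parked in the loop; the other threads keep running until the debugger attaches.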

That said, I assume you tried running your program on your local machine...? Some problems are more fundamental than needing a distributed system of many threads running on each node. This is, obviously, not a replacement for the approach above.
