
[Condor-users] checkpointing produces segfault



Hi,

I have a somewhat strange problem. I linked my code with condor_compile, and at first everything worked fine, including checkpointing. Now it has stopped working; more precisely, the program segfaults at random times but otherwise runs fine. Only about 50% of jobs seem to be affected. I have no idea which component of the system changed. The user log shows something like:

001 (1627.026.000) 02/24 19:08:41 Job executing on host: <144.92.180.55:33798>
...
005 (1627.026.000) 02/24 19:08:41 Job terminated.
        (0) Abnormal termination (signal 11)
        (0) No core file
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        2432  -  Run Bytes Sent By Job
        3003342  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
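
To put a number on the "about 50%" estimate, the user log can be tallied by script. The following is only a minimal sketch: it assumes every event header has the same shape as the two shown in the excerpt above (event code, then the job ID in parentheses) and that code 005 marks a terminated job, as it does there. The function name count_terminations is made up for illustration.

```python
import re
from collections import Counter

# Event header, e.g. "005 (1627.026.000) 02/24 19:08:41 Job terminated."
# Assumption: all headers follow the shape seen in the excerpt.
EVENT_RE = re.compile(r"^(\d{1,3}) \((\d+)\.(\d+)\.(\d+)\)")
# Detail line, e.g. "(0) Abnormal termination (signal 11)"
SIGNAL_RE = re.compile(r"Abnormal termination \(signal (\d+)\)")


def count_terminations(log_text):
    """Count terminate events and, among them, terminations by signal."""
    counts = Counter()
    in_terminated = False
    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            # A new event starts; remember whether it is a "Job terminated" event.
            in_terminated = int(header.group(1)) == 5
            if in_terminated:
                counts["terminated"] += 1
            continue
        if in_terminated:
            match = SIGNAL_RE.search(line)
            if match:
                counts["signal " + match.group(1)] += 1
    return counts
```

Given the full text of a user log, counts["signal 11"] divided by counts["terminated"] gives the fraction of finished jobs that died with a segfault.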

I tried running this job by hand on that machine and it works, with no segfaults. So I looked in more detail and tried to force a checkpoint by sending SIGTSTP, and voilà, I get a segfault. When I look at the core dump, the stack always looks like this:

#0  0x08102788 in adler32 ()
#1  0x080fde76 in fill_window ()
#2  0x080fdc61 in deflate_slow ()
#3  0x080fcc87 in deflate ()
#4  0x080c704b in SegMap::Write ()
#5  0x080c682c in Image::Write ()
#6  0x080c6503 in Image::Write ()
#7  0x080c6382 in Image::Write ()
#8  0x080c7751 in Checkpoint ()
#9  <signal handler called>
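
The manual test described above (start the program, send SIGTSTP, see how it ends) can be repeated in a loop to check how often the crash occurs. This is a rough sketch, not a Condor tool: the helper name send_and_wait, the delay and the timeout are all arbitrary choices for illustration, and the command line of the program under test has to be supplied by the caller.

```python
import signal
import subprocess
import time


def send_and_wait(argv, sig=signal.SIGTSTP, delay=1.0, timeout=10.0):
    """Start argv, send sig after `delay` seconds, and describe how it ended."""
    proc = subprocess.Popen(argv)
    time.sleep(delay)            # let the program get going first
    proc.send_signal(sig)        # SIGTSTP is what triggered the checkpoint here
    try:
        code = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The process neither exited nor crashed; clean it up ourselves.
        proc.kill()
        proc.wait()
        return "still running after signal (killed by this script)"
    if code < 0:
        # A negative return code means the process was terminated by a signal.
        return "killed by " + signal.Signals(-code).name
    return "exited with status %d" % code
```

A result of "killed by SIGSEGV" corresponds to the signal 11 termination reported in the user log; varying the delay may help show whether the crash depends on when the checkpoint is taken.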

The innermost frame is always 'adler32', so the crash appears to happen there while the checkpoint image is being written. Searching the list archive, I found one message describing a similar problem, but no solution. Any help would be much appreciated.


Thanks,
Patrick
--
Dr. Patrick Huber                       Physics Department
                                        University of Wisconsin
Tel.:+1 608 262 2886                    1150 University Avenue
http://pheno.physics.wisc.edu/~phuber   Madison, WI 53706, USA