
Re: [Condor-users] Segfault when resuming from checkpoint



Hi,
On Fri, Mar 30, 2007 at 04:43:59PM +0200, weeber@xxxxxxxxxxxxxxxxxxxx wrote:
> Hi,
> I have a problem with jobs that segfault when resuming from a checkpoint after they were evicted.
> As far as I can see from the ShadowLog, the last thing that happens is that the state of the "/dev/null" file handle is restored.
> That seems to mean that the segfault occurs before execution of the user code is resumed.
I did some more testing and found the following:
* The problem also occurs with stand-alone checkpointing. When I run the program with -_condor_restart and the image name, it sometimes(!) segfaults. If I try the same image again immediately, it works most of the time, but not always.
* It is independent of the specific program (I used a test program that only counts up an integer, and also one that just calls ckpt_and_exit() in an endless loop).
* The problem occurs on Suse 10.0 with Gcc 4.0 and on Suse 10.2 with Gcc 4.1.
* It does not occur on Suse 9.x with Gcc 3.4

Does anyone have any idea what to try next?

Thanks a lot in advance,
Rudolf