Re: [Condor-users] Standalone checkpoint error ...



On Fri, Feb 03, 2006 at 03:29:44PM +0000, Goncalo Borges wrote:
> 
> Hello everybody,
> 
> I'm trying to use the standalone checkpoint features provided by condor in 
> our cluster. Here are the features of our machines:
> 
> [goncalo@lflip02 ~]$ uname -a
> Linux lflip02.lip.pt 2.4.21-32.0.1.ELsmp #1 SMP Wed May 25 15:42:26 CDT 
> 2005 i686 i686 i386 GNU/Linux
> 

That kernel probably has address space randomization enabled, which breaks
Condor checkpointing (memory regions are no longer at the addresses we expect).

<...>
> 
> I have compiled the ever.c program: 
> 
<...>
> When I test the program interactively, it starts running with 
> the right messages:
> 
> [goncalo@lflip02 ~]$ ./ever
> Condor: Notice: Will checkpoint to ./ever.ckpt
> Condor: Notice: Remote system calls disabled.
> 
> Then, after logging in on another console, I do a "kill -s USR2 <pid>".
> The program is stopped with a segmentation fault error and it creates an 
> ever.ckpt.tmp file.
>  
> [goncalo@lflip02 ~]$ ./ever
> Condor: Notice: Will checkpoint to ./ever.ckpt
> Condor: Notice: Remote system calls disabled.
> Segmentation fault (core dumped)
> 

Yeah, that's not what you should see.

> 
> Then, I try to restart the program using the ever.ckpt.tmp file but it is 
> immediately killed.
> 

Yup, the .tmp file isn't a complete checkpoint.

> [goncalo@lflip02 ~]$ ./ever -_condor_restart ever.ckpt.tmp
> Condor: Notice: Will restart from ever.ckpt.tmp
> Killed
> 
> I guess this is not the expected behaviour. Maybe there is an obvious 
> reason why this is happening, which I'm forgetting.
> 

You need to run your program under the old memory layout:

[goncalo@lflip02 ~]$ setarch i386 ./ever 

and then, to restart,

[goncalo@lflip02 ~]$ setarch i386 ./ever -_condor_restart ever.ckpt

(Condor automatically does the equivalent of a 'setarch i386' before running
standard universe jobs, which is why it works inside of Condor)
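
If you end up scripting the two commands above, something along these lines
could choose between them. It is only a sketch: `ckpt_cmd` is a made-up helper
name, not part of Condor, and it assumes that a non-empty ever.ckpt means the
checkpoint finished (the .tmp file is never used).

```shell
# Hypothetical helper (not part of Condor): print the command line to use.
# Assumption: a non-empty checkpoint file is a complete checkpoint.
ckpt_cmd() {
    prog=$1    # standalone-checkpointing binary, e.g. ./ever
    ckpt=$2    # completed checkpoint file, e.g. ./ever.ckpt (not the .tmp)
    if [ -s "$ckpt" ]; then
        # A checkpoint exists: restart from it under the old memory layout.
        echo "setarch i386 $prog -_condor_restart $ckpt"
    else
        # No checkpoint yet: fresh start under the old memory layout.
        echo "setarch i386 $prog"
    fi
}
```

Calling `ckpt_cmd ./ever ./ever.ckpt` prints the command without running it,
so you can check it before pasting it into the shell.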

-Erik