
Re: [Condor-users] Issues with checkpointing

Brian,

> This is again related to the problem with jobs not checkpointing when
> evicted. If anyone has any insight, I would appreciate it.
> 
> The executable is weiweicase10. I get the following message when I
> run the program on a local station from a terminal:
> 
> Condor: Notice: Will checkpoint to weiweicase10.ckpt
> Condor: Notice: Remote system calls disabled.
> ...
> <program runs a while>
> <I press CONTROL-Z to suspend the job>
> 
> ^ZKilled
> unixlab03%
> --------------------
> and it's killed. I'm wondering whether the job is supposed to be suspended
> rather than killed in order to checkpoint. This executable
> was compiled from a Fortran 90 program.
> 
> In that case, is there something we are supposed to do to make the
> executable suspendable?
> 
> Where would the checkpoints be created (in which directory)?

Checkpoints should be created in the current working directory, i.e. the directory you start the program from.

Try running it like this to get some debugging output:

weiweicase10 -_condor_D_ALL [any other args]
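If it helps, here is a minimal sketch for checking whether the checkpoint file actually appeared afterwards. It assumes a POSIX shell and that the file is named after the "Will checkpoint to" notice (weiweicase10.ckpt); the helper name ckpt_status and the log file name are hypothetical, made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Condor): report whether a checkpoint
# file named <program>.ckpt exists in the current working directory.
ckpt_status() {
    prog=$1
    if [ -f "./$prog.ckpt" ]; then
        echo "found $prog.ckpt"
    else
        echo "no $prog.ckpt in $(pwd)"
    fi
}

# Example usage (assumes the binary is in the current directory; the
# debug flag is the one suggested above, the log name is arbitrary):
#   ./weiweicase10 -_condor_D_ALL > weiweicase10.debug.log 2>&1
#   ckpt_status weiweicase10
```

Redirecting to a log file (rather than piping through another command) keeps CONTROL-Z aimed at the job itself.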

Your operating system and Condor version would also be helpful.

-- 
Daniel K. Forrest	Laboratory for Molecular and
forrest@xxxxxxxxxxxxx	Computational Genomics
(608) 262 - 9479	University of Wisconsin, Madison