
Re: [Condor-users] Issues with checkpointing



Here is the initial output from running the command with the option you
specified:

unixlab03% weiweicase10 -_condor_D_ALL 3
User Job - $CondorPlatform: SUN4X-SOLARIS29 $
User Job - $CondorVersion: 6.8.5 May 17 2007 $
Condor: Notice: Will checkpoint to weiweicase10.ckpt
Condor: Notice: Remote system calls disabled.

<...etc...>
> Brian,
>
>> This is again related to the problem with jobs not checkpointing when
>> evicted. If anyone has any insight, I would appreciate it.
>>
>> The executable is weiweicase10. I get the following message when I
>> run the program on a local station from a terminal:
>>
>> Condor: Notice: Will checkpoint to weiweicase10.ckpt
>> Condor: Notice: Remote system calls disabled.
>> ...
>> <program runs a while>
>> <I press CONTROL-Z to suspend the job>
>>
>> ^ZKilled
>> unixlab03%
>> --------------------
>> and it's killed. I'm wondering whether the job is supposed to be suspended
>> rather than killed in order to be able to checkpoint. This executable
>> was compiled from a Fortran 90 program.
>>
>> In that case, is there something we are supposed to do to make the
>> executable suspendable?
>>
>> Where would the checkpoints be created, and in which directory?
>
> Checkpoints should be created in the current directory.
>
> To get some debugging output, try running it like this:
>
>
> weiweicase10 -_condor_D_ALL [any other args]
>
> The operating system and Condor version may be helpful too.
>
> --
> Daniel K. Forrest	Laboratory for Molecular and
> forrest@xxxxxxxxxxxxx	Computational Genomics
> (608) 262 - 9479	University of Wisconsin, Madison
>


----------------------------------------
Brian C. Dandurand
Clemson University
Department of Mathematical Sciences
Ph.D. Student
Office: Martin Hall E-6
Office Phone: (864)656-4749
----------------------------------------