
[Condor-users] How to use condor checkpointing with SGE



Hi,

I would like to use condor's standalone checkpointing to enable
checkpointing of jobs that are run via Sun Grid Engine (SGE). I've
successfully compiled a toy C program using condor_compile, and I can
successfully run, stop, and resume the job with its checkpoint file.

When I attempt to run my toy using qsub as an SGE job with
checkpointing enabled, the job gets queued up but never runs. The job
runs fine if submitted without checkpointing. Has anyone here
successfully run SGE jobs using condor checkpointing?

For reference, here's my configuration. Within SGE's qmon utility, I
defined a checkpoint object called "condor" with the following
configuration:

Name: condor
Interface: TRANSPARENT
Checkpoint command: NONE
Migrate command: NONE
Clean command: NONE
Checkpoint directory: /tmp
Checkpoint When: xsr
Checkpoint Signal: NONE

To submit the job with checkpointing, I ran this:
qsub -ckpt condor /home/lane/toy.sh -_condor

Where toy.sh is:
#!/bin/bash

/usr/bin/setarch x86_64 -R -L /home/lane/toy -_condor_D_ALL
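
Eventually the script will also need to resume from a checkpoint when
SGE reruns the job. Below is a rough sketch of how I imagine that part,
not something I have working under SGE. The function name toy_command
and the checkpoint path ./toy.ckpt.tmp are just assumptions for
illustration; -_condor_restart is the same flag I use in the standalone
run quoted below. The sketch only prints the command line it would run,
and it leaves out the setarch wrapper for brevity.

```shell
#!/bin/bash
# Hypothetical restart-aware variant of toy.sh (a sketch, untested under SGE).
# Assumption: the checkpoint file sits at a known path such as ./toy.ckpt.tmp.

# Print the command line to run: resume from the checkpoint file if it
# exists, otherwise start the program from scratch.
toy_command() {
    local binary="$1" ckpt="$2"
    if [ -f "$ckpt" ]; then
        printf '%s -_condor_restart %s\n' "$binary" "$ckpt"
    else
        printf '%s\n' "$binary"
    fi
}

# Show what would be run for the toy program.
toy_command /home/lane/toy ./toy.ckpt.tmp
```

In a real job script the printed command would be executed rather than
echoed, and the checkpoint path would have to match wherever the
condor-linked binary actually writes its checkpoint.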


The job as submitted above gets a "qw" status, but never runs. If I
submit the job without "-ckpt condor", then it runs.

Any pointers or tips would be appreciated. I've done quite a bit of
research online; it appears that this should be possible, but I just
haven't had any success figuring out how.

Cheers,
Lane


On Tue, Mar 8, 2011 at 2:03 PM, Lane Schwartz <dowobeha@xxxxxxxxx> wrote:
> Hi, I'm new to condor. I just installed condor 7.4.4 on Centos 5.5,
> and I'm trying to try out standalone checkpointing for the first time.
> Unfortunately, I'm getting a segmentation fault when I try to restart
> a program using a checkpoint file.
>
> I've been following the instructions in section 4.2.1 of the manual
> (http://www.cs.wisc.edu/condor/manual/v6.4/4_2Condor_s_Checkpoint.html).
> Details are below:
>
> I have a program called toy.c:
>
> $ condor_compile gcc -o toy toy.c
> LINKING FOR CONDOR ......(some more output).....
>
> $ ./toy
> ...(TOY PROGRAM OUTPUT)....
>
> (control-Z)
> ...(PROGRAM STOPS)...
>
> $ ./toy -_condor_restart ./toy.ckpt.tmp
> Condor: Notice: Will restart from ./toy.ckpt.tmp
> Segmentation fault
>
>
> My eventual goal is to use condor for transparent checkpointing of
> jobs using SGE (Sun Grid Engine). But at the moment I can't even get
> this toy standalone example to work. (For reference, the source for
> toy.c is below)
>
> If anyone has any tips or pointers, or links to good tutorials on the
> use of standalone checkpointing, I'd be much obliged.
>
> Thanks,
> Lane
>
>
> //toy.c
> #include <stdio.h>
>
> int main(int argc, char **argv) {
>
>   int i;
>   int n;
>
>   n=1024*1024*1024;
>
>   for (i=0; i<n; i+=1) {
>      printf("We calculated: %d^2=%d\n", i, i*i);
>   }
>
>   return 0;
> }
>



-- 
When a place gets crowded enough to require ID's, social collapse is not
far away.  It is time to go elsewhere.  The best thing about space travel
is that it made it possible to go elsewhere.
                -- R.A. Heinlein, "Time Enough For Love"