
Re: [Condor-users] error using checkpointing



Roberto Nunnari wrote:
Hello.

I'm new to Condor and to checkpointing, but we have a small cluster
here, and I'd like to introduce checkpointing.

As a queueing system we use SGE, and at present we don't plan to
change that.

So I'm testing Condor checkpointing, but whatever I do, I always get
errors: the .ckpt file is never created, only the .ckpt.tmp file.

To test it, I use a simple program that prints a counter and then
calls nanosleep() to sleep for 1 second.

An update: I tried with an even simpler program, without nanosleep,
but the result is always the same.

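#include <stdio.h>

int main(int argc, char *argv[]) {
  unsigned long i = 0;  /* unsigned, so wraparound is well defined */
  while (1) {
    /* Print a progress marker every million iterations. */
    if (i % 1000000 == 0) {
      printf("%lu ", i);
      fflush(stdout);
    }
    i++;  /* increment on every pass, not only when printing */
  }
  return 0;
}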

Thank you for your help
Best regards.
Robi



First, I get warnings during compilation:

$ condor_compile gcc -o blah4 blah.c
LINKING FOR CONDOR : /usr/bin/ld -L/opt/condor-7.4.4/lib -Bstatic --eh-frame-hdr -m elf_x86_64 --hash-style=gnu -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o blah4 /opt/condor-7.4.4/lib/condor_rt0.o /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crti.o /usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtbeginT.o -L/opt/condor-7.4.4/lib -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 /tmp/ccMbjxCs.o /opt/condor-7.4.4/lib/libcondorsyscall.a /opt/condor-7.4.4/lib/libcondor_z.a /opt/condor-7.4.4/lib/libcomp_libstdc++.a /opt/condor-7.4.4/lib/libcomp_libgcc.a /opt/condor-7.4.4/lib/libcomp_libgcc_eh.a --as-needed --no-as-needed -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c /opt/condor-7.4.4/lib/libcomp_libgcc.a /opt/condor-7.4.4/lib/libcomp_libgcc_eh.a --as-needed --no-as-needed /usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtend.o /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crtn.o
/opt/condor-7.4.4/lib/libcondorsyscall.a(condor_file_agent.o): In function `CondorFileAgent::open(char const*, int, int)':
/opt/cluster/spool/condor/try01/condor-7.4.4/src/condor_ckpt/condor_file_agent.cpp:106: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
/opt/condor-7.4.4/lib/libcondorsyscall.a(special_stubs.o): In function `condor_gethostbyaddr':
/opt/cluster/spool/condor/try01/condor-7.4.4/src/condor_syscall_lib/special_stubs.cpp:201: warning: Using 'gethostbyaddr' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/opt/condor-7.4.4/lib/libcondorsyscall.a(special_stubs.o): In function `condor_gethostbyname':
/opt/cluster/spool/condor/try01/condor-7.4.4/src/condor_syscall_lib/special_stubs.cpp:194: warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/opt/condor-7.4.4/lib/libcondorsyscall.a(sock.o): In function `Sock::getportbyserv(char*)':
/opt/cluster/spool/condor/try01/condor-7.4.4/src/condor_io/sock.cpp:233: warning: Using 'getservbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking



Then, here's the run session, interrupted with SIGTSTP:

$ ./blah3 -_condor_D_ALL
User Job - $CondorPlatform: X86_64-LINUX_RHEL5 $
User Job - $CondorVersion: 7.4.4 Oct 13 2010 BuildID: 279383 $
Condor: Notice: Will checkpoint to ./blah3.ckpt
Condor: Notice: Remote system calls disabled.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 Got SIGTSTP
Saved signal state.
About to save file state
CondorFileTable::checkpoint

OPEN FILE TABLE:
fd 0
        logical name: default stdin
        offset:       0
        dups:         1
        open flags:   0x0
        not currently bound to a url.
fd 1
        logical name: default stdout
        offset:       134
        dups:         1
        open flags:   0x1
        url:          fd:1
        size:         134
        opens:        1
fd 2
        logical name: default stderr
        offset:       0
        dups:         1
        open flags:   0x1
        not currently bound to a url.
working dir = /homea/nunnari/devel/provacondor
Done saving file state
About to update MyImage
Adding a DATA segment: start[0x76c000], end [0x10111000]
Image::AddSegment: name=[DATA], start=[76c000], end=[10111000], length=[0xf9a5000], prot=[0x7fff00000000]
Adding a STACK segment: start[0x7fffb2603000], end [0x7fffb260dfff]
Image::AddSegment: name=[STACK], start=[7fffb2603000], end=[7fffb260dfff], length=[0xafff], prot=[0x7fff00000000]
Pos: 261772320
Pos: 261817375
Size of ckpt image = 261817375 bytes
About to write checkpoint
Image::Write(): fd -1 file_name ./blah3.ckpt
Checkpoint name is "./blah3.ckpt"
Tmp name is "./blah3.ckpt.tmp"
Wrote headers OK
Wrote all SegMaps OK
write(fd=3,core_loc=0x76c000,len=0xf9a5000)
I wrote 740320 bytes with write...
I wrote -1 bytes with write...
in SegMap::Write(): fd = 3, write_size=261030944
errno=14, core_loc=820be0
Write() Segment[0] of type DATA -> FAILED
errno = 14, nbytes = -1
Ckpt exit
Write failed with [-1]
Killed
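
A note on reading the failure above: on Linux, errno 14 is EFAULT ("Bad address"), which write() reports when part of the buffer it is handed is not accessible to the process. The numbers in the log fit that reading: 0x76c000 + 740320 = 0x820be0, which is exactly the core_loc printed at the failure, so the first write() stopped partway through the DATA range and the retry failed at that address. This suggests, but does not prove, that the range 0x76c000 to 0x10111000 is not readable end to end. The sketch below reproduces the same "short write, then -1 with errno 14" pattern outside Condor. It is only an illustration under assumptions: write_region and demo_protected_tail are hypothetical names (not Condor functions), a pipe stands in for the checkpoint file, and a PROT_NONE page stands in for whatever is inaccessible inside the real DATA range.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper (not part of Condor): write `len` bytes starting at
 * `buf` to `fd` in a loop, the way a checkpoint writer might. Returns 0 on
 * success, or the errno of the first failing write(). `*written` receives
 * the number of bytes that were written before the failure. */
static int write_region(int fd, const char *buf, size_t len, size_t *written)
{
    assert(buf != NULL && written != NULL);
    *written = 0;
    while (*written < len) {
        ssize_t n = write(fd, buf + *written, len - *written);
        if (n < 0)
            return errno;
        *written += (size_t)n;
    }
    return 0;
}

/* Map two pages, make the second one inaccessible, then try to write both
 * pages to a pipe. Returns the errno from write_region(), or -1 if the
 * setup itself failed. */
static int demo_protected_tail(size_t *written)
{
    long page = sysconf(_SC_PAGESIZE);
    int fds[2];
    if (page <= 0 || pipe(fds) != 0)
        return -1;

    char *buf = mmap(NULL, 2 * (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    memset(buf, 'x', 2 * (size_t)page);

    /* The tail of the range is now unreadable, like a gap in a segment. */
    int err = -1;
    if (mprotect(buf + page, (size_t)page, PROT_NONE) == 0)
        err = write_region(fds[1], buf, 2 * (size_t)page, written);

    munmap(buf, 2 * (size_t)page);
    close(fds[0]);
    close(fds[1]);
    return err;
}

#ifdef EFAULT_DEMO_MAIN
int main(void)
{
    size_t written = 0;
    int err = demo_protected_tail(&written);
    if (err < 0) {
        fprintf(stderr, "setup failed\n");
        return 1;
    }
    printf("wrote %zu bytes, then errno=%d (%s)\n",
           written, err, err ? strerror(err) : "no error");
    return 0;
}
#endif
```

To try it standalone, compile with something like gcc -DEFAULT_DEMO_MAIN and run the result. Comparing the address ranges in /proc/<pid>/maps of the real job against the DATA segment Condor reports would be the corresponding check on the actual program.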


I tried every type of install I could find, both the binary packages and a build from source, but the result is always the same.

My environment:
$ uname -rms
Linux 2.6.18-194.el5 x86_64
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.5 (Tikanga)

Thank you for your help!
Best regards.

Roberto Nunnari