
[Condor-users] Problem restarting from standalone checkpointed programs.



Hi,
I have a problem restarting programs with standalone checkpointing.
I hope this is the right forum for this type of question;
otherwise I apologize and would appreciate being pointed to an
appropriate place.

I have a workstation running 32-bit RedHat Enterprise Linux 5.
(e.g.

     sh-3.2$ uname -a
     Linux sinclaire.llnl.gov 2.6.18-128.el5 #1 SMP \
     Wed Dec 17 11:42:39EST 2008 i686 i686 i386 GNU/Linux
)

I have tried the Condor 7.4.3 RedHat binaries, and have also built
the 7.4.3 and 7.5.3 versions from source. In all cases I get a
segmentation fault when trying to restart from a checkpoint.

My code simply prints a line every second, and every five seconds
checkpoints itself. If given the command line option -load [filename],
it tries to restart from the checkpoint [filename]. (I get similar
errors trying to restart using -_condor_restart [filename]).

The code is attached as the file cotest1.c.

I am wondering whether I have made some simple error or run across a
known problem with Condor. I have not used Condor before. My intention
is to use it as a backend checkpointing system for openmpi's self
checkpointing hooks/callbacks. (Unfortunately I do not have kernel
access, so I cannot use BLCR, for which support is explicitly built
into openmpi.) Any advice or pointers to information would be valuable.

Below is a transcript of compilation, running, and restarting my code,
using the 7.5.3 version built from source.


  Thanks,

     Tomas


=== SETTING UP ENVIRONMENT AND COMPILING ===

sh-3.2$ . condor-7.5.3/condor.sh 
sh-3.2$ condor_compile gcc -g -Wall -o cotest1 cotest1.c
LINKING FOR CONDOR : /usr/bin/ld
-L/home/oppelstrup2/condor/condor-7.5.3/lib -Bstatic --eh-frame-hdr -m
elf_i386 --hash-style=gnu -dynamic-linker /lib/ld-linux.so.2 -o
cotest1 /home/oppelstrup2/condor/condor-7.5.3/lib/condor_rt0.o /usr/lib/gcc/i386-redhat-linux/4.1.2/../../../crti.o /usr/lib/gcc/i386-redhat-linux/4.1.2/crtbeginT.o -L/home/oppelstrup2/condor/condor-7.5.3/lib -L/usr/lib/gcc/i386-redhat-linux/4.1.2 -L/usr/lib/gcc/i386-redhat-linux/4.1.2 -L/usr/lib/gcc/i386-redhat-linux/4.1.2/../../.. /tmp/cc8jGcwk.o /home/oppelstrup2/condor/condor-7.5.3/lib/libcondorzsyscall.a /home/oppelstrup2/condor/condor-7.5.3/lib/libcondor_z.a /home/oppelstrup2/condor/condor-7.5.3/lib/libcomp_libstdc++.a /home/oppelstrup2/condor/condor-7.5.3/lib/libcomp_libgcc.a /home/oppelstrup2/condor/condor-7.5.3/lib/libcomp_libgcc_eh.a --as-needed --no-as-needed -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c /home/oppelstrup2/condor/condor-7.5.3/lib/libcomp_libgcc.a /home/oppelstrup2/condor/condor-7.5.3/lib/libcomp_libgcc_eh.a --as-needed --no-as-needed /usr/lib/gcc/i386-redhat-linux/4.1.2/crtend.o /usr/lib/gcc/i386-redhat-linux/4.1.2/../../../crtn.o
/home/oppelstrup2/condor/condor-7.5.3/lib/libcondorzsyscall.a(condor_file_agent.o): In function `CondorFileAgent::open(char const*, int, int)':
/home/oppelstrup2/condor/condor-7.5.3-src/src/condor_ckpt/condor_file_agent.cpp:106: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
/home/oppelstrup2/condor/condor-7.5.3/lib/libcondorzsyscall.a(special_stubs.o): In function `condor_gethostbyaddr':
/home/oppelstrup2/condor/condor-7.5.3-src/src/condor_syscall_lib/special_stubs.cpp:201: warning: Using 'gethostbyaddr' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/home/oppelstrup2/condor/condor-7.5.3/lib/libcondorzsyscall.a(special_stubs.o): In function `condor_gethostbyname':
/home/oppelstrup2/condor/condor-7.5.3-src/src/condor_syscall_lib/special_stubs.cpp:194: warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/home/oppelstrup2/condor/condor-7.5.3/lib/libcondorzsyscall.a(sock.o): In function `Sock::getportbyserv(char*)':
/home/oppelstrup2/condor/condor-7.5.3-src/src/condor_io/sock.cpp:233: warning: Using 'getservbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking


=== RUNNING PROGRAM ===

sh-3.2$ setarch i686 -R ./cotest1 -_condor_D_ALL
User Job - $CondorPlatform: I386-LINUX_RHEL5 $
User Job - $CondorVersion: 7.5.3 Sep  1 2010 $
Condor: Notice: Will checkpoint to ./cotest1.ckpt
Condor: Notice: Remote system calls disabled.
Starting countup...
This is iteration    1
This is iteration    2
This is iteration    3
This is iteration    4
This is iteration    5
Saving checkpoint to file 'ckpt.1'...
About to send CHECKPOINT signal to SELF
Got SIGUSR2
Saved signal state.
About to save file state
CondorFileTable::checkpoint

OPEN FILE TABLE:
fd 0
        logical name: default stdin
        offset:       0
        dups:         1
        open flags:   0x0
        not currently bound to a url.
fd 1
        logical name: default stdout
        offset:       173
        dups:         1
        open flags:   0x1
        url:          fd:1
        size:         173
        opens:        1
fd 2
        logical name: default stderr
        offset:       0
        dups:         1
        open flags:   0x1
        not currently bound to a url.
working dir = /home/oppelstrup2/condor
Done saving file state
About to update MyImage
Adding a DATA segment: start[0x818d000], end [0x8231000]
Image::AddSegment: name=[DATA], start=[818d000], end=[8231000],
length=[0xa4000], prot=[0x0]
Adding a STACK segment: start[0xbfff9000], end [0xbfffefff]
Image::AddSegment: name=[STACK], start=[bfff9000], end=[bfffefff],
length=[0x5fff], prot=[0x0]
Pos: 672768
Pos: 697343
Size of ckpt image = 697343 bytes
About to write checkpoint
Image::Write(): fd -1 file_name ckpt.1
Checkpoint name is "ckpt.1"
Tmp name is "ckpt.1.tmp"
Wrote headers OK
Wrote all SegMaps OK
write(fd=3,core_loc=0x818d000,len=0xa4000)
I wrote 671744 bytes with write...
Wrote Segment[0] of type DATA -> OK
write(fd=3,core_loc=0xbfff9000,len=0x5fff)
I wrote 24575 bytes with write...
Wrote Segment[1] of type STACK -> OK
Wrote all Segments OK
About to close ckpt fd (3)
Closed OK
About to rename "ckpt.1.tmp" to "ckpt.1"
Renamed OK
USER PROC: CHECKPOINT IMAGE SENT OK
Periodic Ckpt complete, doing a virtual restart...
About to restore file state
CondorFileTable::resume
working dir = /home/oppelstrup2/condor

OPEN FILE TABLE:
fd 0
        logical name: default stdin
        offset:       0
        dups:         1
        open flags:   0x0
        not currently bound to a url.
fd 1
        logical name: default stdout
        offset:       173
        dups:         1
        open flags:   0x1
        not currently bound to a url.
fd 2
        logical name: default stderr
        offset:       0
        dups:         1
        open flags:   0x1
        not currently bound to a url.
Done restoring file state
About to restore signal state
About to return to user code
This is iteration    6
This is iteration    7
This is iteration    8
This is iteration    9
This is iteration   10
Saving checkpoint to file 'ckpt.2'...
About to send CHECKPOINT signal to SELF
Got SIGUSR2
Saved signal state.
About to save file state
CondorFileTable::checkpoint

OPEN FILE TABLE:
fd 0
        logical name: default stdin
        offset:       0
        dups:         1
        open flags:   0x0
        not currently bound to a url.
fd 1
        logical name: default stdout
        offset:       326
        dups:         1
        open flags:   0x1
        url:          fd:1
        size:         326
        opens:        1
fd 2
        logical name: default stderr
        offset:       0
        dups:         1
        open flags:   0x1
        not currently bound to a url.
working dir = /home/oppelstrup2/condor
Done saving file state
About to update MyImage
Adding a DATA segment: start[0x818d000], end [0x8231000]
Image::AddSegment: name=[DATA], start=[818d000], end=[8231000],
length=[0xa4000], prot=[0x0]
Adding a STACK segment: start[0xbfff9000], end [0xbfffefff]
Image::AddSegment: name=[STACK], start=[bfff9000], end=[bfffefff],
length=[0x5fff], prot=[0x0]
Pos: 672768
Pos: 697343
Size of ckpt image = 697343 bytes
About to write checkpoint
Image::Write(): fd -1 file_name ckpt.2
Checkpoint name is "ckpt.2"
Tmp name is "ckpt.2.tmp"
Wrote headers OK
Wrote all SegMaps OK
write(fd=3,core_loc=0x818d000,len=0xa4000)
I wrote 671744 bytes with write...
Wrote Segment[0] of type DATA -> OK
write(fd=3,core_loc=0xbfff9000,len=0x5fff)
I wrote 24575 bytes with write...
Wrote Segment[1] of type STACK -> OK
Wrote all Segments OK
About to close ckpt fd (3)
Closed OK
About to rename "ckpt.2.tmp" to "ckpt.2"
Renamed OK
USER PROC: CHECKPOINT IMAGE SENT OK
Periodic Ckpt complete, doing a virtual restart...
About to restore file state
CondorFileTable::resume
working dir = /home/oppelstrup2/condor

OPEN FILE TABLE:
fd 0
        logical name: default stdin
        offset:       0
        dups:         1
        open flags:   0x0
        not currently bound to a url.
fd 1
        logical name: default stdout
        offset:       326
        dups:         1
        open flags:   0x1
        not currently bound to a url.
fd 2
        logical name: default stderr
        offset:       0
        dups:         1
        open flags:   0x1
        not currently bound to a url.
Done restoring file state
About to restore signal state
About to return to user code
This is iteration   11
This is iteration   12
This is iteration   13

sh-3.2$ ls -ltr
total 394416
[ --- some skipped files --- ]
-rwxr-xr-x  1 oppelstrup2 8980   4829542 Sep  2 13:24 cotest1
-rw-r--r--  1 oppelstrup2 8980    697343 Sep  2 13:25 ckpt.1
-rw-r--r--  1 oppelstrup2 8980    697343 Sep  2 13:26 ckpt.2


=== RESTARTING PROGRAM ===

sh-3.2$ setarch i686 -R ./cotest1 -_condor_D_ALL -load ckpt.1
User Job - $CondorPlatform: I386-LINUX_RHEL5 $
User Job - $CondorVersion: 7.5.3 Sep  1 2010 $
Condor: Notice: Will checkpoint to ./cotest1.ckpt
Condor: Notice: Remote system calls disabled.
Loading checkpoint 'ckpt.1' and restarting...
Read headers OK
Read SegMap[0](DATA) OK
Read SegMap[1](STACK) OK
Read all SegMaps OK
Restoring a DATA segment
Found a DATA block, increasing heap from 0x8231000 to 0x8231000
About to overwrite 671744 bytes starting at 0x818d000(DATA)
About to execute on TmpStk
About to execute on tmpstack.
Beginning Execution on TmpStack.
RestoreStack() Entrance!
Restoring a STACK segment
About to overwrite 24575 bytes starting at 0xbfff9000(STACK)
RestoreStack() Exit!
About to restore file state
CondorFileTable::resume
working dir = /home/oppelstrup2/condor

OPEN FILE TABLE:
fd 0
        logical name: default stdin
        offset:       0
        dups:         1
        open flags:   0x0
        not currently bound to a url.
fd 1
        logical name: default stdout
        offset:       173
        dups:         1
        open flags:   0x1
        not currently bound to a url.
fd 2
        logical name: default stderr
        offset:       0
        dups:         1
        open flags:   0x1
        not currently bound to a url.
Done restoring file state
About to restore signal state
About to return to user code
Segmentation fault
sh-3.2$ 

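/*
 * cotest1.c: print one line per second and write a standalone Condor
 * checkpoint every five iterations. With "-load <file>", restart from
 * the given checkpoint instead.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Standalone checkpointing entry points from the Condor library that
   condor_compile links in. */
void init_image_with_file_name(char *);
void ckpt(void);
void restart(void);

/* Checkpoint compression flag; set to 0 below to keep the image
   uncompressed. */
extern int condor_compress_ckpt;

int main(int argc, char *argv[]) {
  if(argc == 3 && strcmp(argv[1],"-load") == 0) {
    /* Restart path: restart() is not expected to return. */
    printf("Loading checkpoint '%s' and restarting...\n", argv[2]);
    init_image_with_file_name(argv[2]);
    condor_compress_ckpt = 0;
    restart();
    printf("It should not be possible to get here...\n");
  } else {
    int i = 0,interval = 5;
    printf("Starting countup...\n");
    /* Count up once per second until interrupted. */
    while(i+1 > 0) {
      i = i+1;
      sleep(1);
      printf("This is iteration %4d\n",i);

      if(i % interval == 0) {
        /* Every 'interval' iterations, checkpoint to ckpt.<n>. */
        char cpname[80];
        snprintf(cpname,sizeof cpname,"ckpt.%d",i/interval);
        printf("Saving checkpoint to file '%s'...\n",cpname);
        init_image_with_file_name(cpname);
        condor_compress_ckpt = 0;
        ckpt();
      }
    }
  }
  return 0;
}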