
[condor-users] restart after checkpointing



Hi,

I'm trying to make sure that our Condor pool is correctly restarting code
after checkpointing. We have set up a checkpoint server on each node of the
Beowulf cluster, and it looks like when code is preempted it correctly goes to
the checkpoint server.

However, I can't tell from the StarterLog whether it is correctly restarting
the checkpointed code. I have attached a log file with an example. It looks
like the node is correctly starting the checkpointed version, but the one
thing that confused me was the execve() message. It seems that the
checkpointed image is being started with the full command-line options rather
than with the -_condor_restart option. (That is how I tested with standalone
checkpointing.)
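For reference, this is the standalone test I mean (a sketch; the program and
checkpoint file names are just placeholders):

```shell
# Link the program against the Condor checkpointing library
# (condor_compile wraps the normal compile/link step).
condor_compile gcc -o myprog myprog.c

# First run: tell the job where to write its checkpoint image.
./myprog -_condor_ckpt myprog.ckpt &

# SIGTSTP asks the job to checkpoint itself and exit.
kill -TSTP %1

# Restart from the checkpoint image with -_condor_restart.
./myprog -_condor_restart myprog.ckpt
```

In the attached log, by contrast, the restart request seems to be delivered
over the pipe named by -_condor_cmd_fd (note the "restart"/"end" lines after
the execve()), which may be why the full original argument list appears on the
command line instead.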

Cheers,
Duncan.

-- 
Duncan Brown                                  University of Wisconsin-Milwaukee
duncan@xxxxxxxxxxxxxxxxxxxx                 Physics Department, 1900 E. Kenwood
http://www.lsc-group.phys.uwm.edu/~duncan              Milwaukee, WI 53211, USA
3/14 13:50:07 ********** STARTER starting up ***********
3/14 13:50:07 ** $CondorVersion: 6.6.1 Feb  5 2004 $
3/14 13:50:07 ** $CondorPlatform: I386-LINUX-RH9 $
3/14 13:50:07 ******************************************
3/14 13:50:07 Submitting machine is "ldas-gridi.ldas-cit"
3/14 13:50:07 EventHandler {
3/14 13:50:07 	func = 0x806db6e
3/14 13:50:07 	mask = SIGALRM SIGHUP SIGINT SIGUSR1 SIGUSR2 SIGCHLD SIGTSTP 
3/14 13:50:07 }
3/14 13:50:07 Done setting resource limits
3/14 13:50:07 	*FSM* Transitioning to state "GET_PROC"
3/14 13:50:07 	*FSM* Executing state func "get_proc()" [  ]
3/14 13:50:07 Entering get_proc()
3/14 13:50:07 Entering get_job_info()
3/14 13:50:07 Startup Info:
3/14 13:50:07 	Version Number: 1
3/14 13:50:07 	Id: 55159.0
3/14 13:50:07 	JobClass: STANDARD
3/14 13:50:07 	Uid: 620
3/14 13:50:07 	Gid: 13
3/14 13:50:07 	VirtPid: -1
3/14 13:50:07 	SoftKillSignal: 20
3/14 13:50:07 	Cmd: "/dso-test/duncan/macho/2004031204_l1_playground/lalapps_inspiral"
3/14 13:50:07 	Args: "--inverse-spec-length 16 --segment-length 1048576 --low-frequency-cutoff 100.0 --pad-data 8 --enable-high-pass 100.0 --bank-file L1-TMPLTBANK-730724450-2048.xml --sample-rate 4096 --chisq-threshold 5.0 --enable-event-cluster --calibration-cache cache_files/L1-CAL-V03-729273600-734367600.cache --high-pass-order 8 --gps-end-time 730726498 --channel-name L1:LSC-AS_Q --segment-overlap 524288 --snr-threshold 6.0 --frame-cache cache/L-730722542-730726506.cache --number-of-segments 15 --trig-start-time 730724534 --dynamic-range-exponent 69.0 --minimal-match 0.9 --approximant TaylorF2 --debug-level 33 --gps-start-time 730724450 --resample-filter ldas --enable-output --spectrum-type median --high-pass-attenuation 0.1 --chisq-bins 15"
3/14 13:50:07 	Env: "KMP_LIBRARY=serial;MKL_SERIAL=yes"
3/14 13:50:07 	Iwd: "/dso-test/duncan/macho/2004031204_l1_playground"
3/14 13:50:07 	Ckpt Wanted: TRUE
3/14 13:50:07 	Is Restart: TRUE
3/14 13:50:07 	Core Limit Valid: TRUE
3/14 13:50:07 	Coredump Limit 0
3/14 13:50:07 User uid set to 620
3/14 13:50:07 User uid set to 13
3/14 13:50:07 User Process 55159.0 {
3/14 13:50:07   cmd = /dso-test/duncan/macho/2004031204_l1_playground/lalapps_inspiral
3/14 13:50:07   args = --inverse-spec-length 16 --segment-length 1048576 --low-frequency-cutoff 100.0 --pad-data 8 --enable-high-pass 100.0 --bank-file L1-TMPLTBANK-730724450-2048.xml --sample-rate 4096 --chisq-threshold 5.0 --enable-event-cluster --calibration-cache cache_files/L1-CAL-V03-729273600-734367600.cache --high-pass-order 8 --gps-end-time 730726498 --channel-name L1:LSC-AS_Q --segment-overlap 524288 --snr-threshold 6.0 --frame-cache cache/L-730722542-730726506.cache --number-of-segments 15 --trig-start-time 730724534 --dynamic-range-exponent 69.0 --minimal-match 0.9 --approximant TaylorF2 --debug-level 33 --gps-start-time 730724450 --resample-filter ldas --enable-output --spectrum-type median --high-pass-attenuation 0.1 --chisq-bins 15
3/14 13:50:07   env = KMP_LIBRARY=serial;MKL_SERIAL=yes
3/14 13:50:07   local_dir = dir_3704
3/14 13:50:07   cur_ckpt = dir_3704/condor_exec.55159.0
3/14 13:50:07   core_name = dir_3704/core
3/14 13:50:07   uid = 620, gid = 13
3/14 13:50:07   v_pid = -1
3/14 13:50:07   pid = (NOT CURRENTLY EXECUTING)
3/14 13:50:07   exit_status_valid = FALSE
3/14 13:50:07   exit_status = (NEVER BEEN EXECUTED)
3/14 13:50:07   ckpt_wanted = TRUE
3/14 13:50:07   coredump_limit_exists = TRUE
3/14 13:50:07   coredump_limit = 0
3/14 13:50:07   soft_kill_sig = 20
3/14 13:50:07   job_class = STANDARD
3/14 13:50:07   state = NEW
3/14 13:50:07   new_ckpt_created = FALSE
3/14 13:50:07   ckpt_transferred = FALSE
3/14 13:50:07   core_created = FALSE
3/14 13:50:07   core_transferred = FALSE
3/14 13:50:07   exit_requested = FALSE
3/14 13:50:07   image_size = -1 blocks
3/14 13:50:07   user_time = 0
3/14 13:50:07   sys_time = 0
3/14 13:50:07   guaranteed_user_time = 0
3/14 13:50:07   guaranteed_sys_time = 0
3/14 13:50:07 }
3/14 13:50:07 	*FSM* Transitioning to state "GET_EXEC"
3/14 13:50:07 	*FSM* Executing state func "get_exec()" [ SUSPEND VACATE DIE  ]
3/14 13:50:07 Entering get_exec()
3/14 13:50:07 Executable is located on submitting host
3/14 13:50:07 Expanded executable name is "/usr1/condor-logs/spool/cluster55159.ickpt.subproc0"
3/14 13:50:07 Going to try 3 attempts at getting the inital executable
3/14 13:50:07 Entering get_file( /usr1/condor-logs/spool/cluster55159.ickpt.subproc0, dir_3704/condor_exec.55159.0, 0755 )
3/14 13:50:07 Opened "/usr1/condor-logs/spool/cluster55159.ickpt.subproc0" via file stream
3/14 13:50:07 Get_file() transferred 9550472 bytes, 45182396 bytes/second
3/14 13:50:07 Fetched orig ckpt file "/usr1/condor-logs/spool/cluster55159.ickpt.subproc0" into "dir_3704/condor_exec.55159.0" with 1 attempt
3/14 13:50:07 Executable 'dir_3704/condor_exec.55159.0' is linked with "$CondorVersion: 6.6.1 Feb  5 2004 $" on a "$CondorPlatform: I386-LINUX-RH72 $"
3/14 13:50:07 	*FSM* Executing transition function "spawn_all"
3/14 13:50:07 Pipe built
3/14 13:50:07 New pipe_fds[14,1]
3/14 13:50:07 cmd_fd = 14
3/14 13:50:07 Calling execve( "/opt/condor-6.6.1/local/execute/dir_3704/condor_exec.55159.0", "condor_exec.55159.0", "-_condor_cmd_fd", "14", "--inverse-spec-length", "16", "--segment-length", "1048576", "--low-frequency-cutoff", "100.0", "--pad-data", "8", "--enable-high-pass", "100.0", "--bank-file", "L1-TMPLTBANK-730724450-2048.xml", "--sample-rate", "4096", "--chisq-threshold", "5.0", "--enable-event-cluster", "--calibration-cache", "cache_files/L1-CAL-V03-729273600-734367600.cache", "--high-pass-order", "8", "--gps-end-time", "730726498", "--channel-name", "L1:LSC-AS_Q", "--segment-overlap", "524288", "--snr-threshold", "6.0", "--frame-cache", "cache/L-730722542-730726506.cache", "--number-of-segments", "15", "--trig-start-time", "730724534", "--dynamic-range-exponent", "69.0", "--minimal-match", "0.9", "--approximant", "TaylorF2", "--debug-level", "33", "--gps-start-time", "730724450", "--resample-filter", "ldas", "--enable-output", "--spectrum-type", "median", "--high-pass-attenuation", "0.1", "--chisq-bins", "15", 0, "KMP_LIBRARY=serial", "MKL_SERIAL=yes", "CONDOR_VM=vm2", "CONDOR_SCRATCH_DIR=/opt/condor-6.6.1/local/execute/dir_3704", 0 )
3/14 13:50:07 Started user job - PID = 3705
3/14 13:50:07 cmd_fp = 0x8361a90
3/14 13:50:07 restart
3/14 13:50:07 end
3/14 13:50:07 	*FSM* Transitioning to state "SUPERVISE"
3/14 13:50:07 	*FSM* Executing state func "supervise_all()" [ GET_NEW_PROC SUSPEND VACATE ALARM DIE CHILD_EXIT PERIODIC_CKPT  ]