
[Condor-users] Problems with checkpointing



Hello

I am new to Condor and have a problem. I have searched everywhere for a solution, but I found no hints about what could be wrong.

I am trying to submit the loop.remote program as a Condor job with an argument of 1000 (which makes it run for about 35 seconds). While it is executing, I run condor_vacate -all to vacate the job (checkpoint it and then terminate it), but it never gets checkpointed.
I am running Condor on 3 Fedora Core 5 machines.

Below I have pasted some log files and config files. I hope someone can help me find the problem.

Here is the ShadowLog from the submitting machine:
6/26 09:34:35 ENABLE_RUNTIME_CONFIG is undefined, using default value of False
6/26 09:34:35 ENABLE_PERSISTENT_CONFIG is undefined, using default value of False
6/26 09:34:35 PASSWD_CACHE_REFRESH is undefined, using default value of 300

6/26 09:34:35 (?.?) (9004):******* Standard Shadow starting up *******
6/26 09:34:35 (?.?) (9004):** $CondorVersion: 6.6.11 Mar 23 2006 $
6/26 09:34:35 (?.?) (9004):** $CondorPlatform: I386-LINUX_RH9 $
6/26 09:34:35 (?.?) (9004):*******************************************
6/26 09:34:35 (?.?) (9004):*** Reserved Swap = 0
6/26 09:34:35 (?.?) (9004):*** Free Swap = 0
6/26 09:34:35 (?.?) (9004):uid=0, euid=502, gid=0, egid=502
6/26 09:34:35 (?.?) (9004):argc = 6
6/26 09:34:35 (?.?) (9004):argv[0] = condor_shadow
6/26 09:34:35 (?.?) (9004):argv[1] = <192.168.0.209:37956>
6/26 09:34:35 (?.?) (9004):argv[2] = <192.168.0.210:53866>
6/26 09:34:35 (?.?) (9004):argv[3] = <192.168.0.210:53866>#1920571837
6/26 09:34:35 (?.?) (9004):argv[4] = 48
6/26 09:34:35 (?.?) (9004):argv[5] = 0
6/26 09:34:35 (?.?) (9004):Hostname = "<192.168.0.210:53866>", Job = 48.0
6/26 09:34:35 (48.0) (9004):SHADOW_TIMEOUT_MULTIPLIER is undefined, using default value of 0
6/26 09:34:35 (48.0) (9004):Shadow: Entering send_job()
6/26 09:34:35 (48.0) (9004):SHADOW_TIMEOUT_MULTIPLIER is undefined, using default value of 0
6/26 09:34:35 (48.0) (9004):send capability <192.168.0.210:53866>#1920571837
6/26 09:34:35 (48.0) (9004):Requesting Primary Starter
6/26 09:34:35 (48.0) (9004):Shadow: Request to run a job was ACCEPTED
6/26 09:34:35 (48.0) (9004):host = dipres-dom2.dpnet inet_addr = 0xd200a8c0 port1 = 33901 port2 = 55713
6/26 09:34:35 (48.0) (9004):Shadow: RSC_SOCK connected, fd = 17
6/26 09:34:35 (48.0) (9004):Shadow: CLIENT_LOG connected, fd = 18
6/26 09:34:35 (48.0) (9004):ENABLE_USERLOG_LOCKING is undefined, using default value of True
6/26 09:34:35 (48.0) (9004):UserLog = /home/dipres/examples/loop.log
6/26 09:34:35 (48.0) (9004):My_Filesystem_Domain = "dipres-dom1.dpnet"
6/26 09:34:35 (48.0) (9004):My_UID_Domain = "dipres-dom1.dpnet"
6/26 09:34:35 (48.0) (9004):HandleSyscalls: about to chdir(/home/dipres/examples)
6/26 09:34:35 (48.0) (9004):    Entering pseudo_get_file_stream
6/26 09:34:35 (48.0) (9004): file = "/home/condor/spool/cluster48.ickpt.subproc0"
6/26 09:34:35 (48.0) (9004):    len = 12304037
6/26 09:34:35 (48.0) (9004):     Weird 0xc0a800d1
6/26 09:34:35 (48.0) (9004):     Weird 0xc0a800d1
6/26 09:34:35 (48.0) (9004):    connect_sock = 3
6/26 09:34:35 (48.0) (9004):    Listening...
6/26 09:34:35 (48.0) (9004):    Port = 47024
6/26 09:34:35 (48.0) (9005):    Got data connection at fd 4
6/26 09:34:35 (48.0) (9005):    Should Send 12304037 bytes of data
6/26 09:34:36 (48.0) (9005): Child Shadow: STREAM FILE XFER COMPLETE - 12304037 bytes
6/26 09:34:36 (48.0) (9004):Reaped child status - pid 9005 exited with status 0
6/26 09:34:36 (48.0) (9004):Read: User Job - $CondorPlatform: I386-LINUX_RH9 $
6/26 09:34:36 (48.0) (9004):Read: User Job - $CondorVersion: 6.6.11 Mar 23 2006 $
6/26 09:34:36 (48.0) (9004):User job is compatible with this shadow version
6/26 09:34:36 (48.0) (9004):Read: Checkpoint file name is "/home/condor/spool/cluster48.proc0.subproc0"
6/26 09:35:00 (48.0) (9004):Read: Got SIGTSTP
6/26 09:35:00 (48.0) (9004):Read: Saved signal state.
6/26 09:35:00 (48.0) (9004):Read: About to save file state
6/26 09:35:00 (48.0) (9004):Read: CondorFileTable::checkpoint
6/26 09:35:00 (48.0) (9004):Read: OPEN FILE TABLE:
6/26 09:35:00 (48.0) (9004):Read: fd 0
6/26 09:35:00 (48.0) (9004):Read:       logical name: /dev/null
6/26 09:35:00 (48.0) (9004):Read:       offset:       0
6/26 09:35:00 (48.0) (9004):Read:       dups:         1
6/26 09:35:00 (48.0) (9004):Read:       open flags:   0x0
6/26 09:35:00 (48.0) (9004):Read:       url:          local:/dev/null
6/26 09:35:00 (48.0) (9004):Read:       size:         0
6/26 09:35:00 (48.0) (9004):Read:       opens:        1
6/26 09:35:00 (48.0) (9004):Read: fd 1
6/26 09:35:00 (48.0) (9004):Read: logical name: /home/dipres/examples/loop.out
6/26 09:35:00 (48.0) (9004):Read:       offset:       2614
6/26 09:35:00 (48.0) (9004):Read:       dups:         1
6/26 09:35:00 (48.0) (9004):Read:       open flags:   0x1
6/26 09:35:00 (48.0) (9004):Read: url: buffer:remote:/home/dipres/examples/loop.out
6/26 09:35:00 (48.0) (9004):Read:       size:         2614
6/26 09:35:00 (48.0) (9004):Read:       opens:        1
6/26 09:35:00 (48.0) (9004):Read: fd 2
6/26 09:35:00 (48.0) (9004):Read: logical name: /home/dipres/examples/loop.err
6/26 09:35:00 (48.0) (9004):Read:       offset:       0
6/26 09:35:00 (48.0) (9004):Read:       dups:         1
6/26 09:35:00 (48.0) (9004):Read:       open flags:   0x1
6/26 09:35:00 (48.0) (9004):Read: url: buffer:remote:/home/dipres/examples/loop.err
6/26 09:35:00 (48.0) (9004):Read:       size:         0
6/26 09:35:00 (48.0) (9004):Read:       opens:        1
6/26 09:35:00 (48.0) (9004):Read: working dir = /home/dipres/examples
6/26 09:35:00 (48.0) (9004):Read: Done saving file state
6/26 09:35:00 (48.0) (9004):Read: About to update MyImage
6/26 09:35:00 (48.0) (9004):Read: Size of ckpt image = 20255743 bytes
6/26 09:35:00 (48.0) (9004):Read: About to write checkpoint
6/26 09:35:00 (48.0) (9004):Read: Image::Write(): fd -1 file_name /home/condor/spool/cluster48.proc0.subproc0
6/26 09:35:00 (48.0) (9004):Read: Checkpoint name is "/home/condor/spool/cluster48.proc0.subproc0"
6/26 09:35:00 (48.0) (9004):Read: Tmp name is "/home/condor/spool/cluster48.proc0.subproc0.tmp"
6/26 09:35:00 (48.0) (9004):    Entering pseudo_put_file_stream
6/26 09:35:00 (48.0) (9004): file = "/home/condor/spool/cluster48.proc0.subproc0.tmp"
6/26 09:35:00 (48.0) (9004):    len = 20255743
6/26 09:35:00 (48.0) (9004):    owner = dipres
6/26 09:35:00 (48.0) (9004):     Weird 0xc0a800d1
6/26 09:35:00 (48.0) (9004):     Weird 0xc0a800d1
6/26 09:35:00 (48.0) (9004):     Weird 0xc0a800d1
6/26 09:35:00 (48.0) (9004):     Weird 0xc0a800d1
6/26 09:35:00 (48.0) (9004):    connect_sock = 3
6/26 09:35:00 (48.0) (9004):    Listening...
6/26 09:35:00 (48.0) (9004):    Port = 45095
6/26 09:35:00 (48.0) (9004):Read: Opened "/home/condor/spool/cluster48.proc0.subproc0.tmp" via file stream
6/26 09:35:00 (48.0) (9004):Read: Wrote headers OK
6/26 09:35:00 (48.0) (9004):Read: Wrote all SegMaps OK
6/26 09:35:00 (48.0) (9004):Read: write(fd=3,core_loc=0x8182000,len=0x134a000)
6/26 09:35:00 (48.0) (9012):    Got data connection at fd 4
6/26 09:35:00 (48.0) (9004):Read: in SegMap::Write(): fd = 3, write_size=19591824
6/26 09:35:00 (48.0) (9004):Read: errno=14, core_loc=821cd70
6/26 09:35:00 (48.0) (9004):Read: Write() Segment[0] of type DATA -> FAILED
6/26 09:35:00 (48.0) (9004):Read: errno = 14, nbytes = -1
6/26 09:35:00 (48.0) (9004):Read: Ckpt exit
6/26 09:35:00 (48.0) (9004):Read: Write failed with [-1]
6/26 09:35:00 (48.0) (9004):SHADOW_TIMEOUT_MULTIPLIER is undefined, using default value of 0
6/26 09:35:00 (48.0) (9004):SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
6/26 09:35:00 (48.0) (9004):send capability <192.168.0.210:53866>#1920571837
6/26 09:35:00 (48.0) (9004):Sent command 404 to startd at <192.168.0.210:53866> with cap <192.168.0.210:53866>#1920571837
6/26 09:35:00 (48.0) (9004):Shadow: Job 48.0 exited, termsig = 9, coredump = 0, retcode = 0
6/26 09:35:00 (48.0) (9004):Entering Wrapup()
6/26 09:35:00 (48.0) (9004):handle_termination() called.
6/26 09:35:00 (48.0) (9004):Shadow: Job was kicked off without a checkpoint
6/26 09:35:00 (48.0) (9004):Shadow: Entered DoCleanup()
6/26 09:35:00 (48.0) (9004):Shadow: DoCleanup: unlinking TmpCkpt '/home/condor/spool/cluster48.proc0.subproc0.tmp'
6/26 09:35:00 (48.0) (9004):Trying to unlink /home/condor/spool/cluster48.proc0.subproc0.tmp
6/26 09:35:01 (48.0) (9012):    STREAM FILE RECEIVED OK (-1 bytes)
6/26 09:35:01 (48.0) (9004):user_time = 0 ticks
6/26 09:35:01 (48.0) (9004):sys_time = 2 ticks
6/26 09:35:01 (48.0) (9004):SHADOW_TIMEOUT_MULTIPLIER is undefined, using default value of 0
6/26 09:35:01 (48.0) (9004):SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
6/26 09:35:01 (48.0) (9004):AUTHENTICATE_FS: used file /tmp/qmgr_m8ztJk, status: 1
6/26 09:35:01 (48.0) (9004):Entering update_rusage()
6/26 09:35:01 (48.0) (9004):Entering update_rusage()
6/26 09:35:01 (48.0) (9004):TIME DEBUG 3 USR remotep=0 Proc=0 utime=0.000000
6/26 09:35:01 (48.0) (9004):TIME DEBUG 4 SYS remotep=0 Proc=0 utime=0.000000
6/26 09:35:01 (48.0) (9004):********** Shadow Exiting(107) **********
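
A note for reading the ShadowLog above: the raw numbers around the failed write can be decoded with a few lines of Python. This is only a reading aid, not a diagnosis. It assumes the usual Linux errno numbering and that the hexadecimal values in the log are plain IPv4 addresses and memory offsets; the constant and function names are made up for this sketch and do not come from Condor.

```python
import errno
import os
import socket
import struct

# Values copied from the ShadowLog (names are illustrative only).
ERRNO_FROM_LOG = 14        # "errno = 14, nbytes = -1"
SEGMENT_START = 0x8182000  # core_loc in the write(fd=3, ...) line
SEGMENT_LEN = 0x134A000    # len in the write(fd=3, ...) line
FAILED_AT = 0x821CD70      # core_loc in the "errno=14" line
WEIRD_ADDR = 0xC0A800D1    # value printed on the "Weird" lines
INET_ADDR = 0xD200A8C0     # inet_addr printed for the execute host


def errno_name(code):
    """Return the symbolic name and the system message for an errno value."""
    return errno.errorcode.get(code, "unknown"), os.strerror(code)


def dotted(value, byte_order):
    """Render a 32-bit value as a dotted quad; byte_order is '>' or '<'."""
    return socket.inet_ntoa(struct.pack(byte_order + "I", value))


def offset_of_failure():
    """Distance in bytes from the start of the segment to the failing address."""
    return FAILED_AT - SEGMENT_START


if __name__ == "__main__":
    print("errno 14:", errno_name(ERRNO_FROM_LOG))
    print("Weird value as address:", dotted(WEIRD_ADDR, ">"))
    print("inet_addr as address:", dotted(INET_ADDR, "<"))
    print("offset of failure:", offset_of_failure())
    print("bytes remaining:", SEGMENT_LEN - offset_of_failure())
```

Under these assumptions, errno 14 is EFAULT (bad address), the "Weird" value corresponds to 192.168.0.209, and inet_addr corresponds to 192.168.0.210. The failing address lies 634224 bytes into the DATA segment, and the remainder equals the logged write_size of 19591824, which suggests the write had already made some progress before it hit a bad address.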


condor_config.local:
COLLECTOR_NAME =
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
SUSPEND = False
LOCK = /tmp/condor-lock.$(HOSTNAME)0.986046526421891
JAVA_MAXHEAP_ARGUMENT =
CONDOR_ADMIN = root@xxxxxxxxxxxxxxxxx
START = True
MAIL = /bin/mailx
RELEASE_DIR = /usr/local
DAEMON_LIST = MASTER,COLLECTOR,NEGOTIATOR,SCHEDD,STARTD
COLLECTOR = $(SBIN)/condor_collector
PREEMPT = False
UID_DOMAIN = $(FULL_HOSTNAME)
NEGOTIATOR = $(SBIN)/condor_negotiator
JAVA = /usr/bin/java
VACATE =
CONDOR_HOST = dipres-dom1.dpnet
CONDOR_IDS = 502.502
LOCAL_DIR = /home/condor
MEMORY = 128
RESERVED_SWAP = 0
ALL_DEBUG = D_FULLDEBUG
NEGOTIATOR_IGNORE_USER_PRIORITIES = True


condor_config:
RELEASE_DIR             = /usr/local/condor
LOCAL_DIR               = $(TILDE)
LOCAL_CONFIG_FILE = /home/condor/condor_config.local
CONDOR_ADMIN            = condor-admin@xxxxxxxxxxx
MAIL                    = /usr/bin/mail
UID_DOMAIN              = your.domain
FILESYSTEM_DOMAIN       = your.domain
FLOCK_FROM =
FLOCK_TO =
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST)
HOSTALLOW_OWNER = $(FULL_HOSTNAME), $(HOSTALLOW_ADMINISTRATOR)
HOSTALLOW_READ = *
HOSTALLOW_WRITE = *
HOSTALLOW_NEGOTIATOR = $(NEGOTIATOR_HOST)
HOSTALLOW_NEGOTIATOR_SCHEDD = $(NEGOTIATOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
HOSTALLOW_WRITE_COLLECTOR = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_WRITE_STARTD    = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_READ_COLLECTOR  = $(HOSTALLOW_READ), $(FLOCK_FROM)
HOSTALLOW_READ_STARTD     = $(HOSTALLOW_READ), $(FLOCK_FROM)
LOCK            = $(LOG)
GLIDEIN_SERVER_NAME = gridftp.cs.wisc.edu
GLIDEIN_SERVER_DIR = /p/condor/public/binaries/glidein
ALL_DEBUG               =
MAX_COLLECTOR_LOG       = 1000000
COLLECTOR_DEBUG         =
MAX_KBDD_LOG            = 1000000
KBDD_DEBUG              =
MAX_NEGOTIATOR_LOG      = 1000000
NEGOTIATOR_DEBUG        = D_MATCH
MAX_NEGOTIATOR_MATCH_LOG = 1000000
MAX_SCHEDD_LOG          = 1000000
SCHEDD_DEBUG            = D_COMMAND
MAX_SHADOW_LOG          = 1000000
SHADOW_DEBUG            =
MAX_STARTD_LOG          = 1000000
STARTD_DEBUG            = D_COMMAND
MAX_STARTER_LOG         = 1000000
STARTER_DEBUG           = D_NODATE
MAX_MASTER_LOG          = 1000000
MASTER_DEBUG            = D_COMMAND
MINUTE          = 60
HOUR            = (60 * $(MINUTE))
StateTimer      = (CurrentTime - EnteredCurrentState)
ActivityTimer   = (CurrentTime - EnteredCurrentActivity)
ActivationTimer = (CurrentTime - JobStart)
LastCkpt        = (CurrentTime - LastPeriodicCheckpoint)
STANDARD        = 1
PVM             = 4
VANILLA         = 5
MPI             = 8
IsPVM           = (TARGET.JobUniverse == $(PVM))
IsMPI           = (TARGET.JobUniverse == $(MPI))
IsVanilla       = (TARGET.JobUniverse == $(VANILLA))
IsStandard      = (TARGET.JobUniverse == $(STANDARD))
SmallJob        = (TARGET.ImageSize <  (15 * 1024))
NonCondorLoadAvg        = (LoadAvg - CondorLoadAvg)
BackgroundLoad          = 0.3
HighLoad                = 0.5
StartIdleTime           = 15 * $(MINUTE)
ContinueIdleTime        =  5 * $(MINUTE)
MaxSuspendTime          = 10 * $(MINUTE)
MaxVacateTime           = 10 * $(MINUTE)
KeyboardBusy            = (KeyboardIdle < $(MINUTE))
ConsoleBusy             = (ConsoleIdle  < $(MINUTE))
CPUIdle                 = ($(NonCondorLoadAvg) <= $(BackgroundLoad))
CPUBusy                 = ($(NonCondorLoadAvg) >= $(HighLoad))
KeyboardNotBusy         = ($(KeyboardBusy) == False)
BigJob          = (TARGET.ImageSize >= (50 * 1024))
MediumJob = (TARGET.ImageSize >= (15 * 1024) && TARGET.ImageSize < (50 * 1024))
SmallJob        = (TARGET.ImageSize <  (15 * 1024))
JustCPU                 = ($(CPUBusy) && ($(KeyboardBusy) == False))
MachineBusy             = ($(CPUBusy) || $(KeyboardBusy))
WANT_SUSPEND            = $(UWCS_WANT_SUSPEND)
WANT_VACATE             = $(UWCS_WANT_VACATE)
START                   = $(UWCS_START)
SUSPEND                 = $(UWCS_SUSPEND)
CONTINUE                = $(UWCS_CONTINUE)
PREEMPT                 = $(UWCS_PREEMPT)
KILL                    = $(UWCS_KILL)
PERIODIC_CHECKPOINT     = $(UWCS_PERIODIC_CHECKPOINT)
PREEMPTION_REQUIREMENTS = $(UWCS_PREEMPTION_REQUIREMENTS)
PREEMPTION_RANK         = $(UWCS_PREEMPTION_RANK)
UWCS_WANT_SUSPEND       = ( $(SmallJob) || $(KeyboardNotBusy) \
UWCS_WANT_VACATE        = ( $(ActivationTimer) > 10 * $(MINUTE) \
UWCS_START      = ( (KeyboardIdle > $(StartIdleTime)) \
UWCS_SUSPEND = ( $(KeyboardBusy) || \
UWCS_CONTINUE = ( $(CPUIdle) && ($(ActivityTimer) > 10) \
UWCS_PREEMPT = ( ((Activity == "Suspended") && \
UWCS_KILL = $(ActivityTimer) > $(MaxVacateTime)
UWCS_PERIODIC_CHECKPOINT        = $(LastCkpt) > $(MINUTE)
UWCS_PREEMPTION_REQUIREMENTS = $(StateTimer) > (1 * $(HOUR)) && RemoteUserPrio > SubmittorPrio * 1.2
UWCS_PREEMPTION_RANK = (RemoteUserPrio * 1000000) - TARGET.ImageSize
TESTINGMODE_WANT_SUSPEND        = False
TESTINGMODE_WANT_VACATE         = False
TESTINGMODE_START               = True
TESTINGMODE_SUSPEND             = False
TESTINGMODE_CONTINUE            = True
TESTINGMODE_PREEMPT             = False
TESTINGMODE_KILL                = False
TESTINGMODE_PERIODIC_CHECKPOINT = False
TESTINGMODE_PREEMPTION_REQUIREMENTS = False
TESTINGMODE_PREEMPTION_RANK = 0
LOG             = $(LOCAL_DIR)/log
SPOOL           = $(LOCAL_DIR)/spool
EXECUTE         = $(LOCAL_DIR)/execute
BIN             = $(RELEASE_DIR)/bin
LIB             = $(RELEASE_DIR)/lib
SBIN            = $(RELEASE_DIR)/sbin
HISTORY         = $(SPOOL)/history
COLLECTOR_LOG   = $(LOG)/CollectorLog
KBDD_LOG        = $(LOG)/KbdLog
MASTER_LOG      = $(LOG)/MasterLog
NEGOTIATOR_LOG  = $(LOG)/NegotiatorLog
NEGOTIATOR_MATCH_LOG = $(LOG)/MatchLog
SCHEDD_LOG      = $(LOG)/SchedLog
SHADOW_LOG      = $(LOG)/ShadowLog
STARTD_LOG      = $(LOG)/StartLog
STARTER_LOG     = $(LOG)/StarterLog
SHADOW_LOCK     = $(LOCK)/ShadowLock
COLLECTOR_HOST  = $(CONDOR_HOST)
NEGOTIATOR_HOST = $(CONDOR_HOST)
RESERVED_DISK           = 5
DAEMON_LIST                     = MASTER, STARTD, SCHEDD
MASTER                          = $(SBIN)/condor_master
STARTD                          = $(SBIN)/condor_startd
SCHEDD                          = $(SBIN)/condor_schedd
KBDD                            = $(SBIN)/condor_kbdd
NEGOTIATOR                      = $(SBIN)/condor_negotiator
COLLECTOR                       = $(SBIN)/condor_collector
GRID_MONITOR                    = $(SBIN)/grid_monitor.sh
MASTER_ADDRESS_FILE = $(LOG)/.master_address
PREEN                           = $(SBIN)/condor_preen
PREEN_ARGS                      = -m -r
STARTER_LIST = STARTER, STARTER_PVM, STARTER_STANDARD
STARTER                 = $(SBIN)/condor_starter
STARTER_PVM             = $(SBIN)/condor_starter.pvm
STARTER_STANDARD        = $(SBIN)/condor_starter.std
STARTD_ADDRESS_FILE     = $(LOG)/.startd_address
BenchmarkTimer = (CurrentTime - LastBenchmark)
RunBenchmarks : (LastBenchmark == 0 ) || ($(BenchmarkTimer) >= (4 * $(HOUR)))
CONSOLE_DEVICES = mouse, console
COLLECTOR_HOST_STRING = "$(COLLECTOR_HOST)"
STARTD_EXPRS = COLLECTOR_HOST_STRING
STARTD_JOB_EXPRS = ImageSize, ExecutableSize, JobUniverse, NiceUser
SHADOW_LIST = SHADOW, SHADOW_PVM, SHADOW_STANDARD
SHADOW                  = $(SBIN)/condor_shadow
SHADOW_PVM              = $(SBIN)/condor_shadow.pvm
SHADOW_STANDARD         = $(SBIN)/condor_shadow.std
SCHEDD_ADDRESS_FILE     = $(LOG)/.schedd_address
SHADOW_SIZE_ESTIMATE    = 1800
SHADOW_RENICE_INCREMENT = 10
PERIODIC_EXPR_INTERVAL = 60
QUEUE_SUPER_USERS       = root, condor
PVMD                    = $(SBIN)/condor_pvmd
PVMGS                   = $(SBIN)/condor_pvmgs
VALID_SPOOL_FILES       = job_queue.log, job_queue.log.tmp, history, \
INVALID_LOG_FILES       = core
JAVA = /usr/bin/java
JAVA_MAXHEAP_ARGUMENT = -Xmx
JAVA_CLASSPATH_DEFAULT = $(LIB) $(LIB)/scimark2lib.jar .
JAVA_CLASSPATH_ARGUMENT = -classpath
JAVA_CLASSPATH_SEPARATOR = :
JAVA_BENCHMARK_TIME = 2
JAVA_EXTRA_ARGUMENTS =
GRIDMANAGER                     = $(SBIN)/condor_gridmanager
GAHP                            = $(SBIN)/gahp_server
MAX_GRIDMANAGER_LOG     = 1000000
GRIDMANAGER_DEBUG       = D_COMMAND
GRIDMANAGER_LOG = /tmp/GridmanagerLog.$(USERNAME)
CRED_MIN_TIME_LEFT              = 120