[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] shadow exited



> Scott Koranda wrote:
> 
> > So I have examined the
> > 
> > - StartLog
> > - SchedLog
> > - NegotiatorLog
> > - MatchLog
> > 
> > but I can't find a reason for why this job is cycling over and
> > over again onto machines only to be forcibly evicted.
> > 
> > Any help with this investigation would be really appreciated.
> 
> You really want to check the ShadowLog...
> 
>     fgrep "(584394.0)" ShadowLog
> 
> This is where I usually find the best information.
> 

So I am seeing repeats of this cycle in the ShadowLog:

6/16 05:34:37 (584394.0) (11192):Shadow: Entering send_job()
6/16 05:34:37 (584394.0) (11192):send capability <129.89.200.233:35806>#1340272310
6/16 05:34:37 (584394.0) (11192):Requesting Primary Starter
6/16 05:34:37 (584394.0) (11192):Shadow: Request to run a job was ACCEPTED
6/16 05:34:37 (584394.0) (11192):host = medusa-slave223.medusa.phys.uwm.edu inet_addr = 0xe9c85981 port1 = 36774 port2 = 36775
6/16 05:34:37 (584394.0) (11192):Shadow: RSC_SOCK connected, fd = 17
6/16 05:34:37 (584394.0) (11192):Shadow: CLIENT_LOG connected, fd = 18
6/16 05:34:37 (584394.0) (11192):UserLog = /people/duncan/log/inspiral_pipe.dag.log.3
6/16 05:34:37 (584394.0) (11192):My_Filesystem_Domain = "medusa.phys.uwm.edu"
6/16 05:34:37 (584394.0) (11192):My_UID_Domain = "medusa.phys.uwm.edu"
6/16 05:34:37 (584394.0) (11192):HandleSyscalls: about to chdir(/mnt/scratch/duncan/macho/2004052002_l1_injections_45)
6/16 05:34:37 (584394.0) (11192):       Entering pseudo_get_file_stream
6/16 05:34:37 (584394.0) (11192):       file = "/opt/condor/home/spool/cluster584394.ickpt.subproc0"
6/16 05:34:37 (584394.0) (11192):       len = 9703054
6/16 05:34:37 (584394.0) (11192):       129.89.201.233
6/16 05:34:37 (584394.0) (11192):       129.89.201.233
6/16 05:34:37 (584394.0) (11192):       connect_sock = 3
6/16 05:34:37 (584394.0) (11192):       Listening...
6/16 05:34:37 (584394.0) (11192):       Port = 46118
6/16 05:34:37 (584394.0) (11193):       Got data connection at fd 4
6/16 05:34:37 (584394.0) (11193):       Should Send 9703054 bytes of data
6/16 05:34:38 (584394.0) (11193):       Child Shadow: STREAM FILE XFER COMPLETE - 9703054 bytes
6/16 05:34:38 (584394.0) (11192):Reaped child status - pid 11193 exited with status 0
6/16 05:34:38 (584394.0) (11192):Read: User Job - $CondorPlatform: I386-LINUX-RH72 $
6/16 05:34:38 (584394.0) (11192):User job is compatible with this shadow version6/16 05:34:38 (584394.0) (11192):Read: User Job - $CondorVersion: 6.6.1 Feb  5 2004 $
6/16 05:34:38 (584394.0) (11192):Read: Checkpoint file name is "/opt/condor/home/spool/cluster584394.proc0.subproc0"
6/16 05:36:33 (584394.0) (11192):send capability <129.89.200.233:35806>#1340272310
6/16 05:36:33 (584394.0) (11192):Sent command 404 to startd at <129.89.200.233:35806> with cap <129.89.200.233:35806>#1340272310
6/16 05:36:33 (584394.0) (11192):Shadow: Job 584394.0 exited, termsig = 9, coredump = 0, retcode = 0

Question: What does termsig = 9 mean? 

Is that the return value the checkpointed image (the
executable, the user's code) is returning or does it mean that
the checkpointed image received signal 9, a SIGKILL?

I presume it means that the checkpointed image received a
SIGKILL. If that is so, how can I figure out 

a) what sent the SIGKILL?
b) why was it sent?

What log file should I be looking in to understand why this
job is being killed?

(This is the rest of the information from the ShadowLog)

6/16 05:36:33 (584394.0) (11192):Entering Wrapup()
6/16 05:36:33 (584394.0) (11192):handle_termination() called.
6/16 05:36:33 (584394.0) (11192):Shadow: Job was kicked off without a checkpoint6/16 05:36:33 (584394.0) (11192):Shadow: Entered DoCleanup()
6/16 05:36:33 (584394.0) (11192):Shadow: DoCleanup: unlinking TmpCkpt '/opt/condor/home/spool/cluster584394.proc0.subproc0.tmp'
6/16 05:36:33 (584394.0) (11192):Trying to unlink /opt/condor/home/spool/cluster584394.proc0.subproc0.tmp
6/16 05:36:33 (584394.0) (11192):Remove from ckpt server returns -1
6/16 05:36:33 (584394.0) (11192):user_time = 4 ticks
6/16 05:36:33 (584394.0) (11192):sys_time = 3 ticks
6/16 05:36:47 (584394.0) (11192):AUTHENTICATE_FS: used file /tmp/qmgr_BwWtKD, status: 1
6/16 05:36:47 (584394.0) (11192):Entering update_rusage()
6/16 05:36:47 (584394.0) (11192):Entering update_rusage()
6/16 05:36:47 (584394.0) (11192):TIME DEBUG 3 USR remotep=0 Proc=0 utime=0.000000
6/16 05:36:47 (584394.0) (11192):TIME DEBUG 4 SYS remotep=0 Proc=0 utime=0.000000
6/16 05:36:47 (584394.0) (11192):********** Shadow Exiting(107) **********

I have also just examined the StarterLog on a machine where
this tob tried to run (corresponding to the ShadowLog entry
above) and I see this:

6/16 05:34:38 Started user job - PID = 11054
6/16 05:34:38 cmd_fp = 0x8361128
6/16 05:34:38 end
6/16 05:34:38   *FSM* Transitioning to state "SUPERVISE"
6/16 05:34:38   *FSM* Executing state func "supervise_all()" [ GET_NEW_PROC SUSPEND VACATE ALARM DIE CHILD_EXIT PERIODIC_CKPT ]
6/16 05:34:38 Set alarm for 300 seconds
6/16 05:36:32   *FSM* Got asynchronous event "CHILD_EXIT"
6/16 05:36:33   *FSM* Executing transition function "reaper"
6/16 05:36:33 Canceled alarm
6/16 05:36:33 Process 11054 exited, searching process list...
6/16 05:36:33 Found object for process 11054
6/16 05:36:33 Process 11054 killed by signal 9
6/16 05:36:33 Process exited by request
6/16 05:36:33 No core file was created
6/16 05:36:33   *FSM* Transitioning to state "PROC_EXIT"
6/16 05:36:33   *FSM* Executing state func "proc_exit()" [ DIE ]
6/16 05:36:33   *FSM* Executing transition function "dispose_one"
6/16 05:36:33 Sending final status for process 584394.0
6/16 05:36:33 STATUS encoded as CKPT, *NOT* TRANSFERRED
6/16 05:36:33 User time = 0.000000 seconds
6/16 05:36:33 System time = 0.000000 seconds
6/16 05:36:33 Done sending final status for process 584394.0

So it very much looks like the user's job is being sent a
signal 9 (SIGKILL).

What is sending it?

Thanks for any insights,

Scott