Hi All;
I have a question regarding the way Condor handles
the parallel universe and the starting and stopping of processes on different
machines.
If a two-node MPI job is submitted using the
parallel universe and one node completes before the other, does Condor kill the
node that has not yet completed?
In a previous thread, Erik Paulson indicated that it did:
" ... The parallel universe has one extra feature: If the first process Condor
started on the first machine Condor started processes on exits, Condor will
kill all remaining processes on every other machine. ..."
I was wondering whether the output in my StartLog indicates that the second
process was killed properly, or whether it indicates an error. The first node
completed successfully.
Thanks...
Jeff
Below is the StartLog from the machine whose process was removed:
7/24 20:07:04 Remote global job ID is stengal.cs.sunyit.edu#1185322047#319.0
7/24 20:07:04 JobLeaseDuration defined in job ClassAd: 1200
7/24 20:07:04 Resetting ClaimLease timer (19) with new duration
7/24 20:07:04 About to Create_Process "condor_starter -f stengal.cs.sunyit.edu"
7/24 20:07:04 ProcAPI::buildFamily() Found daddypid on the system: 5324
7/24 20:07:04 Got RemoteUser (wells@xxxxxxxxxxxxxxxxxxxxx) from request classad
7/24 20:07:04 Got universe "PARALLEL" (11) from request classad
7/24 20:07:04 State change: claim-activation protocol successful
7/24 20:07:04 Changing activity: Idle -> Busy
7/24 20:07:05 DaemonCore: Command received via UDP from host <192.168.0.40:33369>
7/24 20:07:05 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
7/24 20:07:06 DaemonCore: Command received via TCP from host <192.168.0.60:38711>
7/24 20:07:06 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
7/24 20:07:06 Called deactivate_claim_forcibly()
7/24 20:07:06 In Starter::kill() with pid 5324, sig 3 (SIGQUIT)
7/24 20:07:07 condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
7/24 20:07:07 IO: EOF reading packet header
7/24 20:07:07 attempt to connect to <192.168.0.40:40894> failed: Connection refused (connect errno = 111).
7/24 20:07:07 ERROR: SECMAN:2004:Failed to start a session to <192.168.0.40:40894> with TCP|SECMAN:2003:TCP connection to <192.168.0.40:40894> failed
7/24 20:07:07 Send_Signal: ERROR Connect to <192.168.0.40:40894> failed.
7/24 20:07:07 Error sending signal to starter, errno = 25 (Inappropriate ioctl for device)
7/24 20:07:07 In Starter::kill_kids() with pid 5324, sig 9 (SIGKILL)
7/24 20:07:07 ProcAPI::buildFamily() Found daddypid on the system: 5324
7/24 20:07:07 DaemonCore: Command received via TCP from host <192.168.0.60:45978>
7/24 20:07:07 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling handler (command_handler)
7/24 20:07:07 Called deactivate_claim()
7/24 20:07:07 In Starter::kill() with pid 5324, sig 15 (SIGTERM)
7/24 20:07:07 attempt to connect to <192.168.0.40:40894> failed: Connection refused (connect errno = 111).
7/24 20:07:07 ERROR: SECMAN:2003:TCP auth connection to <192.168.0.40:40894> failed
7/24 20:07:07 Send_Signal: ERROR Connect to <192.168.0.40:40894> failed.
7/24 20:07:07 Error sending signal to starter, errno = 25 (Inappropriate ioctl for device)
7/24 20:07:07 In Starter::kill_kids() with pid 5324, sig 9 (SIGKILL)
7/24 20:07:07 ProcAPI::buildFamily() Found daddypid on the system: 5324
7/24 20:07:07 DaemonCore: No more children processes to reap.
7/24 20:07:07 Starter pid 5324 exited with status 4
7/24 20:07:07 ProcAPI::buildFamily failed: parent 5324 not found on system.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 Attempting to remove /home/condor/execute/dir_5324 as SuperUser (root)
7/24 20:07:07 State change: starter exited
7/24 20:07:07 Changing activity: Busy -> Idle
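In case it helps anyone reading the log, here is a rough Python sketch that pulls out only the signal attempts (the "In Starter::kill()" and "In Starter::kill_kids()" entries), so the SIGQUIT / SIGKILL / SIGTERM sequence is easier to follow. It assumes the timestamp and message format shown in my StartLog; the function name and the regular expression are my own and are not part of Condor.

```python
import re

# Matches StartLog entries such as:
#   7/24 20:07:06 In Starter::kill() with pid 5324, sig 3 (SIGQUIT)
# The pattern is an assumption based on the log excerpt in this post.
SIGNAL_RE = re.compile(
    r"(\d+/\d+ \d+:\d+:\d+) In Starter::(kill|kill_kids)\(\) "
    r"with pid (\d+), sig (\d+) \((\w+)\)"
)

def signal_events(log_text):
    """Return (timestamp, method, pid, signal_name) for each signal entry found."""
    return [
        (m.group(1), m.group(2), int(m.group(3)), m.group(5))
        for m in SIGNAL_RE.finditer(log_text)
    ]

# Three entries copied from the log; only the first two are signal attempts.
sample = (
    "7/24 20:07:06 In Starter::kill() with pid 5324, sig 3 (SIGQUIT)\n"
    "7/24 20:07:07 In Starter::kill_kids() with pid 5324, sig 9 (SIGKILL)\n"
    "7/24 20:07:07 Starter pid 5324 exited with status 4\n"
)

for event in signal_events(sample):
    print(event)
```

The search runs over the whole text rather than line by line, so it should also work if the log has been pasted with its line breaks lost.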