
[Condor-users] Parallel Universe MPI Job Killed



Hi All;
 
I have a question about how Condor handles the parallel universe, specifically the starting and stopping of processes on different machines.
 
If a two-node MPI job is submitted to the parallel universe and one node completes before the other, does Condor kill the node that has not yet completed?
 
In a previous thread:
https://lists.cs.wisc.edu/archive/condor-users/2006-February/msg00305.shtml
 
Erik Paulson indicated that it did.
 
" ... The parallel universe has one extra feature: If the first process Condor
started on the first machine Condor started processes on exits, Condor
will kill all remaining processes on every other machine. ..."
 
I was wondering whether the output in my StartLog indicates that the second process was killed properly, or whether it indicates an error. The first node completed successfully.
 
Thanks...
 
Jeff
 
Below is the StartLog from the machine whose process was removed:
 
7/24 20:07:04 Remote global job ID is stengal.cs.sunyit.edu#1185322047#319.0
7/24 20:07:04 JobLeaseDuration defined in job ClassAd: 1200
7/24 20:07:04 Resetting ClaimLease timer (19) with new duration
7/24 20:07:04 About to Create_Process "condor_starter -f stengal.cs.sunyit.edu"
7/24 20:07:04 ProcAPI::buildFamily() Found daddypid on the system: 5324
7/24 20:07:04 Got RemoteUser (wells@xxxxxxxxxxxxxxxxxxxxx) from request classad
7/24 20:07:04 Got universe "PARALLEL" (11) from request classad
7/24 20:07:04 State change: claim-activation protocol successful
7/24 20:07:04 Changing activity: Idle -> Busy
7/24 20:07:05 DaemonCore: Command received via UDP from host <192.168.0.40:33369>
7/24 20:07:05 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
7/24 20:07:06 DaemonCore: Command received via TCP from host <192.168.0.60:38711>
7/24 20:07:06 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
7/24 20:07:06 Called deactivate_claim_forcibly()
7/24 20:07:06 In Starter::kill() with pid 5324, sig 3 (SIGQUIT)
7/24 20:07:07 condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
7/24 20:07:07 IO: EOF reading packet header
7/24 20:07:07 attempt to connect to <192.168.0.40:40894> failed: Connection refused (connect errno = 111).
7/24 20:07:07 ERROR: SECMAN:2004:Failed to start a session to <192.168.0.40:40894> with TCP|SECMAN:2003:TCP connection to <192.168.0.40:40894> failed
 
7/24 20:07:07 Send_Signal: ERROR Connect to <192.168.0.40:40894> failed.
7/24 20:07:07 Error sending signal to starter, errno = 25 (Inappropriate ioctl for device)
7/24 20:07:07 In Starter::kill_kids() with pid 5324, sig 9 (SIGKILL)
7/24 20:07:07 ProcAPI::buildFamily() Found daddypid on the system: 5324
7/24 20:07:07 DaemonCore: Command received via TCP from host <192.168.0.60:45978>
7/24 20:07:07 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling handler (command_handler)
7/24 20:07:07 Called deactivate_claim()
7/24 20:07:07 In Starter::kill() with pid 5324, sig 15 (SIGTERM)
7/24 20:07:07 attempt to connect to <192.168.0.40:40894> failed: Connection refused (connect errno = 111).
7/24 20:07:07 ERROR: SECMAN:2003:TCP auth connection to <192.168.0.40:40894> failed
 
7/24 20:07:07 Send_Signal: ERROR Connect to <192.168.0.40:40894> failed.
7/24 20:07:07 Error sending signal to starter, errno = 25 (Inappropriate ioctl for device)
7/24 20:07:07 In Starter::kill_kids() with pid 5324, sig 9 (SIGKILL)
7/24 20:07:07 ProcAPI::buildFamily() Found daddypid on the system: 5324
7/24 20:07:07 DaemonCore: No more children processes to reap.
7/24 20:07:07 Starter pid 5324 exited with status 4
7/24 20:07:07 ProcAPI::buildFamily failed: parent 5324 not found on system.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 Attempting to remove /home/condor/execute/dir_5324 as SuperUser (root)
7/24 20:07:07 State change: starter exited
7/24 20:07:07 Changing activity: Busy -> Idle
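
To make the shutdown sequence in the log above easier to follow, here is a minimal Python sketch that pulls out only the commands received, the signals sent to the starter, and the starter's exit status. It is not part of Condor. The function name `summarize` and the regular expressions are assumptions based solely on the line formats visible in this StartLog, so other Condor versions may need different patterns. It does not decide whether the shutdown was an error; it only condenses the log.

```python
import re

# A few lines copied verbatim from the StartLog above.
SAMPLE = """\
7/24 20:07:06 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
7/24 20:07:06 In Starter::kill() with pid 5324, sig 3 (SIGQUIT)
7/24 20:07:07 In Starter::kill_kids() with pid 5324, sig 9 (SIGKILL)
7/24 20:07:07 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling handler (command_handler)
7/24 20:07:07 In Starter::kill() with pid 5324, sig 15 (SIGTERM)
7/24 20:07:07 Starter pid 5324 exited with status 4
"""

# Patterns are assumptions derived from the log lines shown in this post.
COMMAND_RE = re.compile(r"received command (\d+) \((\w+)\)")
SIGNAL_RE = re.compile(r"In Starter::(\w+)\(\) with pid (\d+), sig (\d+) \((\w+)\)")
EXIT_RE = re.compile(r"Starter pid (\d+) exited with status (\d+)")


def summarize(log_text):
    """Return an ordered list of (kind, value) events found in the log text."""
    events = []
    for line in log_text.splitlines():
        m = COMMAND_RE.search(line)
        if m:
            events.append(("command", m.group(2)))  # command name
            continue
        m = SIGNAL_RE.search(line)
        if m:
            events.append(("signal", m.group(4)))  # signal name
            continue
        m = EXIT_RE.search(line)
        if m:
            events.append(("exit", int(m.group(2))))  # starter exit status
    return events


events = summarize(SAMPLE)
for kind, value in events:
    print(kind, value)
```

To run it on a full log, pass the file contents to `summarize` instead of `SAMPLE`.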