Hi All,
I have completed installing Condor 6.8.5 on three
Fedora Core 6 machines.
All of my universes appear to be working
satisfactorily; however, I wanted to verify whether the parallel universe is
behaving correctly. When I submit the job below, only one of the two
machines returns output. The other machine indicates that the user terminated
the job (see the e-mail below). I read in an online posting that
this behavior is correct: once a node in a parallel universe job returns
output, the remaining nodes are killed. Is this correct? Is the log output
from my machine correct? Is this the proper way of terminating a node that
has not completed processing? If not, does anyone see where I am making a
mistake?
Thank you all for your time.
Jeff
My parallel job:
#########################
# Parallel Job
#########################
universe = parallel
executable = /bin/hostname
machine_count = 2
log = parallellogfile
output = outfileMPI.$(NODE)
error = errfileMPI.$(NODE)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue
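As a sanity check after the job finishes, each node should have written its hostname to its own output file (the file names come from the submit description above; the loop itself is just a sketch):

```shell
# Each node ran /bin/hostname, so each outfileMPI.$(NODE) should
# contain exactly one line. Report what each node produced:
for f in outfileMPI.0 outfileMPI.1; do
  if [ -s "$f" ]; then
    echo "$f: $(cat "$f")"
  else
    echo "$f: empty or missing"
  fi
done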
I receive correct output from one of the two
machines in my cluster. The e-mail received at the completion of
the job is below; it indicates that the second machine's job was removed by
the user, which I did not knowingly do. Could this be a configuration
error?
From condor@xxxxxxxxxxxxxxxxxxxxx  Tue Jul 24 20:54:18 2007
Return-Path: <condor@xxxxxxxxxxxxxxxxxxxxx>
Received: from stengal.cs.sunyit.edu (localhost.localdomain [127.0.0.1])
        by stengal.cs.sunyit.edu (8.13.8/8.13.8) with ESMTP id l6P0sIOg025549
        for <wells@xxxxxxxxxxxxxxxxxxxxx>; Tue, 24 Jul 2007 20:54:18 -0400
Received: (from condor@localhost)
        by stengal.cs.sunyit.edu (8.13.8/8.13.8/Submit) id l6P0sIWt025548
        for wells@xxxxxxxxxxxxxxxxxxxxx; Tue, 24 Jul 2007 20:54:18 -0400
Date: Tue, 24 Jul 2007 20:54:18 -0400
From: condor@xxxxxxxxxxxxxxxxxxxxx
Message-Id: <200707250054.l6P0sIWt025548@xxxxxxxxxxxxxxxxxxxxx>
To: wells@xxxxxxxxxxxxxxxxxxxxx
Subject: [Condor] Condor Job 321.0

This is an automated email from the Condor system
on machine "stengal.cs.sunyit.edu". Do not reply.

Your Condor-MPI job 321.0 has completed.

Here are the machines that ran your MPI job.
They are listed in the order they were started in,
which is the same as MPI_Comm_rank.

Machine Name             Result
------------------------ -----------
jeter.cs.sunyit.edu      exited normally with status 0
dimagio.cs.sunyit.edu    was removed by the user

Have a nice day.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: wellsj1@xxxxxxxxxxxxx
The Official Condor Homepage is http://www.cs.wisc.edu/condor

Below is the StartLog from the machine that was removed:
7/24 20:07:04 Remote global job ID is stengal.cs.sunyit.edu#1185322047#319.0
7/24 20:07:04 JobLeaseDuration defined in job ClassAd: 1200
7/24 20:07:04 Resetting ClaimLease timer (19) with new duration
7/24 20:07:04 About to Create_Process "condor_starter -f stengal.cs.sunyit.edu"
7/24 20:07:04 ProcAPI::buildFamily() Found daddypid on the system: 5324
7/24 20:07:04 Got RemoteUser (wells@xxxxxxxxxxxxxxxxxxxxx) from request classad
7/24 20:07:04 Got universe "PARALLEL" (11) from request classad
7/24 20:07:04 State change: claim-activation protocol successful
7/24 20:07:04 Changing activity: Idle -> Busy
7/24 20:07:05 DaemonCore: Command received via UDP from host <192.168.0.40:33369>
7/24 20:07:05 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
7/24 20:07:06 DaemonCore: Command received via TCP from host <192.168.0.60:38711>
7/24 20:07:06 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
7/24 20:07:06 Called deactivate_claim_forcibly()
7/24 20:07:06 In Starter::kill() with pid 5324, sig 3 (SIGQUIT)
7/24 20:07:07 condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
7/24 20:07:07 IO: EOF reading packet header
7/24 20:07:07 attempt to connect to <192.168.0.40:40894> failed: Connection refused (connect errno = 111).
7/24 20:07:07 ERROR: SECMAN:2004:Failed to start a session to <192.168.0.40:40894> with TCP|SECMAN:2003:TCP connection to <192.168.0.40:40894> failed
7/24 20:07:07 Send_Signal: ERROR Connect to <192.168.0.40:40894> failed.
7/24 20:07:07 Error sending signal to starter, errno = 25 (Inappropriate ioctl for device)
7/24 20:07:07 In Starter::kill_kids() with pid 5324, sig 9 (SIGKILL)
7/24 20:07:07 ProcAPI::buildFamily() Found daddypid on the system: 5324
7/24 20:07:07 DaemonCore: Command received via TCP from host <192.168.0.60:45978>
7/24 20:07:07 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling handler (command_handler)
7/24 20:07:07 Called deactivate_claim()
7/24 20:07:07 In Starter::kill() with pid 5324, sig 15 (SIGTERM)
7/24 20:07:07 attempt to connect to <192.168.0.40:40894> failed: Connection refused (connect errno = 111).
7/24 20:07:07 ERROR: SECMAN:2003:TCP auth connection to <192.168.0.40:40894> failed
7/24 20:07:07 Send_Signal: ERROR Connect to <192.168.0.40:40894> failed.
7/24 20:07:07 Error sending signal to starter, errno = 25 (Inappropriate ioctl for device)
7/24 20:07:07 In Starter::kill_kids() with pid 5324, sig 9 (SIGKILL)
7/24 20:07:07 ProcAPI::buildFamily() Found daddypid on the system: 5324
7/24 20:07:07 DaemonCore: No more children processes to reap.
7/24 20:07:07 Starter pid 5324 exited with status 4
7/24 20:07:07 ProcAPI::buildFamily failed: parent 5324 not found on system.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 ProcAPI::getProcInfo() pid 5324 does not exist.
7/24 20:07:07 Attempting to remove /home/condor/execute/dir_5324 as SuperUser (root)
7/24 20:07:07 State change: starter exited
7/24 20:07:07 Changing activity: Busy -> Idle
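For anyone wanting to trace the same teardown on their own machines, the claim-deactivation and kill lines can be pulled out of the StartLog with a simple grep (the log path below is an assumption; adjust it for your LOG directory):

```shell
# Path to the startd's StartLog is an assumption -- set it for your install.
LOG=/home/condor/log/StartLog

# Show when the claim was (forcibly) deactivated and which signals
# were sent to the starter process:
grep -E "DEACTIVATE_CLAIM|Starter::kill" "$LOG"
```

In the log above, these lines show the schedd forcibly deactivating the claim at 20:07:06, immediately followed by SIGQUIT/SIGTERM/SIGKILL being delivered to the starter.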