
Re: [Condor-users] Chirp server error?



After adding +WantIOProxy=True, the job still fails. The following information is from the StarterLog.vm{1,2} files on one of the nodes that the process is trying to run on. I was able to run this benchmark in the MPI universe, but since that universe is being replaced by the parallel universe, I need it to run there.
 
VM #1
 
8/23 06:12:32 ******************************************************
8/23 06:12:32 ** condor_starter (CONDOR_STARTER) STARTING UP
8/23 06:12:32 ** /opt/condor/sbin/condor_starter
8/23 06:12:32 ** $CondorVersion: 6.7.19 May 10 2006 $
8/23 06:12:32 ** $CondorPlatform: I386-LINUX_RH9 $
8/23 06:12:32 ** PID = 11855
8/23 06:12:32 ** Log last touched 8/22 17:32:34
8/23 06:12:32 ******************************************************
8/23 06:12:32 Using config file: /etc/condor/condor_config
8/23 06:12:32 Using local config files: /opt/condor/condor_config.local /opt/condor/local.sahp4335/condor_config.local
8/23 06:12:32 DaemonCore: Command Socket at <205.137.83.1:34069>
8/23 06:12:32 Done setting resource limits
8/23 06:12:32 Communicating with shadow <205.137.83.239:42732>
8/23 06:12:32 Submitting machine is "sais079.sandia.gov"
8/23 06:12:32 Job has WantIOProxy=true
8/23 06:12:32 Initialized IO Proxy.
8/23 06:12:32 Starting a PARALLEL universe job with ID: 3210.0
8/23 06:12:32 IWD: /condor_scratch/rnclear/hpl/bin/Linux_P4_goto
8/23 06:12:32 Output file: /condor_scratch/rnclear/hpl/bin/Linux_P4_goto/ross_output.out
8/23 06:12:32 Error file: /condor_scratch/rnclear/hpl/bin/Linux_P4_goto/ross_error.out
8/23 06:12:32 About to exec /condor_scratch/rnclear/hpl/bin/Linux_P4_goto/mp1script condor_exec.exe xhpl
8/23 06:12:32 Create_Process succeeded, pid=11857
8/23 06:12:32 Got SIGQUIT.  Performing fast shutdown.
8/23 06:12:32 ShutdownFast all jobs.
8/23 06:12:33 Got SIGTERM. Performing graceful shutdown.
8/23 06:12:33 ShutdownGraceful all jobs.
8/23 06:12:33 Process exited, pid=11857, status=255
8/23 06:12:33 condor_write(): Socket closed when trying to write buffer, fd is 5
8/23 06:12:33 Buf::write(): condor_write() failed
8/23 06:12:33 Failed to send job exit status to shadow
8/23 06:12:33 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
8/23 06:12:33 Last process exited, now Starter is exiting
8/23 06:12:33 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
 
VM #2
 
8/23 06:12:32 ******************************************************
8/23 06:12:32 ** condor_starter (CONDOR_STARTER) STARTING UP
8/23 06:12:32 ** /opt/condor/sbin/condor_starter
8/23 06:12:32 ** $CondorVersion: 6.7.19 May 10 2006 $
8/23 06:12:32 ** $CondorPlatform: I386-LINUX_RH9 $
8/23 06:12:32 ** PID = 11856
8/23 06:12:32 ** Log last touched 8/22 17:32:29
8/23 06:12:32 ******************************************************
8/23 06:12:32 Using config file: /etc/condor/condor_config
8/23 06:12:32 Using local config files: /opt/condor/condor_config.local /opt/condor/local.sahp4335/condor_config.local
8/23 06:12:32 DaemonCore: Command Socket at <205.137.83.1:34070>
8/23 06:12:32 Done setting resource limits
8/23 06:12:32 Communicating with shadow <205.137.83.239:42732>
8/23 06:12:32 Submitting machine is "sais079.sandia.gov"
8/23 06:12:32 Job has WantIOProxy=true
8/23 06:12:32 Initialized IO Proxy.
8/23 06:12:32 Starting a PARALLEL universe job with ID: 3210.0
8/23 06:12:32 IWD: /condor_scratch/rnclear/hpl/bin/Linux_P4_goto
8/23 06:12:32 Output file: /condor_scratch/rnclear/hpl/bin/Linux_P4_goto/ross_output.out
8/23 06:12:32 Error file: /condor_scratch/rnclear/hpl/bin/Linux_P4_goto/ross_error.out
8/23 06:12:32 About to exec /condor_scratch/rnclear/hpl/bin/Linux_P4_goto/mp1script condor_exec.exe xhpl
8/23 06:12:32 Create_Process succeeded, pid=11869
8/23 06:12:32 Got SIGQUIT.  Performing fast shutdown.
8/23 06:12:32 ShutdownFast all jobs.
8/23 06:12:33 Got SIGTERM. Performing graceful shutdown.
8/23 06:12:33 ShutdownGraceful all jobs.
8/23 06:12:33 Process exited, pid=11869, status=255
8/23 06:12:33 condor_write(): Socket closed when trying to write buffer, fd is 5
8/23 06:12:33 Buf::write(): condor_write() failed
8/23 06:12:33 Failed to send job exit status to shadow
8/23 06:12:33 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
8/23 06:12:33 Last process exited, now Starter is exiting
8/23 06:12:33 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0

 

 

 

--
Richard N. Cleary
Sandia National Laboratories
Dept. 4324 Infrastructure Computing Systems
Email: rnclear@xxxxxxxxxx
Phone: 505.845.7836

 


From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Becky Gietzel
Sent: Tuesday, August 22, 2006 5:51 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Chirp server error?


Add

+WantIOProxy=True

to your submit file.
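
As a rough sketch, assuming the rest of the submit file quoted below stays as it is, the attribute can go on its own line anywhere before the queue statement:

```
universe = parallel
executable = mp1script
arguments = xhpl
machine_count = 4
environment = "P4_GLOBMEMSIZE=16777290"
+RemoteSpoolDir = "unused"
# Added line: asks the starter to set up the IO proxy for the job
+WantIOProxy = True
log = ross.log
output = ross_output.out
error = ross_error.out
queue
```

Once the job runs, the StarterLog on the execute node should contain a line such as "Job has WantIOProxy=true" if the attribute was picked up.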

--Becky


On Aug 22, 2006, at 6:25 PM, Cleary Jr, Richard N wrote:

Hello,

I'm trying to run a Linpack benchmark in the parallel universe, and I am getting the following error regarding the chirp server. Any idea what it means, or what I might have forgotten?

---condor_submit script

buzzard-master[rnclear]: cat ross.condor
######################################
## Example submit description file
## for MPICH 1 MPI
## works with MPICH 1.2.4, 1.2.5 and 1.2.6
######################################
universe = parallel
executable = mp1script
arguments = xhpl
machine_count = 4
environment = "P4_GLOBMEMSIZE=16777290"
+RemoteSpoolDir = "unused"
log = ross.log
output = ross_output.out
error = ross_error.out
queue

--- output of mp1script -----

buzzard-master[rnclear]: cat ross_error.out
+ _CONDOR_PROCNO=1
+ _CONDOR_NPROCS=4
++ condor_config_val libexec
+ CONDOR_SSH=/opt/condor/libexec
+ CONDOR_SSH=/opt/condor/libexec/condor_ssh
++ condor_config_val libexec
+ SSHD_SH=/opt/condor/libexec
+ SSHD_SH=/opt/condor/libexec/sshd.sh
+ . /opt/condor/libexec/sshd.sh 1 4
++ trap sshd_cleanup 15
+++ condor_config_val CONDOR_SSHD
++ SSHD=/usr/sbin/sshd
+++ condor_config_val CONDOR_SSH_KEYGEN
++ KEYGEN=/usr/bin/ssh-keygen
+++ condor_config_val libexec
++ CONDOR_CHIRP=/opt/condor/libexec
++ CONDOR_CHIRP=/opt/condor/libexec/condor_chirp
++ PORT=4444
++ _CONDOR_REMOTE_SPOOL_DIR=/opt/condor/local.buzzard-master/spool/cluster3208.proc0.subproc0
++ _CONDOR_PROCNO=1
++ _CONDOR_NPROCS=4
++ mkdir /opt/condor/local.sahp5785/execute/dir_30263/tmp
++ hostkey=/opt/condor/local.sahp5785/execute/dir_30263/tmp/hostkey
++ /bin/rm -f /opt/condor/local.sahp5785/execute/dir_30263/tmp/hostkey /opt/condor/local.sahp5785/execute/dir_30263/tmp/hostkey.++ /++ /usr/bin/ssh-keygen -q -f /opt/condor/local.sahp4615/execute/dir_30531/tmp/hostkey -t rsa -++ '[++ '[' 0 -ne 0 ']'

++ idkey=/opt/condor/local.sahp4661/execute/dir_30568/tmp/3.key
++ /usr/bin/ssh-keygen -q -f /opt/condor/local.sahp4661/execute/dir_30568/tmp/3.key -t rsa -N ''
+++ sshd_cleanup
+++ /bin/rm -f /opt/condor/local.sahp4661/execute/dir_30568/tmp/hostkey /opt/condor/local.sahp4661/execute/dir_30568/tmp/hostkey.pub /opt/condor/local.sahp4661/execute/dir_30568/tmp/3.key /opt/condor/local.sahp4661/execute/dir_30568/tmp/3.key.pub sshd.out /opt/condor/local.sahp4661/execute/dir_30568/contact

++ '[' 0 -ne 0 ']'
++ /opt/condor/libexec/condor_chirp put -perm 0700 /opt/condor/local.sahp4661/execute/dir_30568/tmp/3.key /opt/condor/local.buzzard-master/spool/cluster3208.proc0.subproc0/3.key

Can't connect to chirp server
++ '[' 255 -ne 0 ']'
++ echo error 0 chirp putting identity keys back
++ exit -1
--
Richard N. Cleary
Sandia National Laboratories
Dept. 4324 Infrastructure Computing Systems
Email: rnclear@xxxxxxxxxx
Phone: 505.845.7836

