Condor is installed on an NFS share that is visible on all nodes. Users are in /etc/passwd on all nodes, and home directories are all NFS-shared too. Currently, I'm trying to run an MPI job via LAM using lamscript.
I've altered the lamscript from etc/examples/lamscript to set LAMDIR=/usr/bin.
I'm able to ssh to any node in the cluster w/o being prompted for a password. I'm able to run the job directly via lam and it works correctly.
I've tried two different submit files. The first version was:

executable = lamscript
arguments = mpigreetings
machine_count = 9
universe = parallel
output = out
error = err
log = log
notification = Always
InitialDir = /space/hbrown/condor-test
+WantIOProxy=True
queue

Condor processes this job and I get "Can't connect to chirp server" eight times in the error log. In the output log, I get "error 0 chirp putting identity keys back" nine times.
The second version was:

executable = lamscript
arguments = mpigreetings
machine_count = 9
universe = parallel
output = out
error = err
log = log
should_transfer_files = yes
when_to_transfer_output = on_exit
notification = Always
InitialDir = /space/hbrown/condor-test
+WantIOProxy=True
queue

When I run this, the log file has this message (or a similar one, depending on the CPU it tries to run on) every time it tries to start the job:
007 (083.000.000) 10/22 16:49:18 Shadow exception!
    Error from starter on vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx: File /cluster/condor/condor-6.8.6/hosts/node00/spool/cluster83.proc0.subproc0/0.key maps to url local:/cluster/condor/condor-6.8.6/hosts/node00/spool/cluster83.proc0.subproc0/0.key, which I don't know how to open.
The job just sits idle until I remove it. Based on the contents of the lamscript, I'm guessing it never gets past the line ". $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS".
Which avenues should I pursue for either getting the chirp server working or helping condor open up a local:/... style url?
Thanks,
Hugh
--
System Administrator
DIVMS Computer Support Group
University of Iowa
Email: hbrown@xxxxxxxxxxxxxxx
Voice: 319-335-0748