
[Condor-users] troubles with lamscript/sshd.sh



I've just installed Condor 6.8.6 on a dedicated Red Hat 4 cluster and I'm working through the various examples.

Condor is installed on an NFS share that is visible on all nodes. Users are listed in /etc/passwd on all nodes, and home directories are NFS-shared as well. Currently, I'm trying to run an MPI job via LAM using lamscript.

I've altered the lamscript in etc/examples/lamscript to set LAMDIR=/usr/bin.

I'm able to ssh to any node in the cluster without being prompted for a password. I'm also able to run the job directly via LAM, and it works correctly.
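For anyone wanting to repeat the passwordless check, it can be scripted roughly as below. This is only a sketch: the node names in the example are placeholders rather than the real cluster hosts, and SSH_CMD is a hypothetical override added so the loop can be exercised without a network. BatchMode=yes makes ssh fail instead of prompting for a password.

```shell
#!/bin/sh
# Sketch: confirm that ssh to each node works without a password prompt.
# SSH_CMD is a hypothetical override so the loop can be tested without ssh.
SSH_CMD=${SSH_CMD:-ssh -o BatchMode=yes -o ConnectTimeout=5}

check_nodes() {
    failed=0
    for node in "$@"; do
        # BatchMode=yes makes ssh fail rather than prompt for a password.
        if $SSH_CMD "$node" true >/dev/null 2>&1; then
            echo "ok   $node"
        else
            echo "FAIL $node"
            failed=1
        fi
    done
    # Non-zero exit status if any node could not be reached.
    return $failed
}

# Example (node names are placeholders):
# check_nodes node00 node01 node02
```

Running the check as the same user that submits the job matters, since the keys in that user's home directory are the ones being tested.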

I've tried two different submit files.  The first version was:

executable = lamscript
arguments  = mpigreetings
machine_count = 9
universe   = parallel
output     = out
error      = err
log        = log
notification = Always
InitialDir = /space/hbrown/condor-test
+WantIOProxy=True
queue

Condor processes this job and I get "Can't connect to chirp server" eight times in the error log. In the output log, I get "error 0 chirp putting identity keys back" nine times.



The second version was:

executable = lamscript
arguments  = mpigreetings
machine_count = 9
universe   = parallel
output     = out
error      = err
log        = log
should_transfer_files = yes
when_to_transfer_output = on_exit
notification = Always
InitialDir = /space/hbrown/condor-test
+WantIOProxy=True
queue

When I run this, the log file shows this message (or a similar one, depending on the CPU it tries to run on) each time it tries to start the job:

007 (083.000.000) 10/22 16:49:18 Shadow exception!
Error from starter on vm1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx: File /cluster/condor/condor-6.8.6/hosts/node00/spool/cluster83.proc0.subproc0/0.key maps to url local:/cluster/condor/condor-6.8.6/hosts/node00/spool/cluster83.proc0.subproc0/0.key, which I don't know how to open.



The job just sits idle until I remove it.

Based on the contents of the lamscript, I'm guessing it never gets past the line ". $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS".
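One way to confirm that guess would be to bracket the line with trace messages. The helper below is a minimal sketch: trace() is a hypothetical function, not part of the stock lamscript, and it assumes that stderr from each node reaches the file named by "error" in the submit file.

```shell
#!/bin/sh
# Sketch: a tiny tracing helper to narrow down where lamscript stops.
# trace() is a hypothetical helper, not part of the stock lamscript.
trace() {
    # Write the message to stderr, tagged with this node's number.
    # A "?" is printed when the Condor variables are not set.
    echo "[node ${_CONDOR_PROCNO:-?}/${_CONDOR_NPROCS:-?}] $*" >&2
}

# Intended placement inside lamscript (shown as comments, not executed here):
#   trace "about to source $SSHD_SH"
#   . $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS
#   trace "returned from sshd.sh"
```

If the first message appears for a node and the second never does, the script is stopping inside sshd.sh on that node.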

Which avenues should I pursue, either to get the chirp server working or to help Condor open a local:/... style URL?

Thanks,

Hugh

--
System Administrator
DIVMS Computer Support Group

University of Iowa
Email: hbrown@xxxxxxxxxxxxxxx
Voice: 319-335-0748
