
[HTCondor-users] Running distributed MPI jobs and ssh settings


I've been setting up a virtual Condor cluster to better understand how to administer it, and I've run into a problem I can't seem to wrap my head around.

I have 2 nodes set up, "head.cluster" and "c1.cluster" (both are mapped to their IPs in the hosts file). head.cluster is the submit machine, runs the schedd, and is also a compute node. c1.cluster is just a compute node. I've set up passwordless ssh login between the two machines (for my local user at least). Both VMs have 4 cores each, so I have 8 compute CPUs in total.

I wrote a distributed hello world application (it just prints each process's rank). I compile it with mpicc (from a user-built Open MPI) and then run condor_submit.

Now, if the job is scheduled to run on a single machine (head.cluster OR c1.cluster), it runs OK and prints the expected output. BUT if the job runs distributed, as in 1 process on head.cluster and 1 on c1.cluster, the job completes but mpirun crashes with:

A daemon (pid 2371) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.

I've checked LD_LIBRARY_PATH and it's exactly the same on both machines. The curious thing is that if I manually run mpirun with the same hostfile generated by the Open MPI script (but without the orte and condor_sshd parameters), it works fine.

If I remove those parameters in the openmpiscript called by the submit file (to make it exactly the same as what I am running outside of Condor), I get the same error, preceded by:
Could not create directory '/.ssh'.
No protocol specified

(gnome-ssh-askpass:2825): Gtk-WARNING **: cannot open display: :0.0
Host key verification failed.
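
Since the '/.ssh' path above suggests that ssh resolves a different home directory under Condor than in my login shell, one way to compare the two environments might be a small script run both by hand and as a job. This is only a sketch, assuming Python 3 is available on both nodes; the function names and the list of variables are illustrative and not part of my actual setup.

```python
import os
import pwd
import socket


def collect_env_report(names=("HOME", "USER", "PATH", "LD_LIBRARY_PATH",
                              "DISPLAY", "SSH_ASKPASS")):
    """Return a dict describing who this process runs as and what it sees."""
    uid = os.getuid()
    try:
        # Home directory from the passwd database, independent of $HOME.
        passwd_home = pwd.getpwuid(uid).pw_dir
    except KeyError:
        passwd_home = "<no passwd entry>"
    report = {
        "hostname": socket.gethostname(),
        "uid": str(uid),
        "passwd_home": passwd_home,
        "cwd": os.getcwd(),
    }
    # Record each variable of interest, marking the ones that are not set.
    for name in names:
        report["env." + name] = os.environ.get(name, "<unset>")
    return report


def format_report(report):
    """Render the report as sorted key=value lines so two runs can be diffed."""
    return "\n".join(f"{key}={report[key]}" for key in sorted(report))


if __name__ == "__main__":
    print(format_report(collect_env_report()))
```

Running it interactively on each node and then submitting it as a Condor job would give key=value listings that can be diffed to see whether the home directory, user, or library path differ.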


I am guessing it's some ssh setting I haven't configured properly, but I can't put my finger on it. Does anyone have an idea?


Manuel Ferreria