
Re: [HTCondor-users] Running distributed MPI jobs and ssh settings



On Tue, 2013-03-05 at 12:43 -0300, Manuel Ferreria wrote:
> Hello,
> 
> 
> I've been trying to set up a virtual condor cluster, to better
> understand how to administer it. I've been running into this problem
> and can't seem to wrap my head around it.
> 
> 
> I have 2 nodes set up, "head.cluster" and "c1.cluster" (both are mapped
> using the hosts file to the corresponding IPs). head.cluster is the
> submitter, the schedd, and a compute node. c1.cluster is just a compute
> node. I've set up passwordless ssh login between both machines (for my
> local user at least). Both VMs have 4 cores each, so I have 8 compute
> CPUs in total.
> 
> 
> I made a distributed hello world application (it just prints the rank).
> I compile it with mpicc (which uses a user-built Open MPI) and then run
> condor_submit.
> 
> 
> Now, if the job is scheduled to run on a single machine (head.cluster
> OR c1.cluster), the job runs OK and prints the expected output. BUT if
> the job runs distributed, as in 1 process on head.cluster and 1 on
> c1.cluster, the job completes but mpirun crashes with:
> 
> 
> --------------------------------------------------------------------------
> A daemon (pid 2371) died unexpectedly with status 255 while attempting
> to launch so we are aborting.

Any idea what pid 2371 was? sshd?

> 
> 
> There may be more information reported by the environment (see above).
> 
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
> the location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> 
> 
> 
> 
> I've checked LD_LIBRARY_PATH and it's exactly the same on both
> machines. The curious thing is that if I manually run mpirun with the
> same hostfile generated by the openmpiscript (but without the orte and
> condor_sshd parameters), it works fine.
> 
> 
> If I remove those parameters in the openmpiscript called by the submit
> file (to make it exactly the same as what I am running outside of
> condor), I get the same error, preceded by:
> ----
> Could not create directory '/.ssh'.
> No protocol specified
> 
Sounds like you are trying to set up ssh keys despite already having
passwordless ssh configured. I'd have to see the openmpiscript and the
submit description file (sdf) to say much more.
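
One way to narrow this down, as a rough sketch only: run a small script
both as the Condor job and from your interactive shell, then compare the
two outputs. The "Could not create directory '/.ssh'" message suggests
ssh resolved the home directory to "/", which could happen if the job
runs as a different user than the one whose keys you set up. The
gnome-ssh-askpass warning suggests DISPLAY or SSH_ASKPASS may be set in
the job environment. The function name describe_job_identity is made up
for illustration and is not part of HTCondor or Open MPI.

```python
import os
import pwd


def describe_job_identity(environ=None):
    """Collect the user and environment details that ssh and mpirun rely on.

    Hypothetical helper for illustration: pass a dict to inspect a
    captured environment, or nothing to inspect the current process.
    """
    env = os.environ if environ is None else environ
    try:
        entry = pwd.getpwuid(os.geteuid())
        user, passwd_home = entry.pw_name, entry.pw_dir
    except KeyError:
        # The effective uid has no passwd entry on this machine.
        user, passwd_home = None, None
    return {
        "user": user,
        "passwd_home": passwd_home,  # home directory from the passwd database
        "HOME": env.get("HOME"),
        "DISPLAY": env.get("DISPLAY"),
        "SSH_ASKPASS": env.get("SSH_ASKPASS"),
        "LD_LIBRARY_PATH": env.get("LD_LIBRARY_PATH"),
    }


if __name__ == "__main__":
    for key, value in describe_job_identity().items():
        print(f"{key}={value!r}")
```

If the user or home directory differs between the two runs, that would
point at the job's identity rather than at the ssh keys themselves.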



> 
> (gnome-ssh-askpass:2825): Gtk-WARNING **: cannot open display: :0.0
> Host key verification failed.
> 
> 
> ----------------
> 
> 
> 
> 
> I am guessing it's some ssh setting I haven't configured properly,
> but I can't put my finger on it. Does anyone have an idea?
> 
> 
> Regards,
> 
> 
> Manuel Ferreria