
Re: [Condor-users] parallel universe and sshd.sh



Nicolas,

Hey buddy, just curious: how is your grid set up? Are you using a shared 
filesystem? Not too long ago I was running MPI jobs in the parallel universe 
without a shared filesystem. I do recall seeing some of the things 
you listed while troubleshooting the problem. Greg Thain and I were able 
to hack Condor and the sshd to make it work. I can provide the modifications 
as soon as I can.

Cheers,

Danny Nayar
New Mexico State University


Quoting Nicolas GUIOT <nicolas.guiot@xxxxxxx>:

> Hi all
> 
> I'm coming back on this issue.
> 
> In the sshd.sh script I have by default (6.7.18, yeah I know, I plan to
> upgrade soon...), this line is already replaced with 
> 
> 	if grep "Server listening" sshd.out > /dev/null 2>&1
> 
> But I still have a problem, and very strange things : 
> - First, I had to modify the sshd command line, since I'm on Debian stable,
> where sshd is only 3.8.x and doesn't understand "-oAcceptEnv", so I removed
> it. Maybe that's the reason for my problem (if so, do you know a way to
> work around this?)
> 
> - Then, when I submit the job, it says it's running (condor_q state is R),
> but when I check on the node, I have the following things : 
> 
> guiot@seurat:~/divers/MD$ tail -f /ibpc/charon/condor/execute/dir_28262/sshd.out
> Disabling protocol version 1. Could not load host key
> Bind to port 4465 on 0.0.0.0 failed: Address already in use.
> Cannot bind any address.
> 
> guiot@seurat:~/divers/MD$ tail -f /ibpc/charon/condor/execute/dir_28264/sshd.out
> Disabling protocol version 1. Could not load host key 
> Server listening on 0.0.0.0 port 4468.
> 
> So, as you can see: one of the processes seems to be fine and the other not,
> but in fact, if I run "ps ax|grep sshd", I can see that none of them is running
> (or just the one currently being started, which changes constantly)
> 
> #ps ax|grep sshd
>   758 ?        Ss     0:03 /usr/sbin/sshd
> 10819 ?        Ss     0:00 sshd: root@pts/0
> 28727 ?        SN     0:00 /usr/sbin/sshd -p4474 -oAuthorizedKeysFile=/scratch/condor/execute/dir_28262/tmp/0.key.pub -h/scratch/condor/execute/dir_28262/tmp/hostkey -De -f/dev/null -oStrictModes=no -oPidFile=/dev/null
> 
> 
> and if I check again for the process which was fine (tail sshd.out), it keeps
> telling me it's fine, but it's listening on a new port !!?!?!
> 
> So : Is this related to the changes I had to make (-oAcceptEnv), or is it
> something really apart ? What could I check to solve this ?
> 
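
On the -oAcceptEnv question: as far as I know, AcceptEnv first appeared in OpenSSH 3.9, which would fit a 3.8.x sshd rejecting it. Here is a minimal sketch of a version check, assuming the banner has the usual "OpenSSH_<major>.<minor>..." form. The function name openssh_supports_acceptenv is made up for illustration and is not part of Condor's sshd.sh.

```shell
# Hypothetical helper (not part of Condor's sshd.sh): succeed if an OpenSSH
# version banner such as "OpenSSH_3.8.1p1" reports version 3.9 or later.
# Assumption: AcceptEnv is available from OpenSSH 3.9 onwards.
openssh_supports_acceptenv() {
    # Extract "major minor" from the start of the banner.
    ver=$(printf '%s\n' "$1" | sed -n 's/^OpenSSH_\([0-9][0-9]*\)\.\([0-9][0-9]*\).*/\1 \2/p')
    # An unrecognised banner is treated as "not supported".
    [ -n "$ver" ] || return 1
    # Split "major minor" into positional parameters.
    set -- $ver
    major=$1
    minor=$2
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 9 ]; }
}
```

It could be called as openssh_supports_acceptenv "$(ssh -V 2>&1)", with the AcceptEnv option added to the sshd command line only when the check succeeds. Whether dropping the option is what breaks your job is a separate question that this check does not answer.
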
> Thanks in advance
> Nicolas
> 
> 
> > 
> > Unfortunately the job starts 'running' but is blocked. For some reason 
> > it starts some connections, but does not seem to recognize them (and 
> > then tries again with a new port, again and again). I looked at the 
> > files to find out what might be the reason for this. In 
> > /usr/local/condor/libexec/sshd.sh there is a line like this :
> > 
> > 	if grep "^Server listening on 0.0.0.0 port" sshd.out > /dev/null 2>&1
> > 
> > I replaced this by :
> > 
> > 	if grep "Server listening on :: port" sshd.out > /dev/null 2>&1
> > 
> > Not sure at all if this was a typo, but I had the '^' on both 
> > computers.
> > 
> 
> 
> ----------------------------------------------------
> CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
> Institut de Biologie Physico-Chimique
> 13 rue Pierre et Marie Curie
> 75005 PARIS - FRANCE
> 
> Tel : +33 158 41 51 70
> Fax : +33 158 41 50 26
> ----------------------------------------------------
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
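
On the grep check quoted above: the three variants in this thread differ only in the leading '^' and in the address they expect (0.0.0.0 or ::). Here is a minimal sketch of a check that accepts either address, assuming sshd.out contains lines of the form shown in the quoted output. The function names sshd_is_listening and sshd_listen_port are made up for illustration and are not part of Condor's sshd.sh.

```shell
# Hypothetical helpers (not part of Condor's sshd.sh).

# Succeed if the given sshd log shows the server bound to some address,
# whether IPv4 (0.0.0.0) or IPv6 (::).
sshd_is_listening() {
    # -E: extended regex; the address is any run of non-space characters.
    grep -E 'Server listening on [^ ]+ port [0-9]+' "$1" > /dev/null 2>&1
}

# Print the port from the last "Server listening" line in the log, if any.
sshd_listen_port() {
    sed -n 's/.*Server listening on [^ ]* port \([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1
}
```

Something like sshd_is_listening sshd.out could then stand in for the grep line, and sshd_listen_port sshd.out shows which port the script should believe sshd ended up on. This only loosens the log matching. It does not explain why sshd exits afterwards.
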