
Re: [Condor-users] parallel universe and sshd.sh



Nicolas,

Yeah, you are right, it is the condor_ssh script (I wasn't at work when I wrote
that). Can you verify for me which user is actually running the job? You
mentioned that you are using a shared file system, and I'm curious: is the user
"nobody" actually running your jobs?

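One quick way to check, as a rough sketch: the scratch directory that Condor
creates for the job under the execute directory should normally be owned by
the account the job runs as. The dir_owner helper below is a made-up name, not
part of Condor, and the path in the usage comment is simply the one from your
earlier mail, so adjust both to your setup.

```shell
#!/bin/sh
# dir_owner is a hypothetical helper, not part of Condor.
# Print the owning user of a directory (third field of `ls -ld`).
dir_owner() {
    ls -ld -- "$1" | awk '{print $3}'
}

# Example, using the scratch directory path quoted earlier in this thread:
# dir_owner /ibpc/charon/condor/execute/dir_28262
```

If that prints "nobody", the job is not running under your own account.
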
Talk to you soon,

Danny
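
P.S. On the grep check discussed further down the thread: the two variants
quoted there each match only one listen address (0.0.0.0 or ::). A small
sketch that accepts either is below. sshd_is_listening is a made-up helper
name, and this is not the check that ships in sshd.sh, just one way the test
could be loosened.

```shell
#!/bin/sh
# sshd_is_listening is a hypothetical helper, not taken from Condor's sshd.sh.
# Succeed if the given sshd log contains a "Server listening on <addr> port"
# line, whatever the address (0.0.0.0, ::, ...) and wherever it sits on the line.
sshd_is_listening() {
    grep "Server listening on .* port" "$1" > /dev/null 2>&1
}

# Example:
# if sshd_is_listening sshd.out; then echo "sshd is up"; fi
```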



Quoting Nicolas GUIOT <nicolas.guiot@xxxxxxx>:

> Danny,
> 
> No, not really: I can't find any $prog in sshd.sh (but I did find one
> condor_ssh, maybe that is the one you meant). Personally, I only added
> "/bin/" in front of mkdir and sleep so that the script can find them :)
> 
> But this still leaves me with an sshd server that can't start/listen.
> 
> Nicolas
> 
> ----------------
> On Thu,  1 Jun 2006 09:36:46 -0600
> rnayar@xxxxxxxx wrote:
> 
> > Ugh.
> > 
> > I had just written a really long answer to the problem, and our crappy
> > email service screwed up yet again!
> > 
> > Nicolas, what you are probably referring to is changing "$prog" to
> > "./$prog" in the sshd.sh script, correct?
> > 
> > Danny
> > 
> > 
> > 
> > 
> > Quoting Nicolas GUIOT <nicolas.guiot@xxxxxxx>:
> > 
> > > Hi,
> > > 
> > > Yes, I am using NFS.
> > > 
> > > I'm interested in your modified sshd.sh (maybe something in it could
> > > help me...)
> > > 
> > > By the way, I have already touched it a bit: for example, it couldn't
> > > find the "mkdir" and "sleep" commands (probably $PATH isn't set
> > > anywhere, or isn't picked up from where it should be), but this is a
> > > minor problem that I could solve...
> > > 
> > > ++
> > > Nicolas
> > > 
> > > ----------------
> > > On Wed, 31 May 2006 10:12:27 -0600
> > > rnayar@xxxxxxxx wrote:
> > > 
> > > > Nicolas,
> > > > 
> > > > Hey buddy, just curious, how is your grid set up? Are you using a
> > > > shared filesystem? Not too long ago I was running MPI jobs in the
> > > > parallel universe without a shared filesystem. I do recall seeing
> > > > some of the things you listed while troubleshooting the problem.
> > > > Greg Thain and I were able to hack Condor and the sshd to make it
> > > > work. I can provide the modifications as soon as I can.
> > > > 
> > > > Cheers,
> > > > 
> > > > Danny Nayar
> > > > New Mexico State University
> > > > 
> > > > 
> > > > Quoting Nicolas GUIOT <nicolas.guiot@xxxxxxx>:
> > > > 
> > > > > Hi all
> > > > > 
> > > > > I'm coming back on this issue.
> > > > > 
> > > > > In the sshd.sh script I have by default (6.7.18, yeah I know, I plan
> > > > > to upgrade soon...), this line is already replaced with
> > > > > 
> > > > > 	if grep "Server listening" sshd.out > /dev/null 2>&1
> > > > > 
> > > > > But I still have a problem, and I see very strange things:
> > > > > - First, I had to modify the sshd command line, since I'm on Debian
> > > > > stable, where sshd is only 3.8.x and doesn't understand
> > > > > "-oAcceptEnv", so I removed it. Maybe that is the reason for my
> > > > > problem (if so, do you know a way to work around this?)
> > > > > 
> > > > > - Then, when I submit the job, it says it's running (condor_q state
> > > > > is R), but when I check on the node, I see the following:
> > > > > 
> > > > > guiot@seurat:~/divers/MD$ tail -f /ibpc/charon/condor/execute/dir_28262/sshd.out
> > > > > Disabling protocol version 1. Could not load host key
> > > > > Bind to port 4465 on 0.0.0.0 failed: Address already in use.
> > > > > Cannot bind any address.
> > > > > 
> > > > > guiot@seurat:~/divers/MD$ tail -f /ibpc/charon/condor/execute/dir_28264/sshd.out
> > > > > Disabling protocol version 1. Could not load host key 
> > > > > Server listening on 0.0.0.0 port 4468.
> > > > > 
> > > > > So, as you can see, one of the processes seems to be fine and the
> > > > > other not. But in truth, if I run "ps ax|grep sshd", I see none of
> > > > > them running (or just the one currently being started, which changes
> > > > > constantly):
> > > > > 
> > > > > #ps ax|grep sshd
> > > > >   758 ?        Ss     0:03 /usr/sbin/sshd
> > > > > 10819 ?        Ss     0:00 sshd: root@pts/0
> > > > > 28727 ?        SN     0:00 /usr/sbin/sshd -p4474 -oAuthorizedKeysFile=/scratch/condor/execute/dir_28262/tmp/0.key.pub -h/scratch/condor/execute/dir_28262/tmp/hostkey -De -f/dev/null -oStrictModes=no -oPidFile=/dev/null
> > > > > 
> > > > > 
> > > > > And if I check again on the process that was fine (tail sshd.out),
> > > > > it keeps telling me it's fine, but it's listening on a new port!
> > > > > 
> > > > > So: is this related to the change I had to make (removing
> > > > > -oAcceptEnv), or is it something entirely separate? What could I
> > > > > check to solve this?
> > > > > 
> > > > > Thanks in advance
> > > > > Nicolas
> > > > > 
> > > > > 
> > > > > > 
> > > > > > Unfortunately the job starts 'running' but is blocked. For some
> > > > > > reason it starts some connections, but does not seem to recognize
> > > > > > them (and then tries again with a new port, again and again). I
> > > > > > tried to look at the files to find out what the reason might be. In
> > > > > > /usr/local/condor/libexec/sshd.sh there is a line like this:
> > > > > > 
> > > > > > 	if grep "^Server listening on 0.0.0.0 port" sshd.out > /dev/null 2>&1
> > > > > > 
> > > > > > I replaced it with:
> > > > > > 
> > > > > > 	if grep "Server listening on :: port" sshd.out > /dev/null 2>&1
> > > > > > 
> > > > > > I'm not at all sure whether it was a typo, but the '^' was there on
> > > > > > both computers.
> > > > > > 
> > > > > 
> > > > > 
> > > 
> > > ----------
> > > 
> > > ----------------------------------------------------
> > > CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
> > > Institut de Biologie Physico-Chimique
> > > 13 rue Pierre et Marie Curie
> > > 75005 PARIS - FRANCE
> > > 
> > > Tel : +33 158 41 51 70
> > > Fax : +33 158 41 50 26
> > > ----------------------------------------------------
> > > _______________________________________________
> > > Condor-users mailing list
> > > Condor-users@xxxxxxxxxxx
> > > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> > > 
> > 
> > 
> > 
> 
>