
Re: [Condor-users] parallel universe and sshd.sh



Danny, 

I inserted a "whoami" and an "id" in sshd.sh, and it's running as me, with correct uid/gid.
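
A minimal sketch of that kind of check, for anyone who wants to repeat it. The function name log_identity and the log file argument are made up for illustration and are not part of the stock sshd.sh. With a stripped-down PATH, "id" may need its absolute path.

```shell
#!/bin/sh
# Hypothetical debugging helper, not part of the stock sshd.sh.
# Appends the effective user name, the full id output, and the current PATH
# to the given log file. If PATH lacks the system directories, replace "id"
# with its absolute path on your system.
log_identity() {
    logfile="$1"
    {
        echo "user: $(id -un)"
        echo "ids:  $(id)"
        echo "PATH=$PATH"
    } >> "$logfile"
}
```

It could be called near the top of the script with a file in the job's scratch directory as its argument.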

I am using NIS, and the CONDOR_IDS variable is correctly set to the "condor" user, which is also registered in NIS...

After a bit of thinking, I'm quite sure the error has nothing to do with my "-oAcceptEnv" modification to the sshd command, so I won't ask about that again until this first problem is solved :)

I just thought of something else: I am using /bin/bash as my shell. I read somewhere in the docs that this script (although it runs under /bin/sh) is only csh-compatible. I tried changing my shell to csh, but that didn't work either.

I finally added the PATH in sshd.sh. It's not a very elegant solution; if you know a better one, please let me know.
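
A slightly tidier variant of this workaround, as a sketch only: rather than prefixing every command with /bin or /usr/bin, append the usual system directories to PATH once, near the top of sshd.sh. The function name ensure_path and the directory list are assumptions for illustration; neither comes from the stock script.

```shell
#!/bin/sh
# Hypothetical helper, not part of the stock sshd.sh.
# Appends each standard system directory to PATH unless it is already there,
# so the Condor bin/sbin entries that are already set keep their priority.
ensure_path() {
    for dir in /bin /usr/bin /sbin /usr/sbin; do
        case ":$PATH:" in
            *":$dir:"*) ;;                       # already present, skip
            *) PATH="${PATH:+$PATH:}$dir" ;;     # append, avoiding a leading ':'
        esac
    done
    export PATH
}

ensure_path
```

Because PATH is exported, anything the script starts afterwards inherits the extended value.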

Nicolas



----------------
On Fri,  2 Jun 2006 07:59:23 -0600
rnayar@xxxxxxxx wrote:

> Nicolas,
> 
> This is why I asked if the user Nobody was actually running the job. There is a
> difference between who submitted the job and what user is actually running it.
> Is your CONDOR_IDS variable actually set to a real user? Are you using NIS?
> Modify the scripts to echo which IDs are being used, and see whether or not the
> job is running as Nobody.
> 
> Danny
> 
> 
> Quoting Nicolas GUIOT <nicolas.guiot@xxxxxxx>:
> 
> > OK, I finally found it, BUT I still need your help.
> > 
> > The thing is: it was still a "command not found" problem, this time for the
> > "grep Server blabla....".
> > 
> > I tried adding the necessary /bin, /usr/bin, etc. to each command in this
> > file, but once that was done, I had the exact same problem with
> > condor_exec.exe, so I need your help: where is (or should be) the $PATH set?
> > It seems it isn't being taken into account. Could it be because I removed
> > the "-oAcceptEnv" from the sshd command?
> > 
> > Plus, to check/debug, I made it print the environment : 
> > 
> > _CONDOR_SCRATCH_DIR=/scratch/condor/execute/dir_19200
> > _CONDOR_REMOTE_SPOOL_DIR=/scratch/condor/spool/cluster49.proc0.subproc0
> > _CONDOR_ANCESTOR_19200=19205:1149254658:1012152896
> > PATH=/ibpc/io/condor/bin:/ibpc/io/condor/bin:/ibpc/io/condor/sbin
> > _CONDOR_ANCESTOR_8054=8055:1148998380:3132411392
> > CONDOR_CONFIG=/ibpc/io/condor/etc/condor_config
> > PWD=/scratch/condor/execute/dir_19200
> > _CONDOR_ANCESTOR_8055=19200:1149254657:3132411480
> > SHLVL=1
> > _CONDOR_NPROCS=2
> > _CONDOR_PROCNO=0
> > _=/usr/bin/env
> > 
> > 
> > So, as you can see, the $PATH contains only the Condor directories (no /bin or /usr/bin)
> > 
> > 
> > Hope this information can help you to help me...
> > Nicolas
> > 
> > 
> > ----------------
> > On Fri, 2 Jun 2006 11:24:57 +0200
> > Nicolas GUIOT <nicolas.guiot@xxxxxxx> wrote:
> > 
> > > Danny,
> > > 
> > > If I type condor_q, this tells me the following : 
> > > 
> > >  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
> > >   20.0   nice-user.guio  6/2  11:20   0+00:00:58 R  0   0.0  lamscript mdrun.sh
> > > 
> > > So, it seems to be ME running the job (if I correctly understood your question)
> > > 
> > > Do you think this could be linked to my current problem? Which is:
> > > 
> > > guiot@seurat:~/divers/MD$ tail -f /ibpc/charon/condor/execute/dir_5360/sshd.out
> > > Disabling protocol version 1. Could not load host key
> > > Server listening on 0.0.0.0 port 4462.
> > > 
> > > guiot@seurat:~/divers/MD$ tail -f /ibpc/charon/condor/execute/dir_5361/sshd.out
> > > Disabling protocol version 1. Could not load host key
> > > Bind to port 4466 on 0.0.0.0 failed: Address already in use.
> > > Cannot bind any address.
> > > 
> > > 
> > > Nicolas
> > > 
> > > 
> > > ----------------
> > > On Thu,  1 Jun 2006 11:02:06 -0600
> > > rnayar@xxxxxxxx wrote:
> > > 
> > > > Nicolas,
> > > > 
> > > > Yeah you are right, it is the condor_ssh script (I wasn't at work when I wrote
> > > > that). Can you verify for me which user is actually running the job? You had
> > > > mentioned that you are using a shared file system and I'm curious, is the user
> > > > "Nobody" actually running your jobs?
> > > > 
> > > > Talk to you soon,
> > > > 
> > > > Danny
> > > > 
> > > > 
> > > > 
> > > > Quoting Nicolas GUIOT <nicolas.guiot@xxxxxxx>:
> > > > 
> > > > > Danny,
> > > > > 
> > > > > No, not really: I can't find any $prog in sshd.sh (but found one condor_ssh,
> > > > > maybe you were talking about this one...). Personally, I only added "/bin/" in
> > > > > front of mkdir and sleep, so that it finds them :)
> > > > > 
> > > > > But this still leaves me with an sshd server that can't start/listen
> > > > > 
> > > > > Nicolas
> > > > > 
> > > > > ----------------
> > > > > On Thu,  1 Jun 2006 09:36:46 -0600
> > > > > rnayar@xxxxxxxx wrote:
> > > > > 
> > > > > > Ugg..
> > > > > > 
> > > > > > I had just written a really long answer to the problem and our crappy
> > > > > > email service has screwed up yet again!!
> > > > > > 
> > > > > > Nicolas, what you are probably referring to is changing "$prog" to
> > > > > > "./$prog" in the sshd.sh script, correct?
> > > > > > 
> > > > > > Danny
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > Quoting Nicolas GUIOT <nicolas.guiot@xxxxxxx>:
> > > > > > 
> > > > > > > Hi,
> > > > > > > 
> > > > > > > Yes, I am using NFS.
> > > > > > > 
> > > > > > > I'm interested in your modified sshd.sh (maybe something in it could help me...)
> > > > > > > 
> > > > > > > By the way, I have already touched it a bit: for example, it couldn't find the
> > > > > > > "mkdir" and "sleep" commands (probably the $PATH isn't set anywhere, or it
> > > > > > > doesn't pick it up where it should), but this is a minor problem that I could
> > > > > > > solve...
> > > > > > > 
> > > > > > > ++
> > > > > > > Nicolas
> > > > > > > 
> > > > > > > ----------------
> > > > > > > On Wed, 31 May 2006 10:12:27 -0600
> > > > > > > rnayar@xxxxxxxx wrote:
> > > > > > > 
> > > > > > > > Nicolas,
> > > > > > > > 
> > > > > > > > Hey buddy, just curious, how is your grid set up? Are you using a shared
> > > > > > > > filesystem? Not too long ago I was running MPI jobs in the parallel universe
> > > > > > > > without the use of a shared filesystem. I do recall seeing some of the things
> > > > > > > > you listed while troubleshooting the problem. Greg Thain and I were able
> > > > > > > > to hack Condor and the sshd to make it work. I can provide the modifications
> > > > > > > > as soon as I can.
> > > > > > > > 
> > > > > > > > Cheers,
> > > > > > > > 
> > > > > > > > Danny Nayar
> > > > > > > > New Mexico State University
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Quoting Nicolas GUIOT <nicolas.guiot@xxxxxxx>:
> > > > > > > > 
> > > > > > > > > Hi all
> > > > > > > > > 
> > > > > > > > > I'm coming back to this issue.
> > > > > > > > > 
> > > > > > > > > In the sshd.sh script I have by default (6.7.18, yeah I know, I plan to
> > > > > > > > > upgrade soon...), this line is already replaced with
> > > > > > > > > 
> > > > > > > > > 	if grep "Server listening" sshd.out > /dev/null 2>&1
> > > > > > > > > 
> > > > > > > > > But I still have a problem, and very strange things are happening:
> > > > > > > > > - First, I had to modify the sshd command line, since I'm on Debian stable,
> > > > > > > > > where sshd is only 3.8.x and doesn't understand "-oAcceptEnv", so I removed
> > > > > > > > > it. Maybe that's the cause of my problem (if so, do you know a way to
> > > > > > > > > work around this?)
> > > > > > > > > 
> > > > > > > > > - Then, when I submit the job, it says it's running (condor_q state is R),
> > > > > > > > > but when I check on the node, I see the following:
> > > > > > > > > 
> > > > > > > > > guiot@seurat:~/divers/MD$ tail -f
> > > > > > > > > /ibpc/charon/condor/execute/dir_28262/sshd.out
> > > > > > > > > Disabling protocol version 1. Could not load host key
> > > > > > > > > Bind to port 4465 on 0.0.0.0 failed: Address already in use.
> > > > > > > > > Cannot bind any address.
> > > > > > > > > 
> > > > > > > > > guiot@seurat:~/divers/MD$ tail -f
> > > > > > > > > /ibpc/charon/condor/execute/dir_28264/sshd.out 
> > > > > > > > > Disabling protocol version 1. Could not load host key 
> > > > > > > > > Server listening on 0.0.0.0 port 4468.
> > > > > > > > > 
> > > > > > > > > So, as you can see, one of the processes seems to be fine and the other not,
> > > > > > > > > but in truth, if I check with "ps ax|grep sshd", I can see none of them running
> > > > > > > > > (or just the one currently being created, which changes constantly)
> > > > > > > > > #ps ax|grep sshd
> > > > > > > > >   758 ?        Ss     0:03 /usr/sbin/sshd
> > > > > > > > > 10819 ?        Ss     0:00 sshd: root@pts/0
> > > > > > > > > 28727 ?        SN     0:00 /usr/sbin/sshd -p4474 -oAuthorizedKeysFile=/scratch/condor/execute/dir_28262/tmp/0.key.pub -h/scratch/condor/execute/dir_28262/tmp/hostkey -De -f/dev/null -oStrictModes=no -oPidFile=/dev/null
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > and if I check again on the process that was fine (tail sshd.out), it keeps
> > > > > > > > > telling me it's fine, but it's listening on a new port !!?!?!
> > > > > > > > > 
> > > > > > > > > So: is this related to the change I had to make (-oAcceptEnv), or is it
> > > > > > > > > something entirely separate? What could I check to solve this?
> > > > > > > > > 
> > > > > > > > > Thanks in advance
> > > > > > > > > Nicolas
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Unfortunately the job starts 'running' but is blocked. For some reason
> > > > > > > > > > it starts some connections, but does not seem to recognize them (and
> > > > > > > > > > then tries again with a new port, again and again). I tried to look at the
> > > > > > > > > > files to find out what might be the reason for this. In
> > > > > > > > > > /usr/local/condor/libexec/sshd.sh there is a line like this:
> > > > > > > > > > 
> > > > > > > > > > 	if grep "^Server listening on 0.0.0.0 port" sshd.out > /dev/null 2>&1
> > > > > > > > > > 
> > > > > > > > > > I replaced this by :
> > > > > > > > > > 
> > > > > > > > > > 	if grep "Server listening on :: port" sshd.out > /dev/null 2>&1
> > > > > > > > > > 
> > > > > > > > > > Not sure at all if it was a typo, but I had this '^' on both
> > > > > > > > > > computers.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > 


----------------------------------------------------
CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
Institut de Biologie Physico-Chimique
13 rue Pierre et Marie Curie
75005 PARIS - FRANCE

Tel : +33 158 41 51 70
Fax : +33 158 41 50 26
----------------------------------------------------