
Re: [Condor-users] parallel universe and sshd.sh



Nicolas,

This is why I asked whether the user Nobody was actually running the job. There is a 
difference between who submitted the job and which user is actually running it. 
Is your CONDOR_IDS variable actually set to a real user? Are you using NIS? 
Modify the scripts to echo which IDs are being used, and see whether or not the 
job is running as Nobody.
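
As a rough sketch of that kind of check, something like the following could be pasted near the top of sshd.sh (or the job script). The function name describe_ids is made up for illustration, and CONDOR_IDS may well not be visible in the job's environment at all, since it is often set only in the Condor configuration; the snippet just reports whatever it finds.

```shell
#!/bin/sh
# Hypothetical debugging helper (not part of Condor's sshd.sh):
# report which account the script is actually running as.
describe_ids() {
    # id -u / id -un print the effective numeric UID and user name.
    echo "effective uid=$(id -u) user=$(id -un) gid=$(id -g) group=$(id -gn)"
    # CONDOR_IDS is not necessarily exported to the job; report either case.
    echo "CONDOR_IDS=${CONDOR_IDS:-<not set in environment>}"
    # Printing PATH at the same time helps with "command not found" errors.
    echo "PATH=${PATH:-<empty>}"
}

describe_ids
```

If the first line reports the nobody account rather than your own login, the job is not running under the submitting user, whatever condor_q shows as the owner.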

Danny


Quoting Nicolas GUIOT <nicolas.guiot@xxxxxxx>:

> Ok, I finally found it BUT I still need your help
> 
> The thing is: it was still a "command not found" problem, this time for the
> "grep Server blabla....".
> 
> I tried adding the necessary /bin and /usr/bin prefixes each time in this
> file, but once that was done, I had the exact same problem with
> condor_exec.exe, so I need your help: where is (or should be) the $PATH set?
> It seems it is not being taken into account. Can it be because I removed
> the "-oAcceptEnv" in the sshd command?
> 
> Plus, to check/debug, I made it print the environment:
> 
> _CONDOR_SCRATCH_DIR=/scratch/condor/execute/dir_19200
> _CONDOR_REMOTE_SPOOL_DIR=/scratch/condor/spool/cluster49.proc0.subproc0
> _CONDOR_ANCESTOR_19200=19205:1149254658:1012152896
> PATH=/ibpc/io/condor/bin:/ibpc/io/condor/bin:/ibpc/io/condor/sbin
> _CONDOR_ANCESTOR_8054=8055:1148998380:3132411392
> CONDOR_CONFIG=/ibpc/io/condor/etc/condor_config
> PWD=/scratch/condor/execute/dir_19200
> _CONDOR_ANCESTOR_8055=19200:1149254657:3132411480
> SHLVL=1
> _CONDOR_NPROCS=2
> _CONDOR_PROCNO=0
> _=/usr/bin/env
> 
> 
> So, as you can see, the $PATH contains only the Condor directories (no /bin or /usr/bin)
> 
> 
> Hope this information can help you to help me...
> Nicolas
> 
> 
> ----------------
> On Fri, 2 Jun 2006 11:24:57 +0200
> Nicolas GUIOT <nicolas.guiot@xxxxxxx> wrote:
> 
> > Danny,
> > 
> > If I type condor_q, this tells me the following : 
> > 
> >  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
> >   20.0   nice-user.guio  6/2  11:20   0+00:00:58 R  0   0.0  lamscript mdrun.sh
> > 
> > So, it seems to be ME running the job (if I understood your question correctly)
> > 
> > Do you think this could be related to my actual problem, which is:
> > 
> > guiot@seurat:~/divers/MD$ tail -f /ibpc/charon/condor/execute/dir_5360/sshd.out
> > Disabling protocol version 1. Could not load host key
> > Server listening on 0.0.0.0 port 4462.
> > 
> > guiot@seurat:~/divers/MD$ tail -f /ibpc/charon/condor/execute/dir_5361/sshd.out
> > Disabling protocol version 1. Could not load host key
> > Bind to port 4466 on 0.0.0.0 failed: Address already in use.
> > Cannot bind any address.
> > 
> > 
> > Nicolas
> > 
> > 
> > ----------------
> > On Thu,  1 Jun 2006 11:02:06 -0600
> > rnayar@xxxxxxxx wrote:
> > 
> > > Nicolas,
> > > 
> > > Yeah, you are right, it is the condor_ssh script (I wasn't at work when I
> > > wrote that). Can you verify for me which user is actually running the job?
> > > You had mentioned that you are using a shared file system, and I'm curious:
> > > is the user "Nobody" actually running your jobs?
> > > 
> > > Talk to you soon,
> > > 
> > > Danny
> > > 
> > > 
> > > 
> > > Quoting Nicolas GUIOT <nicolas.guiot@xxxxxxx>:
> > > 
> > > > Danny,
> > > > 
> > > > No, not really: I can't find any $prog in sshd.sh (but I found one in
> > > > condor_ssh, maybe that is the one you were talking about...). Personally,
> > > > I only added "/bin/" in front of mkdir and sleep, so that it finds them :)
> > > > 
> > > > But this still leaves me with an sshd server that can't start/listen
> > > > 
> > > > Nicolas
> > > > 
> > > > ----------------
> > > > On Thu,  1 Jun 2006 09:36:46 -0600
> > > > rnayar@xxxxxxxx wrote:
> > > > 
> > > > > Ugg..
> > > > > 
> > > > > I had just written a really long answer to the problem, and our
> > > > > crappy email service has screwed up yet again!!
> > > > > 
> > > > > Nicolas, what you are probably referring to is changing "$prog" to
> > > > > "./$prog" in the sshd.sh script, correct?
> > > > > 
> > > > > Danny
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > Quoting Nicolas GUIOT <nicolas.guiot@xxxxxxx>:
> > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > Yes, I am using NFS.
> > > > > > 
> > > > > > I'm interested in your modified sshd.sh (maybe something in it could
> > > > > > help me...)
> > > > > > 
> > > > > > By the way, I already touched it a bit: for example, it couldn't find
> > > > > > the "mkdir" and "sleep" commands (probably the $PATH isn't set
> > > > > > anywhere, or it isn't picked up where it should be), but this is a
> > > > > > minor problem that I could solve...
> > > > > > 
> > > > > > ++
> > > > > > Nicolas
> > > > > > 
> > > > > > ----------------
> > > > > > On Wed, 31 May 2006 10:12:27 -0600
> > > > > > rnayar@xxxxxxxx wrote:
> > > > > > 
> > > > > > > Nicolas,
> > > > > > > 
> > > > > > > Hey buddy, just curious, how is your grid set up? Are you using a
> > > > > > > shared filesystem? Not too long ago I was running MPI jobs in the
> > > > > > > parallel universe without the use of a shared filesystem. I do
> > > > > > > recall seeing some of the things you listed while troubleshooting
> > > > > > > the problem. Greg Thain and I were able to haxor Condor and the
> > > > > > > sshd to make it work. I can provide the modifications as soon as
> > > > > > > I can.
> > > > > > > 
> > > > > > > Cheers,
> > > > > > > 
> > > > > > > Danny Nayar
> > > > > > > New Mexico State University
> > > > > > > 
> > > > > > > 
> > > > > > > Quoting Nicolas GUIOT <nicolas.guiot@xxxxxxx>:
> > > > > > > 
> > > > > > > > Hi all
> > > > > > > > 
> > > > > > > > I'm coming back on this issue.
> > > > > > > > 
> > > > > > > > In the sshd.sh script I have by default (6.7.18, yeah I know, I
> > > > > > > > plan to upgrade soon...), this line is already replaced with
> > > > > > > > 
> > > > > > > > 	if grep "Server listening" sshd.out > /dev/null 2>&1
> > > > > > > > 
> > > > > > > > But I still have a problem, and I see very strange things:
> > > > > > > > - First, I had to modify the sshd command line, since I'm on
> > > > > > > > Debian stable, where sshd is only 3.8.x and doesn't understand
> > > > > > > > "-oAcceptEnv", so I removed it. Maybe that is the cause of my
> > > > > > > > problem (if so, do you know a way to work around this?)
> > > > > > > > 
> > > > > > > > - Then, when I submit the job, it says it's running (condor_q
> > > > > > > > state is R), but when I check on the node, I see the following:
> > > > > > > > 
> > > > > > > > guiot@seurat:~/divers/MD$ tail -f /ibpc/charon/condor/execute/dir_28262/sshd.out
> > > > > > > > Disabling protocol version 1. Could not load host key
> > > > > > > > Bind to port 4465 on 0.0.0.0 failed: Address already in use.
> > > > > > > > Cannot bind any address.
> > > > > > > > 
> > > > > > > > guiot@seurat:~/divers/MD$ tail -f /ibpc/charon/condor/execute/dir_28264/sshd.out
> > > > > > > > Disabling protocol version 1. Could not load host key 
> > > > > > > > Server listening on 0.0.0.0 port 4468.
> > > > > > > > 
> > > > > > > > So, as you can see, one of the processes seems to be fine and
> > > > > > > > the other does not, but in truth, if I run "ps ax|grep sshd", I
> > > > > > > > can see neither of them running (or just the one currently being
> > > > > > > > created, which changes constantly)
> > > > > > > > 
> > > > > > > > #ps ax|grep sshd
> > > > > > > >   758 ?        Ss     0:03 /usr/sbin/sshd
> > > > > > > > 10819 ?        Ss     0:00 sshd: root@pts/0
> > > > > > > > 28727 ?        SN     0:00 /usr/sbin/sshd -p4474 -oAuthorizedKeysFile=/scratch/condor/execute/dir_28262/tmp/0.key.pub -h/scratch/condor/execute/dir_28262/tmp/hostkey -De -f/dev/null -oStrictModes=no -oPidFile=/dev/null
> > > > > > > > 
> > > > > > > > 
> > > > > > > > and if I check again on the process which was fine (tail
> > > > > > > > sshd.out), it keeps telling me it's fine, but it's listening on
> > > > > > > > a new port!
> > > > > > > > 
> > > > > > > > So: is this related to the change I had to make (removing
> > > > > > > > -oAcceptEnv), or is it something entirely separate? What could I
> > > > > > > > check to solve this?
> > > > > > > > 
> > > > > > > > Thanks in advance
> > > > > > > > Nicolas
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Unfortunately the job starts 'running' but is blocked. For some
> > > > > > > > > reason it starts some connections, but does not seem to
> > > > > > > > > recognize them (and then tries with a new port, again and
> > > > > > > > > again). I tried to look at the files to find out what might be
> > > > > > > > > the reason for this. In /usr/local/condor/libexec/sshd.sh there
> > > > > > > > > is a line like this:
> > > > > > > > > 
> > > > > > > > > 	if grep "^Server listening on 0.0.0.0 port" sshd.out > /dev/null 2>&1
> > > > > > > > > 
> > > > > > > > > I replaced this by:
> > > > > > > > > 
> > > > > > > > > 	if grep "Server listening on :: port" sshd.out > /dev/null 2>&1
> > > > > > > > > 
> > > > > > > > > Not sure at all if there was a typo, but I had the '^' on the
> > > > > > > > > two computers.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > 
> > > > > > ----------
> > > > > > 
> > > > > > ----------------------------------------------------
> > > > > > CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
> > > > > > Institut de Biologie Physico-Chimique
> > > > > > 13 rue Pierre et Marie Curie
> > > > > > 75005 PARIS - FRANCE
> > > > > > 
> > > > > > Tel : +33 158 41 51 70
> > > > > > Fax : +33 158 41 50 26
> > > > > > ----------------------------------------------------
> > > > > > _______________________________________________
> > > > > > Condor-users mailing list
> > > > > > Condor-users@xxxxxxxxxxx
> > > > > > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> > > > > > 