
Re: [Condor-users] Daemon problems




condor-users-bounces@xxxxxxxxxxx wrote on 06/16/2005 08:32:47 PM:

> On Thu June 16 2005 5:13 am, Alexandre Badez wrote:
> > Good Morning !
> Hello,
>
> > I'm running a little test cluster of 6 machines, with redhat 3. They are
> > named node1 to node6 (ip @ 10.2.4.11 to 10.2.4.16), and my domain name is
> > *.mop.ibm.com
> > I've set up the 6 machines with the RPM available on the download page
> > (Condor 6.6.9).
> > My central manager is node1, all others are execution hosts.
> >
> > My problem seems to be node1, where there is no negotiator:
> >
> > [root@node1 root]# condor_master
> > [root@node1 root]# ps ax | grep condor
> >  5137 ?        S      0:00 condor_master
> >  5138 ?        S      0:00 condor_collector -f
> >  5139 ?        R      0:03 condor_startd -f
> >  5142 ?        S      0:00 condor_schedd -f
> >  5149 pts/0    S      0:00 grep condor
> > [root@node1 root]#
>
> I don't know much about how our RPMs configure Condor, but I can see that
> something is wrong here...  Your central manager (node1) should be running
> both the collector and the negotiator.  Look at the DAEMON_LIST setting in
> the condor_config (or condor_config.local), and make sure that both COLLECTOR
> and NEGOTIATOR are in the list.

COLLECTOR and NEGOTIATOR were both in the list.
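
For reference, a central-manager setting of the kind Nick describes might look like the fragment below. This is only a sketch: the exact daemon list, and whether it lives in condor_config or condor_config.local, are assumptions and not a copy of node1's actual configuration.

```
# Sketch only: central manager that also runs and submits jobs (test setup).
# COLLECTOR and NEGOTIATOR are what make this host the central manager;
# STARTD lets it execute jobs and SCHEDD lets it accept submissions.
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, STARTD, SCHEDD
```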

>
> Also, if you don't want to be running jobs on this machine, remove STARTD from
> the list.  Similarly, if you aren't going to be submitting jobs from this
> host, remove SCHEDD from the list.

Thanks for this information, but actually this cluster is just for running some tests, not for real use.

>
> > Moreover there is a negotiator on each execution node:
> >
> > [root@node2 root]# condor_master
> > [root@node2 root]# ps ax | grep condor
> > 29704 ?        S      0:00 condor_master
> > 29705 ?        S      0:00 condor_collector -f
> > 29706 ?        S      0:00 condor_negotiator -f
> > 29707 ?        S      0:06 condor_startd -f
> > 29708 ?        S      0:00 condor_schedd -f
> > 29717 pts/0    R      0:00 grep condor
> > [root@node2 root]#
>
> Again, edit your condor_config on the execution node(s), and remove COLLECTOR
> and NEGOTIATOR from the DAEMON_LIST.

My mistake...
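
Following Nick's advice, the setting on an execution node would look roughly like this. Again a sketch under assumptions: the list below is illustrative, and SCHEDD only belongs in it if jobs will be submitted from that host.

```
# Sketch only: execute-only node, no COLLECTOR or NEGOTIATOR.
DAEMON_LIST = MASTER, STARTD

# Alternative if jobs will also be submitted from this host:
# DAEMON_LIST = MASTER, STARTD, SCHEDD
```

As Nick notes in the quoted text that follows, Condor has to be restarted on the node before a changed DAEMON_LIST takes effect.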

>
> As above, I'll note that you're running the schedd here, which allows you to
> submit jobs from this host.  If this is not what you intended, then remove
> SCHEDD from the list.
>
> You'll need to restart Condor on the affected nodes for these changes to take
> effect.  "condor_restart -master node1", or "/etc/init.d/condor restart" (or
> similar).
>
> > Is it normal? After re-reading the installation manual, it doesn't seem
> > so...
>
> Nope.  See above.  I don't know _why_ they're set as they are, but it's
> obviously wrong.
>
> > I can also send the config and config local files if you need them.
>
> Try the above first -- it'll probably solve the problems that you're seeing.
> If not, we can pursue it further.
>
> > Thanks for your help.
>
> Glad to help!
>
> -Nick

Thanks Nick, but actually my node1 does not want to run the negotiator, and I don't know why. The master's log file only says that the negotiator failed to execute and will be retried later...

By contrast, there are no problems on my other nodes. So I now use node2 as the central manager, and it seems to work well. But I wonder why I can't run the negotiator on node1. My nodes are almost exactly the same (same hardware, same OS, same configuration); the only difference is that node1 shares a folder over NFS with the other nodes.
Maybe a bug?

I'm searching for more information about this.


Cordialement / Best Regards
--
Alexandre Badez