
Re: [Condor-users] jobs stuck in queue



Hello:

On Mon, 2011-08-22 at 15:07 -0300, Fabricio Cannini wrote:
> > > Any tips to what may (not) be going on are very, very, veeeeery welcome.
> > 
> > It doesn't look like you defined DedicatedScheduler on your execute
> > nodes. Likely needs to look like:
> > 
> > DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx"
> > 
> > Without this attribute, your scheduler will not match parallel jobs with
> > dedicated execute nodes.
> > 
> > Take a look at
> > http://www.cs.wisc.edu/condor/manual/v7.6/3_13Setting_Up.html#SECTION0041310100000000000000
> > for more information.
> > 
> > Best of luck,
> > DJH
> 
> Hi.
> 
> I've tried that, but unfortunately it didn't solve the problem. Worse, now I
> can't see the pool!

Well, you are going to have to define DedicatedScheduler on your execute
nodes in order to match in the parallel universe (there's no way around
it that I know of).
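
For reference, here is a rough sketch of what the execute-node side might
look like, based on the dedicated scheduling section of the manual. The
hostname full.host.name is a placeholder I made up, not your real
scheduler host, so substitute the full name of the machine running your
dedicated schedd. Treat this as a starting point rather than a complete
policy:

```
# Placeholder hostname: use the full hostname of your dedicated scheduler.
DedicatedScheduler = "DedicatedScheduler@full.host.name"

# Advertise the attribute in the machine ClassAd so the scheduler can see it.
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

# Prefer jobs that come from the dedicated scheduler.
RANK = Scheduler =?= $(DedicatedScheduler)
```

The string has to match the name your scheduler actually advertises, so a
typo or a short hostname here is enough to prevent any match.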

As for the pool problems, I would start with your security settings.
I would never recommend leaving security wide open on production
systems, but until you get everything up and running I would set
ALLOW_READ = *
ALLOW_WRITE = *
and leave ALLOW_NEGOTIATOR, ALLOW_DAEMON, etc. at the values in the
standard UW config. You can begin to scale back access to these
subsystems once things work appropriately (also making sure that you
have authentication turned on).

Also, is it possible that the dedicated scheduler machine has two
network interfaces? You can use condor_status -schedd to confirm that
the hostname used in your ALLOW_ and DedicatedScheduler configuration
settings matches the name the scheduler advertises (if the hostname is
incorrect, specify NETWORK_INTERFACE). More information can be found in
the manual:
http://www.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#SECTION00436000000000000000
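
If it does turn out to be a multi-homed machine, the setting might look
something like the following. The address is a made-up documentation
address, not something taken from your setup, so replace it with the IP
of the interface the rest of the pool uses to reach the scheduler:

```
# Hypothetical address: use the IP of the interface facing the pool.
NETWORK_INTERFACE = 192.0.2.10
```

The daemons on that machine need to be restarted before they bind to the
new interface.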

Best of luck,
DJH