RE: [Condor-users] Negotiator gets stuck
- Date: Fri, 18 Feb 2005 17:01:48 -0000
- From: "Andrey Kaliazin" <A.Kaliazin@xxxxxxxxxxx>
- Subject: RE: [Condor-users] Negotiator gets stuck
I do not doubt the Negotiator's logic in this case; it is perfectly valid.
But I can see that I did not explain the problem I have. Let me try again:
The Negotiator hits a node which it cannot communicate with. OK, leave it
alone until the next cycle and try it again in 5 minutes.
So far so good.
So let's negotiate the next job in the queue! But no, and this is what
bothers me: the Negotiator always quits the cycle immediately after one
failure, as you can see from the log below.
After 5 minutes it starts a new cycle, tries the previously failed node,
and quits again...
Another 5 minutes pass, and then it finally decides to drop the dead donkey
and get on with the rest of the jobs waiting impatiently in the queue...
until it hits another one.
Another 15 minutes later the queue jerks forward and stops again, and so on.
So this is the key point of my problem -
Negotiator quits the cycle immediately after one communication failure.
Is something wrong with my server configuration?
It is now running Condor 6.6.8, but it was the same with 6.7.3 and
previous versions as well.
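
To make the behaviour I expected concrete, here is a minimal Python sketch of a negotiation cycle that ignores an unreachable endpoint for the rest of the cycle and carries on with the next request. It is not Condor's actual code: `run_cycle`, `try_contact`, the job ids, and the node names are all hypothetical and exist only for illustration.

```python
def run_cycle(requests, try_contact):
    """Process (job_id, endpoint) pairs in order.

    An endpoint that fails once is ignored for the rest of this cycle,
    but the cycle itself keeps going with the remaining requests.
    """
    ignored = set()   # endpoints that already failed in this cycle
    matched = []      # job ids whose endpoint answered
    skipped = []      # job ids left for the next cycle
    for job_id, endpoint in requests:
        if endpoint in ignored:
            skipped.append(job_id)      # do not pay the timeout twice
            continue
        if try_contact(endpoint):
            matched.append(job_id)
        else:
            ignored.add(endpoint)       # remember the failure, then move on
            skipped.append(job_id)
    return matched, skipped


if __name__ == "__main__":
    dead = {"node-b"}                   # hypothetical unreachable node
    requests = [
        ("5.14", "node-b"),
        ("5.15", "node-a"),
        ("5.16", "node-b"),
        ("5.17", "node-c"),
    ]
    matched, skipped = run_cycle(requests, lambda ep: ep not in dead)
    print("matched:", matched)
    print("skipped:", skipped)
```

In this sketch one failure costs a single contact attempt per cycle, and every request that does not depend on the dead endpoint is still served in the same cycle.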
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
> Sent: Friday, February 18, 2005 4:29 PM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Negotiator gets stuck
> On Fri, Feb 18, 2005 at 10:51:22AM -0000, Andrey Kaliazin wrote:
> > Dear all,
> > I hope somebody can clear up this situation for me.
> > Execute nodes in our environment die quite often, for various
> > reasons. But Condor is designed to cope with exactly this kind of
> > environment, right?
> > Then why is the Negotiator failing to bypass a single node which it
> > cannot communicate with?
> > Instead it stops the submission process altogether until the node in
> > question is dropped completely from the pool. I can only suppose that
> > my config settings are not quite right somewhere. Is there anything I
> > can change on the server to make the Negotiator disregard nodes it is
> > having difficulty communicating with?
> It is - Condor figures out that it can't talk to it, and then it
> ignores it for the rest of the negotiation cycle. In the stable
> case, the ad for the schedd it can't contact will drop from the
> collector in 15 minutes, so on average it will only try two
> negotiation cycles to contact it, if it's truly gone.
> I think that's probably the right thing to do - if we can't contact
> it in a negotiation cycle, give up for now. 5 minutes later, in the
> next negotiation cycle, try again, in case that earlier failure was
> temporary. If it's still down, ignore it again. In the next
> cycle, the ad will most likely be gone if the machine is really gone,
> and so we won't try and contact it.
> It'd certainly be nice if we did a non-blocking connect to the schedd,
> so we didn't wait a few minutes to figure out we can't connect to it,
> and hopefully someday we will.
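
The non-blocking connect mentioned in the quoted reply can be sketched in Python as below. This is only an illustration of the general technique (non-blocking `connect` followed by `select` with a deadline), assuming a POSIX system and a numeric IPv4 address; `connect_with_timeout` is a hypothetical helper and says nothing about how Condor itself is or will be implemented.

```python
import errno
import select
import socket


def connect_with_timeout(host, port, timeout_s):
    """Return a connected TCP socket, or None if the connect fails
    or does not complete within timeout_s seconds.

    Pass a numeric IP address: resolving a host name would itself block.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    err = sock.connect_ex((host, port))     # returns at once instead of blocking
    if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        sock.close()                        # immediate failure, e.g. refused
        return None
    # The socket becomes writable when the connect has finished,
    # whether it succeeded or failed.
    _, writable, _ = select.select([], [sock], [], timeout_s)
    if not writable:
        sock.close()                        # deadline passed: give up on this peer
        return None
    if sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) != 0:
        sock.close()                        # connect finished with an error
        return None
    sock.setblocking(True)                  # hand back an ordinary blocking socket
    return sock
```

A caller would treat a `None` result as "ignore this peer for now" and move on, so an unreachable machine costs at most `timeout_s` seconds rather than the operating system's own connect timeout.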
> > Negotiator log entries -
> > ...
> > 2/18 10:28:37 Request 00005.00014:
> > 2/18 10:31:46 Can't connect to <220.127.116.11:9554>:0, errno = 110
> > 2/18 10:31:46 Will keep trying for 10 seconds...
> > 2/18 10:31:47 Connect failed for 10 seconds; returning FALSE
> > 2/18 10:31:47 ERROR: SECMAN:2003:TCP connection to <18.104.22.168:9554> failed
> > 2/18 10:31:47 condor_write(): Socket closed when trying to write buffer
> > 2/18 10:31:47 Buf::write(): condor_write() failed
> > 2/18 10:31:47 Could not send PERMISSION
> > 2/18 10:31:47 Error: Ignoring schedd for this cycle
> > 2/18 10:31:47 ---------- Finished Negotiation Cycle ----------
> > Andrey
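
The length of the stall can be read from the timestamps in the log excerpt: 189 seconds pass between the "Request" line and the first "Can't connect" line (errno 110 is ETIMEDOUT on Linux). A small Python sketch for finding such gaps follows; `parse_log` and `stalls` are hypothetical helpers, the line format is assumed from the excerpt only, and the year is supplied by the caller because the log lines do not carry one.

```python
import re
from datetime import datetime

# Matches lines such as "2/18 10:28:37 Request 00005.00014:"
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse_log(text, year=2005):
    """Return a list of (timestamp, message) for lines that carry a timestamp."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %m/%d %H:%M:%S")
            events.append((ts, m.group(2)))
    return events


def stalls(events, threshold_s=60):
    """Return (gap_seconds, message_before, message_after) for each pair of
    consecutive log lines separated by at least threshold_s seconds."""
    found = []
    for (t0, msg0), (t1, msg1) in zip(events, events[1:]):
        gap = (t1 - t0).total_seconds()
        if gap >= threshold_s:
            found.append((gap, msg0, msg1))
    return found


SAMPLE = """\
2/18 10:28:37 Request 00005.00014:
2/18 10:31:46 Can't connect to <220.127.116.11:9554>:0, errno = 110
2/18 10:31:46 Will keep trying for 10 seconds...
2/18 10:31:47 Connect failed for 10 seconds; returning FALSE
"""

if __name__ == "__main__":
    for gap, before, after in stalls(parse_log(SAMPLE)):
        print(f"{gap:.0f}s between {before!r} and {after!r}")
```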
> > _______________________________________________
> > Condor-users mailing list
> > Condor-users@xxxxxxxxxxx
> > https://lists.cs.wisc.edu/mailman/listinfo/condor-users