Re: [Condor-users] Negotiator gets stuck
- Date: Fri, 18 Feb 2005 10:28:40 -0600
- From: Erik Paulson <epaulson@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Negotiator gets stuck
On Fri, Feb 18, 2005 at 10:51:22AM -0000, Andrey Kaliazin wrote:
> Dear all,
> I hope somebody can clear up this situation for me.
> Execute nodes in our environment quite often die for various reasons.
> But Condor is designed to cope with exactly this kind of environment,
> right?
> Then why does the Negotiator fail to bypass a single node that it cannot
> communicate with? Instead it stops the submission process altogether until
> the node in question is dropped completely from the pool. I can only
> suppose that my config settings are not quite right somewhere. Is there
> anything I can change on the server to make the Negotiator disregard nodes
> it has difficulty communicating with?
It is - Condor figures out that it can't talk to it, and then it
ignores it for the rest of the negotiation cycle. In the stable
case, the ad for the schedd it can't contact will drop from the
collector in 15 minutes, so on average it will only try two
negotiation cycles to contact it, if it's truly gone.
I think that's probably the right thing to do - if we can't contact
it in a negotiation cycle, give up for now. 5 minutes later, in the
next negotiation cycle, try again, in case that earlier failure was
temporary. If it's still down, ignore it again. In the next negotiation
cycle, the ad will most likely be gone if the machine is really gone,
and so we won't try and contact it.
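To make the arithmetic above concrete, here is a rough Python sketch. It
uses the 15-minute ad lifetime and the 5-minute negotiation cycle mentioned
above. The function name and the simplified model (cycles at a fixed
interval, an ad that simply expires once it is too old) are assumptions for
illustration, not Condor's actual code.

```python
def failed_contact_cycles(ad_age_at_death, first_cycle_delay,
                          ad_lifetime=900, cycle_interval=300):
    """Count negotiation cycles that still see the dead schedd's ad.

    ad_age_at_death:   seconds since the last ad update when the schedd died
    first_cycle_delay: seconds from the death to the next negotiation cycle
    ad_lifetime:       seconds before the collector drops a stale ad (15 min)
    cycle_interval:    seconds between negotiation cycles (5 min)
    All times are hypothetical inputs to this simplified model.
    """
    remaining = ad_lifetime - ad_age_at_death  # how long the ad outlives the schedd
    count = 0
    t = first_cycle_delay
    while t < remaining:      # the ad is still in the collector at this cycle
        count += 1            # so the negotiator tries, fails, and ignores it
        t += cycle_interval
    return count


# Schedd dies halfway between ad updates and halfway between cycles.
print(failed_contact_cycles(150, 150))  # 2
# Worst case in this model: it dies right after an update, right at a cycle.
print(failed_contact_cycles(0, 0))      # 3
```

Under these assumptions the negotiator wastes two or three cycles on a dead
schedd before its ad disappears, which matches the "two on average" estimate.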
It'd certainly be nice if we did a non-blocking connect to the schedd,
so we didn't wait a few minutes to figure out we can't connect to it,
and hopefully someday we will.
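For anyone curious what a non-blocking connect with a short timeout looks
like in general, here is a minimal Python sketch using the standard socket
and select modules. It is only an illustration of the technique on a POSIX
system. The function name try_connect and the 10-second default are made up
for this example, and none of it is how Condor itself is implemented.

```python
import errno
import select
import socket


def try_connect(host, port, timeout=10.0):
    """Return True if a TCP connection to (host, port) succeeds within
    `timeout` seconds, False otherwise. `host` should be an IP address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)          # connect will return immediately
    try:
        rc = sock.connect_ex((host, port))
        if rc == 0:
            return True              # connected right away
        if rc not in (errno.EINPROGRESS, errno.EWOULDBLOCK):
            return False             # immediate failure
        # Wait until the socket becomes writable or the timeout expires.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            return False             # timed out: give up on this peer for now
        # Writable means the attempt finished; SO_ERROR says how it went.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) == 0
    finally:
        sock.close()
```

The point of the design is that the caller decides how long it is willing
to wait, instead of inheriting the operating system's TCP connect timeout.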
> Negotiator log entries -
> 2/18 10:28:37 Request 00005.00014:
> 2/18 10:31:46 Can't connect to <126.96.36.199:9554>:0, errno = 110
> 2/18 10:31:46 Will keep trying for 10 seconds...
> 2/18 10:31:47 Connect failed for 10 seconds; returning FALSE
> 2/18 10:31:47 ERROR:
> SECMAN:2003:TCP connection to <188.8.131.52:9554> failed
> 2/18 10:31:47 condor_write(): Socket closed when trying to write buffer
> 2/18 10:31:47 Buf::write(): condor_write() failed
> 2/18 10:31:47 Could not send PERMISSION
> 2/18 10:31:47 Error: Ignoring schedd for this cycle
> 2/18 10:31:47 ---------- Finished Negotiation Cycle ----------