
RE: [Condor-users] negotiating with schedds when a client has FW



Thanks Erik,

Your detailed explanation does shed light on this mystery.
Unfortunately (or fortunately, for the users here) some recent changes to our
network infrastructure removed a lot of problems and reduced these failures
to practically nil, so it is difficult to reproduce the error right now to
verify your cure. I will keep an eye on it and report back to this forum if
the problem persists.

cheers,

Andrey

PS. 
> To defend Nick a bit here, few of us on the Condor Team believed you :)

It is a bit disappointing to hear that. :-(

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
> Sent: Friday, June 24, 2005 6:39 PM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] negotiating with schedds when a client has FW
> 
> On Fri, Jun 17, 2005 at 03:59:56PM +0100, Andrey Kaliazin wrote:
> > Hi Thomas,
> > 
> > You just ran into the same problem I was hit by back in February (the
> > subject was "Negotiator gets stuck"). Unfortunately I did not get a
> > satisfactory response from the developers. The best proposal (from
> > Chris Mellen) was to use the macro
> > 
> > NEGOTIATE_ALL_JOBS_IN_CLUSTER = True
> > 
> > in the condor_config file on the machine where the SCHEDD is running.
> > It is a very useful macro indeed, but not in this particular case.
> > 
> > It seems I have failed to persuade Nick LeRoy that this problem has
> > nothing to do with the Negotiator <-> Schedd talks, but rather with
> > the Negotiator <-> Startd part of the negotiation process.
> 
> To defend Nick a bit here, few of us on the Condor Team believed you :)
> 
> It's certainly not designed to happen that way, and the code says it
> can't, but we understand how it happens.
> 
> > The schedd is fine here: it provides the list of jobs to run and just
> > waits patiently while the negotiator dispatches them. If the start
> > daemons respond properly, everything is fine. But if one of the
> > compute nodes at the top of the matched list fails for various reasons
> > (mainly networking problems in our case), the negotiator does not just
> > dismiss it and go to the next best node, but halts the whole cycle.
> 
> Well, it doesn't halt the whole cycle, but it drops the schedd for that
> cycle. (And if you've only got one schedd, that effectively ends the
> whole cycle.)
> 
> The problem is a confluence of timeouts. The message to the startd,
> telling it that it's been matched, is sent as a UDP packet and isn't
> supposed to block (it's not integral to the matchmaking protocol that
> the startd receive this message from the negotiator). However, if it's
> the first time the negotiator has sent a UDP packet to that startd, it
> first establishes a TCP connection to the startd to create a security
> session - and that can block. With the firewall there, it can be 10
> seconds before the TCP connect fails and we get back to the negotiator
> with an error - which means we drop that startd from the list of things
> we're considering for this cycle and go on to the next best machine to
> make the match, like we've always done and like everyone expects us to
> do.
> 
> HOWEVER - back on the ranch at the schedd, no one's heard from the
> negotiator in a while (because it's been busy trying to connect to
> blocked startds). It turns out that we ship by default a config file
> that says "never wait more than 20 seconds for the negotiator to tell
> you something", so after 20 seconds of not hearing from the negotiator,
> the schedd closes the connection. The negotiator, meanwhile, is making
> another match for the schedd, and once it finds one it goes to tell the
> schedd - and discovers that the socket is closed, so it prints out
> "Error: Ignoring schedd for this cycle".
> 
> The workaround is to increase your NEGOTIATOR_TIMEOUT setting on the
> submit machines. Just to be safe, give it 45 or 60 seconds. Don't mess
> with it on the central manager.
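> 
> For example, in the condor_config on each submit machine (60 is just a
> safe guess, not a tuned value):
> 
>     NEGOTIATOR_TIMEOUT = 60
> 
> followed by a condor_reconfig on that machine so the schedd picks up
> the change.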
> 
> -Erik
> 
> > And a couple of minutes later, in the next cycle, the story repeats
> > itself, because the faulty node is still at the top of the list.
> > 
> > regards,
> > 
> > Andrey
> > 
> > 
> > > -----Original Message-----
> > > From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Thomas Lisson
> > > Sent: Friday, June 17, 2005 3:11 PM
> > > To: Condor-Users Mail List
> > > Subject: [Condor-users] negotiating with schedds when a client has FW
> > > 
> > > Hello,
> > > 
> > > $CondorVersion: 6.7.6 Mar 15 2005 $
> > > $CondorPlatform: I386-LINUX_RH9 $
> > > 
> > > I just wondered why my machines weren't being claimed even though
> > > they were unclaimed and met all the requirements.
> > > 
> > >                           Total    Owner  Claimed  Unclaimed  Matched  Preempting
> > > 
> > >           IA64/LINUX         24       24        0          0        0           0
> > >          INTEL/LINUX         60        6        1         53        0           0
> > >        INTEL/WINNT50          2        0        0          2        0           0
> > >        INTEL/WINNT51        163        0        2        161        0           0
> > >         x86_64/LINUX          1        1        0          0        0           0
> > > 
> > >                Total        250       31        3        216        0           0
> > > 
> > > 
> > > 6907.002:  Run analysis summary.  Of 250 machines,
> > >      25 are rejected by your job's requirements
> > >       6 reject your job because of their own requirements
> > >       3 match but are serving users with a better priority in the pool
> > >     216 match but reject the job for unknown reasons
> > >       0 match but will not currently preempt their existing job
> > >       0 are available to run your job
> > > 
> > > [...]
> > > 
> > > 6907.019:  Run analysis summary.  Of 250 machines,
> > >      25 are rejected by your job's requirements
> > >       6 reject your job because of their own requirements
> > >       3 match but are serving users with a better priority in the pool
> > >     216 match but reject the job for unknown reasons
> > >       0 match but will not currently preempt their existing job
> > >       0 are available to run your job
> > > 
> > > I took a look at the Negotiator log:
> > > 6/16 13:54:33 ---------- Started Negotiation Cycle ----------
> > > 6/16 13:54:33 Phase 1:  Obtaining ads from collector ...
> > > 6/16 13:54:33   Getting all public ads ...
> > > 6/16 13:54:33   Sorting 366 ads ...
> > > 6/16 13:54:33   Getting startd private ads ...
> > > 6/16 13:54:33 Got ads: 366 public and 250 private
> > > 6/16 13:54:33 Public ads include 1 submitter, 250 startd
> > > 6/16 13:54:33 Phase 2:  Performing accounting ...
> > > 6/16 13:54:33 Phase 3:  Sorting submitter ads by priority ...
> > > 6/16 13:54:33 Phase 4.1:  Negotiating with schedds ...
> > > 6/16 13:54:33   Negotiating with nobody@*** at <***.130.4.77:9601>
> > > 6/16 13:54:33     Request 06907.00000:
> > > 6/16 13:54:33       Matched 6907.0 nobody@*** <***.130.4.77:9601> preempting none <***.130.71.149:9620>
> > > 6/16 13:54:33       Successfully matched with vm1@pc49.***
> > > 6/16 13:54:33     Request 06907.00001:
> > > 6/16 13:54:33       Matched 6907.1 nobody@*** <***.130.4.77:9601> preempting none <***.130.71.149:9620>
> > > 6/16 13:54:33       Successfully matched with vm2@pc49.***
> > > 6/16 13:54:33     Request 06907.00002:
> > > 6/16 13:57:42 Can't connect to <***.130.71.139:10066>:0, errno = 110
> > > 6/16 13:57:42 Will keep trying for 10 seconds...
> > > 6/16 13:57:43 Connect failed for 10 seconds; returning FALSE
> > > 6/16 13:57:43 ERROR: SECMAN:2003:TCP connection to <***.130.71.139:10066> failed
> > > 6/16 13:57:43 condor_write(): Socket closed when trying to write buffer
> > > 6/16 13:57:43 Buf::write(): condor_write() failed
> > > 6/16 13:57:43       Could not send PERMISSION
> > > 6/16 13:57:43   Error: Ignoring schedd for this cycle
> > > 6/16 13:57:43 ---------- Finished Negotiation Cycle ----------
> > > 
> > > I checked ***.130.71.139 and noticed that the machine had a
> > > dysfunctional network service - all requests were blocked, although
> > > the machine (Win XP) told me the firewall was off.
> > > OK, let's assume ***.130.71.139 blocks all incoming traffic - but
> > > why aren't all the other jobs (6907.002-6907.019) serviced in the
> > > same cycle? This job (6907) finished after a while, but other
> > > entries in the NegotiatorLog and MatchLog for that job weren't
> > > complete. Some processes of that cluster were serviced but not
> > > logged - maybe a bug.
> > > 
> > > My jobs have rank = kflops in the submit files. The machine
> > > ***.130.71.139 is one of the fastest (4th), so Condor tried to claim
> > > that machine first in every negotiation cycle, because the 3 fastest
> > > machines were already claimed. But that machine blocked all traffic,
> > > so Condor stopped matchmaking and didn't look at the next free
> > > machine. So my whole cluster was serviced only by my 3 fastest
> > > machines - out of a pool with 216 other machines that matched and
> > > had nothing to do. That took a long time ;)
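> > > 
> > > For reference, a minimal submit file with that rank expression (the
> > > executable name is a placeholder):
> > > 
> > >     universe   = vanilla
> > >     executable = my_job
> > >     rank       = KFlops
> > >     queue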
> > > 
> > > Suggestion: if Condor can't connect to a machine, it should claim
> > > the next best free machine for the job instead of ending the cycle.
> > > Otherwise network problems can have big negative effects on the
> > > whole Condor pool.
> > > 
> > > regards
> > > Thomas Lisson
> > > NRW-Grid
> > > 
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users