
Re: [Condor-users] negotiating with schedds when a client has FW



On Fri, Jun 17, 2005 at 03:59:56PM +0100, Andrey Kaliazin wrote:
> Hi Thomas,
> 
> You just ran into the same problem that hit me back in February (the subject
> was "Negotiator gets stuck").
> Unfortunately I did not get a satisfactory response from the developers. The
> best proposal (from Chris Mellen) was to use the macro
>  
> NEGOTIATE_ALL_JOBS_IN_CLUSTER = True
> 
> in the condor_config file on the machine where the SCHEDD is running.
> This is a very useful macro indeed, but not in this particular case.
> 
> It seems I have failed to persuade Nick LeRoy that this problem has nothing
> to do with the Negotiator <-> Schedd talks, but rather with the
> Negotiator <-> Startd part of the negotiation process.

To defend Nick a bit here, few of us on the Condor Team believed you :)

It's certainly not designed to happen that way, and the code says it can't,
but we understand how it happens. 

> The Schedd is fine here: it provides the list of jobs to run and just waits
> patiently while the Negotiator dispatches them. If the Start daemons respond
> properly, everything is fine.
> But if one of the compute nodes at the top of the matched list fails for
> various reasons (mainly networking problems in our case), then the Negotiator
> does not just dismiss it and move on to the next best node - it halts the
> whole cycle.

Well, it doesn't halt the whole cycle, but it drops the schedd for that
cycle. (And if you've only got one schedd, that effectively ends the whole
cycle)

The problem is a confluence of timeouts. The message to the startd, telling
it that it's been matched, is sent as a UDP packet and isn't supposed to
block (it's not integral to the matchmaking protocol that the startd
receive this message from the negotiator). However, if it's the first time
the negotiator has sent a UDP packet to that startd, it first establishes a
TCP connection to the startd to create a security session - and that can
block. With the firewall there, it can be 10 seconds before the TCP connect
fails and we get back to the negotiator with an error - which means we drop
that startd from the list of things we're considering for this cycle and go
on to the next best machine to make the match, like we've always done and
like everyone expects us to do.

HOWEVER - back at the ranch, the schedd hasn't heard from the negotiator in
a while (because the negotiator has been busy trying to connect to blocked
startds). It turns out that the config file we ship by default says "never
wait more than 20 seconds for the negotiator to tell you something", so
after 20 seconds of not hearing from the negotiator, the schedd closes the
connection. The negotiator, meanwhile, is making another match for the
schedd, and once it finds one it goes to tell the schedd - only to discover
that the socket is closed, at which point it prints "Error: Ignoring schedd
for this cycle".

The workaround is to increase your NEGOTIATOR_TIMEOUT setting on 
submit machines. Just to be safe, give it 45 or 60 seconds. Don't
mess with it on the central manager. 
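For example, something along these lines in the local condor_config on each
submit machine (a minimal sketch - the exact value is a judgment call, just
make it comfortably larger than the connect delays you see in the
NegotiatorLog):

  # How long the schedd waits to hear from the negotiator during a
  # negotiation cycle before it gives up and closes the connection
  # (seconds). The shipped default here was 20.
  NEGOTIATOR_TIMEOUT = 60

A condor_reconfig on the submit machine should be enough to pick up the
change.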

-Erik

> And a couple of minutes later, in the next cycle, the story repeats itself,
> because this faulty node is still at the top of the list.
> 
> regards,
> 
> Andrey
> 
> 
> > -----Original Message-----
> > From: condor-users-bounces@xxxxxxxxxxx 
> > [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Thomas Lisson
> > Sent: Friday, June 17, 2005 3:11 PM
> > To: Condor-Users Mail List
> > Subject: [Condor-users] negotiating with schedds when a client has FW
> > 
> > Hello,
> > 
> > $CondorVersion: 6.7.6 Mar 15 2005 $
> > $CondorPlatform: I386-LINUX_RH9 $
> > 
> > I just wondered why my machines weren't claimed even though they were
> > unclaimed and met all the requirements.
> > 
> >                    Total  Owner  Claimed  Unclaimed  Matched  Preempting
> > 
> >      IA64/LINUX       24     24        0          0        0           0
> >     INTEL/LINUX       60      6        1         53        0           0
> >   INTEL/WINNT50        2      0        0          2        0           0
> >   INTEL/WINNT51      163      0        2        161        0           0
> >    x86_64/LINUX        1      1        0          0        0           0
> > 
> >           Total      250     31        3        216        0           0
> > 
> > 
> > 6907.002:  Run analysis summary.  Of 250 machines,
> >      25 are rejected by your job's requirements
> >       6 reject your job because of their own requirements
> >       3 match but are serving users with a better priority in the pool
> >     216 match but reject the job for unknown reasons
> >       0 match but will not currently preempt their existing job
> >       0 are available to run your job
> > 
> > [...]
> > 
> > 6907.019:  Run analysis summary.  Of 250 machines,
> >      25 are rejected by your job's requirements
> >       6 reject your job because of their own requirements
> >       3 match but are serving users with a better priority in the pool
> >     216 match but reject the job for unknown reasons
> >       0 match but will not currently preempt their existing job
> >       0 are available to run your job
> > 
> > I took a look at the Negotiator log:
> > 6/16 13:54:33 ---------- Started Negotiation Cycle ----------
> > 6/16 13:54:33 Phase 1:  Obtaining ads from collector ...
> > 6/16 13:54:33   Getting all public ads ...
> > 6/16 13:54:33   Sorting 366 ads ...
> > 6/16 13:54:33   Getting startd private ads ...
> > 6/16 13:54:33 Got ads: 366 public and 250 private
> > 6/16 13:54:33 Public ads include 1 submitter, 250 startd
> > 6/16 13:54:33 Phase 2:  Performing accounting ...
> > 6/16 13:54:33 Phase 3:  Sorting submitter ads by priority ...
> > 6/16 13:54:33 Phase 4.1:  Negotiating with schedds ...
> > 6/16 13:54:33   Negotiating with nobody@*** at <***.130.4.77:9601>
> > 6/16 13:54:33     Request 06907.00000:
> > 6/16 13:54:33       Matched 6907.0 nobody@*** <***.130.4.77:9601> preempting none <***.130.71.149:9620>
> > 6/16 13:54:33       Successfully matched with vm1@pc49.***
> > 6/16 13:54:33     Request 06907.00001:
> > 6/16 13:54:33       Matched 6907.1 nobody@*** <***.130.4.77:9601> preempting none <***.130.71.149:9620>
> > 6/16 13:54:33       Successfully matched with vm2@pc49.***
> > 6/16 13:54:33     Request 06907.00002:
> > 6/16 13:57:42 Can't connect to <***.130.71.139:10066>:0, errno = 110
> > 6/16 13:57:42 Will keep trying for 10 seconds...
> > 6/16 13:57:43 Connect failed for 10 seconds; returning FALSE
> > 6/16 13:57:43 ERROR: SECMAN:2003:TCP connection to <***.130.71.139:10066> failed
> > 6/16 13:57:43 condor_write(): Socket closed when trying to write buffer
> > 6/16 13:57:43 Buf::write(): condor_write() failed
> > 6/16 13:57:43       Could not send PERMISSION
> > 6/16 13:57:43   Error: Ignoring schedd for this cycle
> > 6/16 13:57:43 ---------- Finished Negotiation Cycle ----------
> > 
> > I checked ***.130.71.139 and noticed that the machine had a dysfunctional
> > network service - all requests were blocked, although the machine (Win XP)
> > told me the firewall was off.
> > OK, let's assume ***.130.71.139 blocks all incoming traffic - but why
> > aren't all the other jobs (6907.002-6907.019) serviced in the same cycle?
> > This job (6907) finished after a while, but other entries in the
> > NegotiatorLog and MatchLog for that job weren't complete. Some processes
> > of that cluster were serviced but not logged - maybe a bug.
> > 
> > My jobs have rank = kflops in the submit files. The machine
> > ***.130.71.139 is one of the fastest (4th), so Condor tried to claim that
> > machine first in every negotiation cycle, because the 3 fastest machines
> > were already claimed. But that machine blocked all traffic, so Condor
> > stopped matchmaking and didn't look at the next free machine. As a result,
> > my whole cluster was only serviced by my 3 fastest machines - out of a
> > pool with 216 other machines that matched and had nothing to do. That
> > took a long time ;)
> > 
> > Suggestion: If Condor can't connect to a machine, it should claim the
> > next best free machine for the job instead of ending the cycle.
> > Otherwise, network problems can have big negative effects on the whole
> > Condor pool.
> > 
> > regards
> > Thomas Lisson
> > NRW-Grid
> > 