
RE: [Condor-users] negotiating with schedds when a client has FW



Hi Thomas,

You have just hit the same problem I ran into back in February (the
subject was "Negotiator gets stuck").
Unfortunately I did not get a satisfactory response from the developers.
The best proposal (from Chris Mellen) was to use the macro
 
NEGOTIATE_ALL_JOBS_IN_CLUSTER = True

in the condor_config file on the machine where the SCHEDD is running.
This is a very useful macro indeed, but it does not help in this
particular case.
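
For completeness, this is all it takes to try it (a minimal sketch of
the submit-host configuration; as I understand it, the macro makes the
schedd keep negotiating the remaining jobs of a cluster after one of
them is rejected, instead of skipping the rest of the cluster):

  # condor_config on the machine running the SCHEDD
  NEGOTIATE_ALL_JOBS_IN_CLUSTER = True

Run condor_reconfig on that machine afterwards so the change is picked
up.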

It seems I have failed to persuade Nick LeRoy that this problem has
nothing to do with the Negotiator <-> Schedd part of the negotiation
process, but rather with the Negotiator <-> Startd part.
The Schedd is fine here: it provides the list of jobs to run and just
waits patiently while the Negotiator dispatches them. If the Start
daemons respond properly, everything is fine.
But if the compute node at the top of the matched list fails for some
reason (mainly networking problems in our case), the Negotiator does
not simply dismiss it and take the next best node; it halts the whole
cycle.
A couple of minutes later, in the next cycle, the story repeats itself,
because the faulty node is still at the top of the list.
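
To make the point concrete, here is the control flow I mean, as plain
Python pseudocode (not Condor source; negotiate_cycle, try_claim and
ConnectError are made-up names for illustration):

class ConnectError(Exception):
    """Stands in for the TCP connect timeout seen in the NegotiatorLog."""

def negotiate_cycle(requests, ranked_startds, try_claim, skip_bad_nodes):
    # ranked_startds is sorted best-first (here: by the job's rank).
    for request in requests:
        for startd in list(ranked_startds):
            try:
                try_claim(request, startd)
                ranked_startds.remove(startd)  # claimed; no longer available
                break                          # matched; next request
            except ConnectError:
                if not skip_bad_nodes:
                    # Observed 6.7.6 behaviour: give up for the rest of
                    # the cycle ("Error: Ignoring schedd for this cycle");
                    # the faulty node is still at the top of the sorted
                    # list when the next cycle starts.
                    return
                # Suggested behaviour: drop the unreachable node and fall
                # through to the next best machine in the list.
                ranked_startds.remove(startd)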

regards,

Andrey


> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Thomas Lisson
> Sent: Friday, June 17, 2005 3:11 PM
> To: Condor-Users Mail List
> Subject: [Condor-users] negotiating with schedds when a client has FW
> 
> Hello,
> 
> $CondorVersion: 6.7.6 Mar 15 2005 $
> $CondorPlatform: I386-LINUX_RH9 $
> 
> I just wondered why my machines weren't being claimed even though
> they were unclaimed and met all the requirements.
> 
>                      Total Owner Claimed Unclaimed Matched Preempting
> 
>           IA64/LINUX    24    24       0         0       0          0
>          INTEL/LINUX    60     6       1        53       0          0
>        INTEL/WINNT50     2     0       0         2       0          0
>        INTEL/WINNT51   163     0       2       161       0          0
>         x86_64/LINUX     1     1       0         0       0          0
> 
>                Total   250    31       3       216       0          0
> 
> 
> 6907.002:  Run analysis summary.  Of 250 machines,
>      25 are rejected by your job's requirements
>       6 reject your job because of their own requirements
>       3 match but are serving users with a better priority in the pool
>     216 match but reject the job for unknown reasons
>       0 match but will not currently preempt their existing job
>       0 are available to run your job
> 
> [...]
> 
> 6907.019:  Run analysis summary.  Of 250 machines,
>      25 are rejected by your job's requirements
>       6 reject your job because of their own requirements
>       3 match but are serving users with a better priority in the pool
>     216 match but reject the job for unknown reasons
>       0 match but will not currently preempt their existing job
>       0 are available to run your job
> 
> I took a look at the Negotiator log:
> 6/16 13:54:33 ---------- Started Negotiation Cycle ----------
> 6/16 13:54:33 Phase 1:  Obtaining ads from collector ...
> 6/16 13:54:33   Getting all public ads ...
> 6/16 13:54:33   Sorting 366 ads ...
> 6/16 13:54:33   Getting startd private ads ...
> 6/16 13:54:33 Got ads: 366 public and 250 private
> 6/16 13:54:33 Public ads include 1 submitter, 250 startd
> 6/16 13:54:33 Phase 2:  Performing accounting ...
> 6/16 13:54:33 Phase 3:  Sorting submitter ads by priority ...
> 6/16 13:54:33 Phase 4.1:  Negotiating with schedds ...
> 6/16 13:54:33   Negotiating with nobody@*** at <***.130.4.77:9601>
> 6/16 13:54:33     Request 06907.00000:
> 6/16 13:54:33       Matched 6907.0 nobody@*** <***.130.4.77:9601> preempting none <***.130.71.149:9620>
> 6/16 13:54:33       Successfully matched with vm1@pc49.***
> 6/16 13:54:33     Request 06907.00001:
> 6/16 13:54:33       Matched 6907.1 nobody@*** <***.130.4.77:9601> preempting none <***.130.71.149:9620>
> 6/16 13:54:33       Successfully matched with vm2@pc49.***
> 6/16 13:54:33     Request 06907.00002:
> 6/16 13:57:42 Can't connect to <***.130.71.139:10066>:0, errno = 110
> 6/16 13:57:42 Will keep trying for 10 seconds...
> 6/16 13:57:43 Connect failed for 10 seconds; returning FALSE
> 6/16 13:57:43 ERROR: SECMAN:2003:TCP connection to <***.130.71.139:10066> failed
> 6/16 13:57:43 condor_write(): Socket closed when trying to write buffer
> 6/16 13:57:43 Buf::write(): condor_write() failed
> 6/16 13:57:43       Could not send PERMISSION
> 6/16 13:57:43   Error: Ignoring schedd for this cycle
> 6/16 13:57:43 ---------- Finished Negotiation Cycle ----------
> 
> I checked ***.130.71.139 and noticed that the machine had a
> dysfunctional network service - all requests were blocked, although
> the machine (Win XP) told me the firewall was off.
> OK, let's assume ***.130.71.139 blocks every incoming connection - but
> why aren't all the other jobs (6907.002-6907.019) serviced in the same
> cycle?
> This job (6907) finished after a while, but other entries in the
> NegotiatorLog and MatchLog for that job were incomplete. Some
> processes of that cluster were serviced but not logged - maybe a bug.
> 
> My jobs have rank = kflops in the submit files. The machine
> ***.130.71.139 is one of the fastest (4th), so Condor tried to claim
> that machine first in every negotiation cycle, because the 3 fastest
> machines were already claimed. But that machine blocked all traffic,
> so Condor stopped matchmaking and never looked at the next free
> machine. So my whole cluster was serviced by only my 3 fastest
> machines - out of a pool with 216 other machines that matched and had
> nothing to do. That took a long time ;)
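> 
> For reference, the submit files look roughly like this (a minimal
> sketch; the executable name and queue count are placeholders):
> 
>   executable = myjob
>   # prefer machines with the highest KFlops benchmark figure
>   rank       = kflops
>   queue 20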
> 
> Suggestion: if Condor can't connect to a machine, it should claim the
> next best free machine for the job instead of aborting the cycle.
> Otherwise, network problems can have big negative effects on the whole
> Condor pool.
> 
> regards
> Thomas Lisson
> NRW-Grid
> 
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 
>