
[Condor-users] negotiating with schedds when a client has FW



Hello,

$CondorVersion: 6.7.6 Mar 15 2005 $
$CondorPlatform: I386-LINUX_RH9 $

I wondered why my machines weren't being claimed even though they were unclaimed and met all of the job's requirements.

         IA64/LINUX       24    24       0         0       0          0
        INTEL/LINUX       60     6       1        53       0          0
      INTEL/WINNT50        2     0       0         2       0          0
      INTEL/WINNT51      163     0       2       161       0          0
       x86_64/LINUX        1     1       0         0       0          0

              Total      250    31       3       216       0          0


6907.002:  Run analysis summary.  Of 250 machines,
    25 are rejected by your job's requirements
     6 reject your job because of their own requirements
     3 match but are serving users with a better priority in the pool
   216 match but reject the job for unknown reasons
     0 match but will not currently preempt their existing job
     0 are available to run your job

[...]

6907.019:  Run analysis summary.  Of 250 machines,
    25 are rejected by your job's requirements
     6 reject your job because of their own requirements
     3 match but are serving users with a better priority in the pool
   216 match but reject the job for unknown reasons
     0 match but will not currently preempt their existing job
     0 are available to run your job

I took a look at the Negotiator log:
6/16 13:54:33 ---------- Started Negotiation Cycle ----------
6/16 13:54:33 Phase 1: Obtaining ads from collector ...
6/16 13:54:33 Getting all public ads ...
6/16 13:54:33 Sorting 366 ads ...
6/16 13:54:33 Getting startd private ads ...
6/16 13:54:33 Got ads: 366 public and 250 private
6/16 13:54:33 Public ads include 1 submitter, 250 startd
6/16 13:54:33 Phase 2: Performing accounting ...
6/16 13:54:33 Phase 3: Sorting submitter ads by priority ...
6/16 13:54:33 Phase 4.1: Negotiating with schedds ...
6/16 13:54:33 Negotiating with nobody@*** at <***.130.4.77:9601>
6/16 13:54:33 Request 06907.00000:
6/16 13:54:33 Matched 6907.0 nobody@*** <***.130.4.77:9601> preempting none <***.130.71.149:9620>
6/16 13:54:33 Successfully matched with vm1@pc49.***
6/16 13:54:33 Request 06907.00001:
6/16 13:54:33 Matched 6907.1 nobody@*** <***.130.4.77:9601> preempting none <***.130.71.149:9620>
6/16 13:54:33 Successfully matched with vm2@pc49.***
6/16 13:54:33 Request 06907.00002:
6/16 13:57:42 Can't connect to <***.130.71.139:10066>:0, errno = 110
6/16 13:57:42 Will keep trying for 10 seconds...
6/16 13:57:43 Connect failed for 10 seconds; returning FALSE
6/16 13:57:43 ERROR: SECMAN:2003:TCP connection to <***.130.71.139:10066> failed
6/16 13:57:43 condor_write(): Socket closed when trying to write buffer
6/16 13:57:43 Buf::write(): condor_write() failed
6/16 13:57:43 Could not send PERMISSION
6/16 13:57:43 Error: Ignoring schedd for this cycle
6/16 13:57:43 ---------- Finished Negotiation Cycle ----------


I checked ***.130.71.139 and found that its network service was dysfunctional: all incoming requests were blocked, although the machine (Windows XP) reported that the firewall was off.
OK, let's assume ***.130.71.139 blocks all incoming traffic. But why aren't the remaining jobs (6907.002-6907.019) serviced in the same cycle?
Job 6907 did finish after a while, but the NegotiatorLog and MatchLog entries for it were incomplete. Some processes of that cluster were serviced but never logged, which may be a bug.
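
For anyone who wants to repeat the reachability check from another host, a plain TCP connect with a timeout is enough. This is only a minimal sketch in Python, not anything Condor ships: the helper name can_connect is my own, and the address in the usage comment is a placeholder, since the real one is masked above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    Hypothetical helper for illustration; it is not part of Condor.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Usage (placeholder address; port taken from the NegotiatorLog line above):
#   can_connect("192.0.2.10", 10066)
```

A host that silently drops packets makes this call wait for the full timeout before returning False, while a host that is up but has nothing listening on the port usually fails immediately with a refused connection.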


My jobs have rank = kflops in their submit files. The machine ***.130.71.139 is one of the fastest (the 4th fastest), so in every negotiation cycle Condor tried to claim it first, because the 3 fastest machines were already claimed. But that machine blocked all traffic, so Condor stopped matchmaking and never looked at the next free machine. As a result my whole cluster was serviced only by the 3 fastest machines, out of a pool with 216 other machines that matched and were idle. That took a long time ;)

Suggestion: If Condor can't connect to a machine, it should claim the next-best free machine for the job instead of ending the cycle. Otherwise, network problems on a single machine can have a large negative effect on the whole Condor pool.
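
To make the difference concrete, here is a toy model in Python of the behaviour I observed versus the one I am suggesting. It is a sketch under my own assumptions and is not Condor's actual negotiator code: the function negotiate, the flag skip_unreachable, and the machine names and kflops values are all made up for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def negotiate(
    jobs: Sequence[int],
    machines: Sequence[Tuple[str, int]],   # (name, kflops), hypothetical data
    can_connect: Callable[[str], bool],
    skip_unreachable: bool,
) -> List[Tuple[int, str]]:
    """Toy model of one negotiation cycle; returns (job, machine) matches."""
    # Highest kflops first, mirroring rank = kflops in the submit file.
    free = sorted(machines, key=lambda m: m[1], reverse=True)
    matches: List[Tuple[int, str]] = []
    for job in jobs:
        while free:
            name, _ = free.pop(0)          # best remaining machine
            if can_connect(name):
                matches.append((job, name))
                break
            if not skip_unreachable:
                # Observed behaviour: give up on the schedd for this cycle.
                return matches
            # Suggested behaviour: drop the machine and try the next best.
        else:
            break                          # no free machines left
    return matches


pool = [("fast1", 900), ("fast2", 890), ("blocked", 880),
        ("slow1", 500), ("slow2", 400)]


def reachable(name: str) -> bool:
    return name != "blocked"


observed = negotiate(range(5), pool, reachable, skip_unreachable=False)
suggested = negotiate(range(5), pool, reachable, skip_unreachable=True)
print(len(observed), len(suggested))
```

With this made-up pool the first variant stops after two matches, as in the log above, while the second goes on to use the slower machines and leaves unmatched only the job for which no machine remains.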

regards
Thomas Lisson
NRW-Grid