[Condor-users] Need help with 7.0.1



A couple of weeks ago I upgraded our Wintel Condor pool from 6.8.3 to
7.0.1.  Now I get odd behavior and need help figuring out where the
problem is.  I'm sure it's something in my setup, but I cannot figure
out what.  I'm using similar submit files as before, changing any
mention of VM to SLOT.  I've compared our old condor_config files to the
ones in the 7.0.1 release and made the appropriate changes in our config
files.  I've gone through the log files looking for obvious errors and
don't see any, but I'm not a Condor log file guru.

Symptoms:
-- 7.0.1 should recognize hyperthreaded machines and not count the
extra logical CPUs as slots, but it does not, even with
COUNT_HYPERTHREAD_CPUS = FALSE.  For instance, for the machines listed
below, none should have slot 3 or 4.

-- One machine in the pool started making lots of claims with no job to
run; I finally had to stop Condor to stop the claims:
[eli@water does not have any submits going on]

4/14 09:09:46       Successfully matched with
slot3@xxxxxxxxxxxxxxxxxxxxxxxx
4/14 09:09:46     Request 00468.00131:
4/14 09:09:46       Matched 468.131 eli@xxxxxxxxxxxx
<136.200.xx.xxx:1037> preempting none <136.200.xx.xxx:1945>
slot4@xxxxxxxxxxxxxxxxxxxxxxxx
4/14 09:09:46       Successfully matched with
slot4@xxxxxxxxxxxxxxxxxxxxxxxx
4/14 09:09:46     Request 00468.00132:
4/14 09:09:46       Matched 468.132 eli@xxxxxxxxxxxx
<136.200.xx.xxx:1037> preempting none <136.200.xx.xxx:4516>
slot4@xxxxxxxxxxxxxxxxxxxxxx
4/14 09:09:46       Successfully matched with
slot4@xxxxxxxxxxxxxxxxxxxxxx
4/14 09:09:46     Request 00468.00133:
4/14 09:09:46       Matched 468.133 eli@xxxxxxxxxxxx
<136.200.xx.xxx:1037> preempting none <136.200.xx.xxx:4516>
slot3@xxxxxxxxxxxxxxxxxxxxxx
4/14 09:09:46       Successfully matched with
slot3@xxxxxxxxxxxxxxxxxxxxxx
4/14 09:09:46     Request 00468.00134:
4/14 09:09:46       Matched 468.134 eli@xxxxxxxxxxxx
<136.200.xx.xxx:1037> preempting none <136.200.228.40:3592>
slot4@xxxxxxxxxxxxxxxxxxxxxxx
4/14 09:09:46       Successfully matched with
slot4@xxxxxxxxxxxxxxxxxxxxxxx
4/14 09:09:46     Request 00468.00135:
4/14 09:09:46       Matched 468.135 eli@xxxxxxxxxxxx
<136.200.xx.xxx:1037> preempting none <136.200.228.40:3592>
slot3@xxxxxxxxxxxxxxxxxxxxxxx
4/14 09:09:46       Successfully matched with
slot3@xxxxxxxxxxxxxxxxxxxxxxx
4/14 09:09:46     Request 00468.00136:
4/14 09:09:46       Matched 468.136 eli@xxxxxxxxxxxx
<136.200.xx.xxx:1037> preempting none <136.200.xx.xxx:1036>
slot3@xxxxxxxxxxxxxxxxxxxxxx
4/14 09:09:46       Successfully matched with
slot3@xxxxxxxxxxxxxxxxxxxxxx
4/14 09:09:46     Request 00468.00137:
4/14 09:09:46       Matched 468.137 eli@xxxxxxxxxxxx
<136.200.xx.xxx:1037> preempting none <136.200.xx.xxx:1036>
slot4@xxxxxxxxxxxxxxxxxxxxxx
4/14 09:09:46       Successfully matched with
slot4@xxxxxxxxxxxxxxxxxxxxxx
4/14 09:09:46     Request 00468.00138:
4/14 09:09:46       Matched 468.138 eli@xxxxxxxxxxxx
<136.200.xx.xxx:1037> preempting none 

-- Using a Requirements expression in the submit file, which worked
fine under 6.8.3, prevents jobs from matching any machine in the pool.
Without the Requirements line, jobs are matched.

.sub file:
Requirements = (Machine == "LOCKE.ad.water.xx.xxx")

$ condor_q -analyze 712

-- Submitter: ABBEY.ad.water.xx.xxx : <136.200.xx.xxx:1045> :
ABBEY.ad.water.xx.xxx
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

---
712.000:  Run analysis summary.  Of 56 machines,
     56 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job
	No successful match recorded.
	Last failed match: Mon Apr 14 08:49:43 2008
	Reason for last match failure: no match found

WARNING:  Be advised:
   No resources matched request's constraints
   Check the Requirements expression below:

Requirements = ((Machine == "LOCKE.ad.water.xx.xxx")) && (Arch ==
"INTEL") && (OpSys == "WINNT51") && (Disk >= DiskUsage) && ((Memory *
1024) >= ImageSize) && (HasFileTransfer)
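
Since all 56 machines are rejected by the job's requirements, one way to
narrow this down is to test each clause of the expanded expression
separately against a single machine ad.  Below is a minimal sketch in
plain Python, not a ClassAd evaluator.  The function name
failing_clauses and the sample dictionaries are hypothetical, and every
sample value except the Machine name is made up for illustration (the
real values would come from condor_status -l and condor_q -l).  It
assumes, as I understand ClassAd semantics, that == compares strings
without regard to case and that a missing attribute keeps a clause from
being true.

```python
# Hypothetical helper: report which clauses of the expanded Requirements
# expression a given machine ad fails. Plain Python, not a real ClassAd
# evaluator; the ads are ordinary dicts filled in by hand.

def failing_clauses(machine, job):
    """Return the label of every Requirements clause the machine ad fails."""
    checks = [
        # Assumption: ClassAd == on strings is case-insensitive.
        ('Machine == "%s"' % job["Machine"],
         lambda: machine["Machine"].lower() == job["Machine"].lower()),
        ('Arch == "INTEL"',
         lambda: machine["Arch"].lower() == "intel"),
        ('OpSys == "WINNT51"',
         lambda: machine["OpSys"].lower() == "winnt51"),
        ("Disk >= DiskUsage",
         lambda: machine["Disk"] >= job["DiskUsage"]),
        ("(Memory * 1024) >= ImageSize",
         lambda: machine["Memory"] * 1024 >= job["ImageSize"]),
        ("HasFileTransfer",
         lambda: machine["HasFileTransfer"] is True),
    ]
    failed = []
    for label, check in checks:
        try:
            ok = check()
        except KeyError:
            # A missing attribute is treated as "clause not satisfied".
            ok = False
        if not ok:
            failed.append(label)
    return failed


# Illustrative data only: the Machine name is from the submit file above,
# all other values are invented so that exactly one clause fails.
SAMPLE_MACHINE = {
    "Machine": "LOCKE.ad.water.xx.xxx",
    "Arch": "INTEL",
    "OpSys": "WINNT52",   # invented value, differs from the job's WINNT51
    "Disk": 1000000,
    "Memory": 512,
    "HasFileTransfer": True,
}
SAMPLE_JOB = {
    "Machine": "LOCKE.ad.water.xx.xxx",
    "DiskUsage": 100,
    "ImageSize": 2000,
}

if __name__ == "__main__":
    for clause in failing_clauses(SAMPLE_MACHINE, SAMPLE_JOB):
        print("fails:", clause)
```

Only the clauses visible in the expression above are checked, so an
empty result for a real ad would suggest the mismatch lies elsewhere.
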

$ condor_status -l locke

MyType = "Machine"
TargetType = "Job"
Name = "slot1@xxxxxxxxxxxxxxxxxxxxx"
Rank = (10 * (Owner == "none"))
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.900000)
MyCurrentTime = 1208184528
Machine = "LOCKE.ad.water.xx.xxx"