
[Condor-users] negotiation weirdness



I'm seeing somewhat strange results in job negotiation/scheduling.

We're running a small (~60-node) Condor cluster on a dozen or so Windows
boxes.  One box (crossroads) is the central manager (submit,manage), and
the rest are all dedicated submit,execute machines with preemption
turned off.  (The node config can be seen at
http://www.grantgoodyear.org/~grant/condorlogs/condor_config.txt )
When one user submits a large number of jobs, we're seeing his jobs get
scheduled even though other users have better priorities.

Here are two snapshots, ten minutes apart, of what's running and the
user priorities:

Oct. 30, 10:40am
http://www.grantgoodyear.org/~grant/condorlogs/running_200710301040.txt
http://www.grantgoodyear.org/~grant/condorlogs/priorities_200710301040.txt

Oct. 30, 10:50am
http://www.grantgoodyear.org/~grant/condorlogs/running_200710301050.txt
http://www.grantgoodyear.org/~grant/condorlogs/priorities_200710301050.txt

We script the submission files, and use group accounting, so even though
all jobs have the same owner, all of the jobs run from c:\sergey have 
+AccountingGroup = "sergey" set, the c:\jgalford jobs are in the
"jgalford" group, and the c:\ljacobson job is in the "ljacobson" group.
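To illustrate the idea (this is only a sketch, not our actual submission
script; the function name and the example paths are made up for
illustration), the group name is simply taken from the top-level
directory the job is run from:

```python
from pathlib import PureWindowsPath

def accounting_group_line(job_dir):
    """Build the +AccountingGroup submit line from a job directory.

    Hypothetical helper: assumes the group is the first directory
    component below the drive root, e.g. c:\\sergey\\... -> "sergey".
    """
    parts = PureWindowsPath(job_dir).parts  # e.g. ('c:\\', 'sergey', 'run1')
    group = parts[1]
    return f'+AccountingGroup = "{group}"'

print(accounting_group_line(r"c:\sergey\run1"))
print(accounting_group_line(r"c:\jgalford"))
```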

At 10:40, sergey has an effective priority of 9.57, jobs 52800-52877
(submitted on crossroads) are running, and jobs 52878-53481 (crossroads)
are waiting.  Group ljacobson has job 270 (submitted from littleboy)
running, and nothing waiting in the queue.  His priority is 0.51, but
since he has nothing waiting it doesn't matter.  Group jgalford has job 498
(submitted from fatman) running, jobs 483-487 (submitted from
greenhouse) running, and jobs 499-514 (submitted from fatman) waiting.
The jgalford effective priority is 3.66.

So, if I understand the way the negotiation process works, the waiting
jobs should be sorted so that the jgalford job 499 (fatman) should be
the next job chosen when a resource frees up, and that would be followed
by 500 (fatman), ....
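To make my expectation concrete, here is a small sketch using the 10:40
numbers.  It assumes the negotiator simply serves groups with idle jobs
in order of ascending effective priority; that is my understanding of
the process, not a description of Condor's actual code, and the names
below are made up for illustration.

```python
# Effective priorities at 10:40 (lower value = better priority).
priorities = {"sergey": 9.57, "ljacobson": 0.51, "jgalford": 3.66}

# Idle jobs per accounting group at 10:40:
# (submit machine, first waiting job id, last waiting job id).
# ljacobson has nothing waiting, so it does not appear here.
idle = {
    "sergey": ("crossroads", 52878, 53481),
    "jgalford": ("fatman", 499, 514),
}

def expected_order(priorities, idle):
    """Groups that have idle jobs, best (lowest) priority first."""
    return sorted(idle, key=lambda group: priorities[group])

order = expected_order(priorities, idle)
host, first_id, last_id = idle[order[0]]
next_job = (order[0], host, first_id)
print(order)     # ['jgalford', 'sergey']
print(next_job)  # ('jgalford', 'fatman', 499)
```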

At 10:50, sergey jobs 52800-52808 (crossroads) have finished, and now
sergey jobs 52809-52904 (crossroads) are running.  No new jgalford
jobs have started, even though jgalford has the better (numerically
lower) effective priority.

I've included the crossroads log files
(http://www.grantgoodyear.org/~grant/condorlogs/) for this time 
period.  I'm not seeing anything in the logs that explains this
behavior, but I'm hoping somebody else has better insight.

I'm thoroughly confused.

Help?

Thanks,
Grant Goodyear
-- 
Grant Goodyear		
web: http://www.grantgoodyear.org	
e-mail: grant@xxxxxxxxxxxxxxxxx