
[Condor-users] dSlots and RANK expression



Hi all

I'm currently stuck with 7.6.6 and dynamic slots.
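For context, the execute nodes use the usual single-partitionable-slot setup, roughly like this (the percentages here are illustrative, not our literal config):

    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = cpus=100%, memory=100%, disk=100%
    SLOT_TYPE_1_PARTITIONABLE = TRUE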

We have a set of very important jobs which we tried to push into our cluster 
with a very good effective prio (factor 1 instead of 100 or 1000 for other 
users). However, this job requires all 4 cores on a single machine and is 
being starved for CPU cores:

condor_q -b yields:

25879.000:  Run analysis summary.  Of 7599 machines,
      0 are rejected by your job's requirements 
   7370 reject your job because of their own requirements 
      2 match but are serving users with a better priority in the pool 
      0 match but reject the job for unknown reasons 
     74 match but will not currently preempt their existing job 
      0 match but are currently offline 
    153 are available to run your job


but as these 153 slots are not 4 free cores on the same machine, the job will not start to run.
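For completeness, the submit description is essentially just the following (executable name is a placeholder; the important line is request_cpus):

    universe     = vanilla
    executable   = analysis.sh
    request_cpus = 4
    queue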

Question (a): Why is this job not running? Is there a way to let Condor 
"move"/preempt running std.universe jobs to make room for this job?
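In case it matters, the negotiator-side knobs I have been looking at are PREEMPTION_REQUIREMENTS and friends; something along these lines (a sketch, not our production config, and the 1.2 factor is invented):

    NEGOTIATOR_CONSIDER_PREEMPTION = True
    # preempt only when the incoming user's priority is clearly better
    PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmitterUserPrio * 1.2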

Thus, I thought I could help Condor a bit by adding a few nodes with special 
START/RANK settings, e.g. instead of

START=TRUE
RANK=0.0

I tried various versions like:
START = (Target.RemoteOwner == "highprio@xxxxxxxxxxx") || (Target.JobUniverse == 1)
RANK = (Owner =?= "highprio@xxxxxxxxxxx")

The START expression seems to work - as far as I can tell, only std 
universe jobs get matched - but those jobs are not pushed off the machine 
by the waiting jobs submitted by "highprio".

Question (b): Is the machine RANK evaluated on a per-dslot ("subslot") basis? 
I think this would explain why the "RequestCpus = 4" job from above never 
matches when the dslot has fewer cores.

Question (c): Do I need condor_defrag from 7.7? If yes, is it considered safe 
to run 7.6 submit and execute machines alongside a few 7.7 execute machines?
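If I read the 7.7 documentation correctly, enabling the defrag daemon would look roughly like this (the numbers are invented, just to show the shape):

    DAEMON_LIST = $(DAEMON_LIST) DEFRAG
    DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
    DEFRAG_MAX_WHOLE_MACHINES = 4
    DEFRAG_WHOLE_MACHINE_EXPR = Cpus == TotalCpus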

Question (d): Is my way of thinking totally flawed?

Cheers

carsten