
Re: [HTCondor-users] Trying to figure out how rank works when submitting

Hi Greg, thanks for the clarification.

I'm trying to do something similar to Fabrice, but with large-memory and GPU-equipped machines instead of different countries. I'm having trouble getting my pre-job rank to steer matches as expected, and I'm not sure why.

To keep things simple, I'll focus on one goal: non-GPU jobs should consider GPU machines only as a last resort.

To that end, I have a clause in my pre-job rank to reduce the rank of all GPU-equipped machines:

( -10e6 * (!isUndefined(MY.TotalGPUs) && MY.TotalGPUs > 0) ) 

Since jobs that require a GPU will only match machines that have a GPU, this expression simply applies the same -10e6 offset to every eligible machine for a GPU job, and thus makes no difference to the relative ranking for GPU-required jobs.
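
To make the arithmetic of that clause concrete, here is a minimal Python sketch of how it would evaluate against a few machine ads. It is only an illustration, not real ClassAd evaluation: machine ads are modelled as plain dicts, a missing key stands in for an undefined attribute, and the machine names and GPU counts are made up.

```python
# Illustration only: mimics the penalty clause with plain Python dicts.
# A missing "TotalGPUs" key stands in for an undefined ClassAd attribute.

GPU_PENALTY = -10e6  # same literal as in the clause; 10e6 is 10,000,000


def gpu_penalty(machine_ad):
    """Mirror of: -10e6 * (!isUndefined(MY.TotalGPUs) && MY.TotalGPUs > 0)."""
    total_gpus = machine_ad.get("TotalGPUs")  # None plays the role of undefined
    has_gpu = total_gpus is not None and total_gpus > 0
    return GPU_PENALTY if has_gpu else 0.0


# Hypothetical machine ads, invented for this sketch.
cpu_only = {"Machine": "cpu01"}                  # TotalGPUs undefined
gpu_zero = {"Machine": "cpu02", "TotalGPUs": 0}  # defined, but no GPUs
gpu_node = {"Machine": "gpu01", "TotalGPUs": 4}  # GPU-equipped

for ad in (cpu_only, gpu_zero, gpu_node):
    print(ad["Machine"], gpu_penalty(ad))
```

Under these assumptions, only the GPU-equipped ad picks up the penalty, and every machine a GPU job can match picks up the same penalty, so their relative order is unchanged.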

I am using partitionable slots, and I have claim_partitionable_leftovers set to true, through the 8.8 "use feature : PartitionableSlot(1)" config.

I have 16 non-GPU machines and 4 GPU machines.

When I submit a "sleep" job consisting of 20 procs:

    Executable = /bin/sleep
    Arguments = 5m
    Queue 20

... instead of all the jobs ending up on non-GPU machines, some of them are matched to GPU machines.

My understanding from the manual is that only the machines tied for the highest pre-job rank are then compared by the job rank, and after that by the post-job rank expression.
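
The ordering I have in mind can be sketched in Python as a lexicographic comparison. This models only my reading of the manual, not the negotiator's actual implementation; the Candidate type, the machine names, and the rank values are all hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical stand-in for a machine that matched the job."""
    name: str
    pre_job_rank: float
    job_rank: float
    post_job_rank: float


def best_match(candidates):
    """Pick the highest (pre-job, job, post-job) rank tuple.

    Python compares tuples element by element, so job rank only breaks
    ties in pre-job rank, and post-job rank only breaks ties in both.
    """
    return max(
        candidates,
        key=lambda c: (c.pre_job_rank, c.job_rank, c.post_job_rank),
    )


# Made-up values: the GPU machine has the best job rank but carries the penalty.
candidates = [
    Candidate("gpu01", -10e6, 5.0, 0.0),
    Candidate("cpu01", 0.0, 0.0, 0.0),
    Candidate("cpu02", 0.0, 1.0, 0.0),
]

print(best_match(candidates).name)
```

If the matchmaking really worked this way for each of my jobs, a GPU machine could win only when no non-GPU machine was among the candidates, which is why the behavior I'm seeing surprises me.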

I had thought this might be related to group_quota_round_robin_rate, based on the manual's description of that value ("Setting GROUP_QUOTA_ROUND_ROBIN_RATE to a value that is small compared to the size of subsets of machines"), but setting the rate to 16, or 12, didn't prevent my jobs from matching GPU machines.

I'd appreciate any insights.

Michael V Pelletier
Principal Engineer

Raytheon Technologies
Information Technology
Digital Transformation & Innovation