
Re: [HTCondor-users] Fair-share limits reached while whole machines are available and jobs are idle



Hi Alec,

How many slots do you have in the pool? The quota for the <none> group should be all of the available slots. If you have significantly more than the 2629 slots shown, you may be filtering out slots that shouldn't be filtered. You could check what NEGOTIATOR_SLOT_CONSTRAINT, NEGOTIATOR_SLOT_POOLSIZE_CONSTRAINT, and GROUP_DYNAMIC_MACH_CONSTRAINT are set to.
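
A minimal sketch of that check in Python, assuming the `condor_config_val` command-line tool is on the PATH of the machine that runs the negotiator. The helper names (`run_condor_config_val`, `report_constraints`) are made up for this illustration, and a knob that is undefined or cannot be queried is reported as `None`.

```python
import subprocess

# The three knobs named above that can hide slots from the negotiator.
KNOBS = [
    "NEGOTIATOR_SLOT_CONSTRAINT",
    "NEGOTIATOR_SLOT_POOLSIZE_CONSTRAINT",
    "GROUP_DYNAMIC_MACH_CONSTRAINT",
]


def run_condor_config_val(name):
    """Return the configured value of `name`, or None if it is not
    defined or condor_config_val cannot be run (assumed behaviour:
    a non-zero exit status means the knob is not defined)."""
    try:
        result = subprocess.run(
            ["condor_config_val", name],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


def report_constraints(lookup=run_condor_config_val):
    """Map each knob to its value; `lookup` is injectable for testing."""
    return {name: lookup(name) for name in KNOBS}
```

Calling `report_constraints()` and printing the result shows at a glance whether any of the three constraints is set. A value other than `None` is worth comparing against the slots you expected to be counted.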

Are you using weighted slots? If so, make sure everything is weighting them the same way (Jobs, Schedds, and the Slots themselves). If you're using an expression for RequestCpus/Memory/Disk that references an attribute of the target slot, and that attribute is part of your weight, make sure to re-define the schedd slot weight in a way that only uses the job's attributes.
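
To illustrate why the weighting has to agree, here is a toy Python sketch. It is not HTCondor code: the dictionaries stand in for slot and job ads, and the weight functions are assumptions chosen for the example (slot weight equal to the slot's Cpus, job-side weight equal to the job's RequestCpus).

```python
def slot_weight(slot):
    # Assumed slot-side weight: the number of CPUs in the claimed slot.
    return slot["Cpus"]


def job_weight(job):
    # Job-side weight that uses only job attributes, so it can be
    # evaluated without looking at any slot.
    return job["RequestCpus"]


def unit_weight(_job):
    # A mismatched scheme: every running job counts as weight 1.
    return 1


def total(items, weight_fn):
    """Sum the weights of all items under one weighting scheme."""
    return sum(weight_fn(item) for item in items)


# Three running jobs, each on a 4-CPU dynamic slot (hypothetical data).
jobs = [{"RequestCpus": 4} for _ in range(3)]
slots = [{"Cpus": 4} for _ in range(3)]

slot_side = total(slots, slot_weight)   # usage as seen from the slots
job_side = total(jobs, job_weight)      # usage from job attributes only
mismatched = total(jobs, unit_weight)   # usage if jobs are counted as 1
```

In this example the slot-side and job-side totals agree, while the unit-weight scheme reports a quarter of the usage. A disagreement of that kind between components is the sort of thing that can make quota accounting look wrong.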

Best,
Collin

On Tue, Nov 20, 2018 at 1:47 PM Alec Sheperd <alec.sheperd@xxxxxxxxxxxxxxxx> wrote:
Hello,

I recently noticed something strange with our condor pool. There are a
lot of idle jobs in the queue, and yet there are nearly as many
available slots. There are even whole machines with no jobs running,
and yet none of the idle jobs get allocated one of these empty slots.

After digging around in the negotiator logs and ClassAds, it seems that
a lot of jobs are being rejected based on fair-share limits. There are
many more rejections than matches. From the
LastNegotiationCycleSubmittersShareLimit* ClassAd attribute, it seems
that everything being rejected appears in the list it provides.

These jobs are all being submitted from the default <none> group, which
has the surplus flag set. The negotiator log displays "Group
<none> is using its quota 2629 - halting negotiation".
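
For reference, a small Python sketch that pulls these messages out of a negotiator log. The regular expression is built only from the message quoted above; the function name and the timestamp prefix in the sample line are assumptions made for this example.

```python
import re

# Matches the message quoted above; anything before "Group" (such as a
# timestamp) is ignored because search() is used rather than match().
HALT_RE = re.compile(
    r"Group (\S+) is using its quota (\d+(?:\.\d+)?) - halting negotiation"
)


def find_quota_halts(lines):
    """Return (group, quota) pairs for every halting message found."""
    halts = []
    for line in lines:
        m = HALT_RE.search(line)
        if m:
            halts.append((m.group(1), float(m.group(2))))
    return halts


# Hypothetical log lines; the timestamp format is an assumption.
sample = [
    "11/20/18 13:00:00 Group <none> is using its quota 2629 - halting negotiation",
    "11/20/18 13:00:01 some unrelated message",
]
halts = find_quota_halts(sample)
```

Running it over the real log (for example, `find_quota_halts(open(path))` with `path` pointing at the negotiator log) shows how often each group hits its quota and what quota value the negotiator computed.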

Could something be wrong with user priorities and quotas that is
preventing slot matches? I also wonder whether it is related to the
bugs fixed in 8.7.10:
(https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6714)
(https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6750)

Thanks for any help or thoughts,

Alec




--
Collin Mehring | PE-JoSE - Software Engineer