
[HTCondor-users] Negotiation and group_quota issue



Hi All,
I'm setting up an additional negotiator for a small number of GPU servers.
GROUP_ACCEPT_SURPLUS = True

The issue is that sometimes the quota limit for a group is "-nan".

I'm submitting many single-core jobs, but only 3 of them run, which matches my quota. The pool is otherwise empty, so I would expect 12 jobs to be running.
When I restart condor on one of the execution points ("systemctl restart condor"), all 12 jobs start running.
If I remove the jobs and submit them again after a while, the quota limit is nan again.

Negotiator log, grepped for "Group physics":

--------------------Submitted many jobs: 3 are running, 12 expected
10:37:06 Group physics - BEGIN NEGOTIATION with a quota limit of 3.54984
10:37:06 Group physics is using its quota 3 - halting negotiation
10:37:06 Group physics - BEGIN NEGOTIATION with a quota limit of -nan
10:37:14 Group physics - BEGIN NEGOTIATION with a quota limit of 3.54984
10:37:14 Group physics is using its quota 3 - halting negotiation
10:37:14 Group physics - BEGIN NEGOTIATION with a quota limit of -nan
10:37:23 Group physics - BEGIN NEGOTIATION with a quota limit of 3.54984
10:37:23 Group physics is using its quota 3 - halting negotiation

--------------------Restart condor on a single EP
10:37:23 Group physics - BEGIN NEGOTIATION with a quota limit of 12
10:37:23 Group physics - skipping, no submitters (usage=8)
10:37:30 Group physics - BEGIN NEGOTIATION with a quota limit of 3.54984
10:37:30 Group physics is using its quota 3 - halting negotiation
10:37:30 Group physics - BEGIN NEGOTIATION with a quota limit of 12
10:37:30 Group physics is using its quota 12 - halting negotiation
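
To make the pattern easier to spot in a longer log, here is a minimal Python sketch that flags negotiation cycles starting with a NaN quota limit. It assumes the lines look like the excerpts above (a time stamp, then "Group <name> - BEGIN NEGOTIATION with a quota limit of <value>"); the function name find_nan_quotas and the sample data are only for illustration and are not part of HTCondor.

```python
import re

# Assumed line shape, taken from the excerpts above:
#   "10:37:06 Group physics - BEGIN NEGOTIATION with a quota limit of -nan"
# re.search is used so a leading date in front of the time does not matter.
QUOTA_RE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}) Group (?P<group>\S+) - "
    r"BEGIN NEGOTIATION with a quota limit of (?P<quota>\S+)"
)


def find_nan_quotas(lines):
    """Return (time, group) pairs for negotiations that began with a NaN quota."""
    hits = []
    for line in lines:
        match = QUOTA_RE.search(line)
        if match is None:
            continue
        # The log prints "-nan"; strip any sign and compare case-insensitively.
        if match.group("quota").lower().lstrip("+-") == "nan":
            hits.append((match.group("time"), match.group("group")))
    return hits


# Sample lines copied from the log excerpt above.
sample = [
    "10:37:06 Group physics - BEGIN NEGOTIATION with a quota limit of 3.54984",
    "10:37:06 Group physics is using its quota 3 - halting negotiation",
    "10:37:06 Group physics - BEGIN NEGOTIATION with a quota limit of -nan",
    "10:37:30 Group physics - BEGIN NEGOTIATION with a quota limit of 12",
]

if __name__ == "__main__":
    for time_stamp, group in find_nan_quotas(sample):
        print(f"{time_stamp} group={group} started negotiation with a NaN quota")
```

To run it against a real log, pass an open file object instead of the sample list, for example find_nan_quotas(open(path)) with path pointing at the negotiator log.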

This issue is not related to GPUs.
I have seen this issue before on a large pool, and it disappeared.

It's probably something in the configuration, but I can think of something that will trigger that after few
It happens on versions 9 and 23.

I will keep digging.

David