
Re: [HTCondor-users] HTCondor 8.6.3 - Jobs evicted even if other slots are free



Because of your machine RANK expression, DetChar jobs will always preempt NoEMiTest jobs on any machine. So the key to preventing the undesirable preemptions is to ensure that when the negotiator sorts all of the machines that match a DetChar job, the idle machines appear at the top of the list. With the default settings and the local config changes listed, that should be happening.

The configuration parameter NEGOTIATOR_PRE_JOB_RANK is how you enforce a particular ordering of machines for each job being matched. If all of your machines have the same RANK expression, then the default value for NEGOTIATOR_PRE_JOB_RANK should make the negotiator prefer matching idle machines. Have you changed this setting?
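
For illustration only, here is a rough sketch of the kind of expression that produces this ordering. It is an assumption for discussion, not necessarily the exact default shipped with your version, so compare it against the value your negotiator actually reports. It sorts first by the machine's RANK for the job, then places unclaimed slots (RemoteOwner undefined) ahead of claimed ones:

```
# condorcl1 config.local (the host running the NEGOTIATOR)
# Illustrative sketch only; check the default for your HTCondor version.
# Higher values are matched first:
#   - My.Rank: the machine's RANK evaluated against the candidate job
#   - (RemoteOwner =?= UNDEFINED): TRUE (1) for a slot that is not claimed
NEGOTIATOR_PRE_JOB_RANK = (10000000 * My.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED))
```

Running condor_config_val -negotiator NEGOTIATOR_PRE_JOB_RANK should show the value the negotiator is currently using, which tells you whether something has overridden it.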

 - Jaime

On Apr 1, 2019, at 7:37 AM, Nicolas Arnaud <narnaud@xxxxxxxxxxxx> wrote:


Hello,

I'd like to revive this thread: we couldn't solve the problem, so we still have jobs evicted for no obvious reason, even though there are free slots in the system where the new jobs could run independently. The longer a given job runs, the more likely it is to be affected by this issue, so the affected DAGs take much longer to run than they should.

Thanks in advance,

Nicolas

On 13/03/2019 at 12:00, Giuseppe Di Biase wrote:
Hi All,
our HTCondor architecture consists of:
 * condorcl1: DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, GANGLIAD, DEFRAG
 * submit1: DAEMON_LIST = MASTER, SCHEDD
 * olnode1..olnode64: DAEMON_LIST = MASTER, STARTD
*condorcl1 config.local is:*
COLLECTOR_NAME = $(CONDOR_HOST)
DAEMON_LIST    = MASTER, COLLECTOR, NEGOTIATOR, GANGLIAD, DEFRAG
DEFRAG_INTERVAL = 3600
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
DEFRAG_MAX_WHOLE_MACHINES = 20
DEFRAG_MAX_CONCURRENT_DRAINING = 10
DEFRAG_SCHEDULE = graceful
*submit1 config.local is:*
COLLECTOR_NAME = $(CONDOR_HOST)
DAEMON_LIST    = MASTER, SCHEDD
SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) CheckExp
SUBMIT_REQUIREMENT_CheckExp = JobUniverse == 5 || JobUniverse == 7
SUBMIT_REQUIREMET_CheckExp_REASON = "Submissions must have +Experiment"
EVENT_LOG = /virgoLog/HTCondor/event_log/events.log
*olnodeXX config.local is:*
COLLECTOR_NAME = $(CONDOR_HOST)
DAEMON_LIST = MASTER, STARTD
NUM_SLOTS = 1
SLOT_TYPE_1 = cpus=100%
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1_PARTITIONABLE = TRUE
SUSPEND_VANILLA = False
PREEEMPT_VANILLA = False
KILL_VANILLA = False
START = TRUE
IsNoEMi = (Experiment =?= "NoEMi")
IsDetChar = (Experiment =?= "DetChar")
IscWB = (Experiment =?= "cWB")
IsNoEMiTest = (Experiment =?= "NoEMiTest")
RANK = $(IsDetChar)*70 + $(IsNoEMi)*10 + $(IsNoEMiTest)*12 + $(IscWB)*8
GROUP_QUOTA_DYNAMIC_virgo.prod.o3.detchar.linefind.noemi = .30
GROUP_QUOTA_DYNAMIC_virgo.prod.o3.detchar.transient.dqr = .30
GROUP_QUOTA_DYNAMIC_virgo.prod.o3.burst.allsky.cwbonline = .40
In this configuration, "NoEMiTest" jobs in "virgo.prod.o3.detchar.linefind.noemi" (AccountingGroup) are always evicted by higher-priority jobs (Experiment=DetChar), because those jobs want to run on the same machines even if other machines are free.
Can you help me find out where the issue is?
Thanks
Giuseppe
--
===============================================
Giuseppe Di Biase -giuseppe.dibiase@xxxxxxxxx
European Gravitational Observatory - EGO
Via E.Amaldi - 56021 Cascina (Pisa) - IT
Phone: +39 050 752 577
===============================================
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
