[HTCondor-users] Jobs massively killed with PeriodicRemove
- Date: Thu, 24 Mar 2022 10:15:00 +0100
- From: Edith Knoops <knoops@xxxxxxxxxxxxx>
- Subject: [HTCondor-users] Jobs massively killed with PeriodicRemove
Hi,
We have a problem at our WLCG T2 site, IN2P3-CPPM. After activating defrag,
jobs are killed with "The job attribute PeriodicRemove expression
'(JobStatus == 1 && NumJobStarts > 0)' evaluated to TRUE".
We use DEFRAG_SCHEDULE = graceful. The full config file is below, see 1).
The HTCondor version was 8.8 until yesterday, when we upgraded to 9.0.11.
The problem is the same with both versions.
If we deactivate defrag, the problem disappears.
NB: the condor_drain command also killed all the running jobs.
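As far as we understand it, the PeriodicRemove expression quoted above matches any job that is idle again after having started at least once, which is the state we would expect for a job that is evicted from a draining node and put back in the queue. A minimal Python sketch of that reading, assuming the usual JobStatus codes (1 = Idle, 2 = Running); the function name and the sample job states are ours, for illustration only, and not part of HTCondor:

```python
# JobStatus codes assumed here: 1 = Idle, 2 = Running.
IDLE = 1
RUNNING = 2


def periodic_remove(job_status: int, num_job_starts: int) -> bool:
    """Mirror of the expression (JobStatus == 1 && NumJobStarts > 0)."""
    return job_status == IDLE and num_job_starts > 0


# Hypothetical job states, for illustration only.
print(periodic_remove(IDLE, 0))     # queued, never started
print(periodic_remove(RUNNING, 1))  # running after its first start
print(periodic_remove(IDLE, 1))     # started once, then back to idle
```

Only the last case evaluates to true, so under this reading any job that goes back to idle after a start would be removed rather than rescheduled.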
Do you have any idea how we can get defrag working with a truly graceful
drain?
We need defrag so that enough 8-core ATLAS jobs can run alongside the
LHCb jobs.
Thanks
Edith & Carlos
1)
DAEMON_LIST = $(DAEMON_LIST) DEFRAG
DEFRAG_INTERVAL = 300
#DEFRAG_DRAINING_MACHINES_PER_HOUR = 30.0
DEFRAG_DRAINING_MACHINES_PER_HOUR = 30.0
#DEFRAG_MAX_CONCURRENT_DRAINING = 60
DEFRAG_MAX_CONCURRENT_DRAINING = 40
#DEFRAG_MAX_WHOLE_MACHINES = 300
DEFRAG_MAX_WHOLE_MACHINES = 90
DEFRAG_SCHEDULE = graceful
## Allow some defrag configuration to be settable
DEFRAG.SETTABLE_ATTRS_ADMINISTRATOR = \
  DEFRAG_MAX_CONCURRENT_DRAINING,DEFRAG_DRAINING_MACHINES_PER_HOUR,DEFRAG_MAX_WHOLE_MACHINES
ENABLE_RUNTIME_CONFIG = TRUE
## Which machines are more desirable to drain
DEFRAG_RANK = ifThenElse(Cpus >= 8, -10, (TotalCpus - Cpus)/(8.0 - Cpus))
# Definition of a "whole" machine:
# - anything with 8 cores (since multicore jobs only need 8 cores, we don't
#   need to drain whole machines with > 8 cores)
# - must be configured to actually start new jobs (otherwise machines
#   which are deliberately being drained will be included)
#DEFRAG_WHOLE_MACHINE_EXPR = ((Cpus == TotalCpus) || (Cpus >= 8))
#DEFRAG_WHOLE_MACHINE_EXPR = ((Cpus == TotalCpus) || (Cpus >= 8)) && StartJobs =?= True && CPPMNodeOnline =?= True
DEFRAG_WHOLE_MACHINE_EXPR = ((Cpus == TotalCpus) || (Cpus >= 8)) && \
  StartJobs =!= True && CPPMNodeOnline =!= True
# Decide which machines to drain
# - must not be cloud machines
# - must be healthy
# - must be configured to actually start new jobs
DEFRAG_REQUIREMENTS = PartitionableSlot && StartJobs =?= True && \
  CPPMNodeOnline =?= True
## Logs
MAX_DEFRAG_LOG = 104857600
MAX_NUM_DEFRAG_LOG = 10
#The following command may be used to view the condor_defrag daemon ClassAd:
#condor_status -l -any -constraint 'MyType == "Defrag"'
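To make the DEFRAG_RANK expression in the config above easier to check, here is a small Python sketch of the same arithmetic. The function name and the sample machines are hypothetical. We take Cpus to be the free cores on the partitionable slot and TotalCpus the total cores of the machine, and we assume that a higher rank means the machine is preferred for draining:

```python
def defrag_rank(cpus: float, total_cpus: float) -> float:
    """Mirror of ifThenElse(Cpus >= 8, -10, (TotalCpus - Cpus)/(8.0 - Cpus))."""
    if cpus >= 8:
        # Already has room for an 8-core job, so draining it gains nothing.
        return -10.0
    # cpus < 8 on this branch, so the denominator is always positive.
    return (total_cpus - cpus) / (8.0 - cpus)


# Hypothetical 16-core machines with different numbers of free cores.
for free in (0, 4, 7, 8):
    print(free, defrag_rank(free, 16))
```

With these sample values the rank grows as the number of free cores approaches 8, and drops to -10 once 8 or more cores are already free.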
--
--------------------------------------------------------------
Edith Knoops
CPPM/CNRS Mail: knoops@xxxxxxxxxxxxx
163 Av de Luminy case 902 Tel : (+33) (0)4 91 82 72 02
13288 Marseille Cedex 9 France
--------------------------------------------------------------