
[HTCondor-users] Jobs massively killed with PeriodicRemove



Hi,

We have a problem at our WLCG T2 site, IN2P3-CPPM. After activating defrag, jobs are killed with the message "The job attribute PeriodicRemove expression '(JobStatus == 1 && NumJobStarts > 0)' evaluated to TRUE".
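To make clear how we read that expression, here is a minimal Python sketch. It is only a mirror of the logic, not HTCondor code: the function name and constants are ours, and it assumes the usual job ClassAd convention that JobStatus 1 means Idle and 2 means Running, with NumJobStarts counting how many times the job has started executing.

```python
# Assumed JobStatus codes from the job ClassAd convention (1 = Idle, 2 = Running).
IDLE = 1
RUNNING = 2


def periodic_remove(job_status: int, num_job_starts: int) -> bool:
    """Hypothetical mirror of '(JobStatus == 1 && NumJobStarts > 0)'."""
    return job_status == IDLE and num_job_starts > 0


# A job that is idle and has never started is kept in the queue.
print(periodic_remove(IDLE, 0))
# A job that is currently running is kept as well.
print(periodic_remove(RUNNING, 1))
# A job that is idle again after having started once matches the expression.
print(periodic_remove(IDLE, 1))
```

If this reading is right, any job that goes back to idle after it has started at least once would match the expression and be removed.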

We use DEFRAG_SCHEDULE = graceful. The full config file is below, see 1).

The HTCondor version was 8.8 until yesterday, when we upgraded to 9.0.11. The problem is the same with both versions.

If we deactivate defrag, the problem disappears.

NB: the condor_drain command also killed all the running jobs.

Do you have any idea how we can get defrag working with a "real" graceful drain?

We need defrag so that enough 8-core ATLAS jobs can start while LHCb jobs are also running.

Thanks


Edith & Carlos


1)

DAEMON_LIST = $(DAEMON_LIST) DEFRAG

DEFRAG_INTERVAL = 300
#DEFRAG_DRAINING_MACHINES_PER_HOUR = 30.0
DEFRAG_DRAINING_MACHINES_PER_HOUR = 30.0
#DEFRAG_MAX_CONCURRENT_DRAINING = 60
DEFRAG_MAX_CONCURRENT_DRAINING = 40
#DEFRAG_MAX_WHOLE_MACHINES = 300
DEFRAG_MAX_WHOLE_MACHINES = 90
DEFRAG_SCHEDULE = graceful

## Allow some defrag configuration to be settable
DEFRAG.SETTABLE_ATTRS_ADMINISTRATOR = DEFRAG_MAX_CONCURRENT_DRAINING,DEFRAG_DRAINING_MACHINES_PER_HOUR,DEFRAG_MAX_WHOLE_MACHINES
ENABLE_RUNTIME_CONFIG = TRUE

## Which machines are more desirable to drain
DEFRAG_RANK = ifThenElse(Cpus >= 8, -10, (TotalCpus - Cpus)/(8.0 - Cpus))

# Definition of a "whole" machine:
# - anything with 8 cores (since multicore jobs only need 8 cores, don't need to drain whole machines with > 8 cores)
# - must be configured to actually start new jobs (otherwise machines which are deliberately being drained will be included)

#DEFRAG_WHOLE_MACHINE_EXPR = ((Cpus == TotalCpus) || (Cpus >= 8))
#DEFRAG_WHOLE_MACHINE_EXPR = ((Cpus == TotalCpus) || (Cpus >= 8)) && StartJobs =?= True && CPPMNodeOnline =?= True

DEFRAG_WHOLE_MACHINE_EXPR = ((Cpus == TotalCpus) || (Cpus >= 8)) && StartJobs =!= True && CPPMNodeOnline =!= True

# Decide which machines to drain
# - must not be cloud machines
# - must be healthy
# - must be configured to actually start new jobs
DEFRAG_REQUIREMENTS = PartitionableSlot && StartJobs =?= True && CPPMNodeOnline =?= True

## Logs
MAX_DEFRAG_LOG = 104857600
MAX_NUM_DEFRAG_LOG = 10

#The following command may be used to view the condor_defrag daemon ClassAd:

#condor_status -l -any -constraint 'MyType == "Defrag"'
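
As a sanity check on the DEFRAG_RANK expression in the config above, here is a small Python sketch of the same arithmetic. The function name is ours and purely illustrative. It assumes that Cpus is the number of free cores on the partitionable slot, that TotalCpus is the machine's total, and that machines with a higher rank are preferred for draining.

```python
def defrag_rank(cpus: float, total_cpus: float) -> float:
    """Hypothetical mirror of:
    ifThenElse(Cpus >= 8, -10, (TotalCpus - Cpus)/(8.0 - Cpus))
    """
    if cpus >= 8:
        # Already has room for an 8-core job, so it ranks lowest.
        return -10.0
    # Busy cores divided by the cores still missing to reach 8 free.
    return (total_cpus - cpus) / (8.0 - cpus)


# 32-core node with 6 free cores: (32 - 6) / (8 - 6)
print(defrag_rank(6, 32))   # 13.0
# 32-core node with no free cores: (32 - 0) / (8 - 0)
print(defrag_rank(0, 32))   # 4.0
# 32-core node with 8 free cores: nothing to gain from draining
print(defrag_rank(8, 32))   # -10.0
```

With these numbers, a node that is only two cores short of 8 free ranks well above a fully busy node, which is the ordering we intended.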



--
--------------------------------------------------------------
Edith Knoops
CPPM/CNRS    	                  Mail: knoops@xxxxxxxxxxxxx
163 Av de Luminy case 902         Tel : (+33) (0)4 91 82 72 02
13288 Marseille Cedex 9 France
--------------------------------------------------------------