Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


[HTCondor-users] Jobs massively killed with PeriodicRemove

Date: Thu, 24 Mar 2022 10:15:00 +0100
From: Edith Knoops <knoops@xxxxxxxxxxxxx>
Subject: [HTCondor-users] Jobs massively killed with PeriodicRemove

Hi,

We have a problem at our WLCG T2 site, IN2P3-CPPM. After activating defrag, jobs are killed with "The job attribute PeriodicRemove expression '(JobStatus == 1 && NumJobStarts > 0)' evaluated to TRUE".
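
To illustrate when this expression fires, here is a minimal Python sketch. It is a simplified stand-in for ClassAd evaluation, not HTCondor code: the function name `periodic_remove` and the dict-based job ads are hypothetical, and ClassAd UNDEFINED handling is ignored. It assumes that a job evicted during draining returns to Idle (JobStatus 1) while keeping its NumJobStarts count.

```python
# Simplified stand-in for the ClassAd expression
#   (JobStatus == 1 && NumJobStarts > 0)
# JobStatus 1 is Idle, 2 is Running. UNDEFINED handling is ignored here.
IDLE = 1
RUNNING = 2


def periodic_remove(job_ad):
    """Return True when the expression would evaluate to TRUE for this job ad."""
    return job_ad["JobStatus"] == IDLE and job_ad["NumJobStarts"] > 0


# A job that has never started: Idle, zero starts.
fresh = {"JobStatus": IDLE, "NumJobStarts": 0}
# The same job while it is running.
running = {"JobStatus": RUNNING, "NumJobStarts": 1}
# Assumption: after an eviction the job is Idle again and keeps its start count.
evicted = {"JobStatus": IDLE, "NumJobStarts": 1}

for name, ad in [("fresh", fresh), ("running", running), ("evicted", evicted)]:
    print(name, periodic_remove(ad))
```

Under that assumption, only the evicted job matches, which is consistent with jobs being removed as soon as they are pushed back to the queue.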


We use DEFRAG_SCHEDULE = graceful. The full config file is below (1).

The HTCondor version was 8.8 until yesterday, when we upgraded to 9.0.11. The problem is the same with both versions.


If we deactivate defrag, the problem disappears.

NB: the command condor_drain also killed all the running jobs.

Do you have an idea how we can get defrag working with a "real" graceful drain?

We need defrag in order to get enough 8-core ATLAS jobs running alongside the LHCb jobs.


Thanks


Edith & Carlos


1)

DAEMON_LIST = $(DAEMON_LIST) DEFRAG

DEFRAG_INTERVAL = 300
#DEFRAG_DRAINING_MACHINES_PER_HOUR = 30.0
DEFRAG_DRAINING_MACHINES_PER_HOUR = 30.0
#DEFRAG_MAX_CONCURRENT_DRAINING = 60
DEFRAG_MAX_CONCURRENT_DRAINING = 40
#DEFRAG_MAX_WHOLE_MACHINES = 300
DEFRAG_MAX_WHOLE_MACHINES = 90
DEFRAG_SCHEDULE = graceful

## Allow some defrag configuration to be settable

DEFRAG.SETTABLE_ATTRS_ADMINISTRATOR = DEFRAG_MAX_CONCURRENT_DRAINING,DEFRAG_DRAINING_MACHINES_PER_HOUR,DEFRAG_MAX_WHOLE_MACHINES

ENABLE_RUNTIME_CONFIG = TRUE

## Which machines are more desirable to drain
DEFRAG_RANK = ifThenElse(Cpus >= 8, -10, (TotalCpus - Cpus)/(8.0 - Cpus))

# Definition of a "whole" machine:

# - anything with 8 cores (since multicore jobs only need 8 cores, don't need to drain whole machines with > 8 cores)
# - must be configured to actually start new jobs (otherwise machines which are deliberately being drained will be included)


#DEFRAG_WHOLE_MACHINE_EXPR = ((Cpus == TotalCpus) || (Cpus >= 8))

#DEFRAG_WHOLE_MACHINE_EXPR = ((Cpus == TotalCpus) || (Cpus >= 8)) && StartJobs =?= True && CPPMNodeOnline =?= True

DEFRAG_WHOLE_MACHINE_EXPR = ((Cpus == TotalCpus) || (Cpus >= 8)) && StartJobs =!= True && CPPMNodeOnline =!= True


# Decide which machines to drain
# - must not be cloud machines
# - must be healthy
# - must be configured to actually start new jobs

DEFRAG_REQUIREMENTS = PartitionableSlot && StartJobs =?= True && CPPMNodeOnline =?= True


## Logs
MAX_DEFRAG_LOG = 104857600
MAX_NUM_DEFRAG_LOG = 10

#The following command may be used to view the condor_defrag daemon ClassAd:

#condor_status -l -any -constraint 'MyType == "Defrag"'
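
The DEFRAG_RANK expression in the config above can be traced with a short Python sketch. This is only an illustration of the arithmetic: the function name `defrag_rank` is hypothetical, `cpus` stands for the free cores advertised by the partitionable slot (Cpus) and `total_cpus` for TotalCpus, and it assumes that machines with a higher rank are preferred for draining.

```python
def defrag_rank(cpus, total_cpus):
    """Mirror of: ifThenElse(Cpus >= 8, -10, (TotalCpus - Cpus)/(8.0 - Cpus))."""
    if cpus >= 8:
        # Already has room for an 8-core job, so draining it gains nothing.
        return -10
    # Busy cores divided by the cores still missing to reach 8 free.
    return (total_cpus - cpus) / (8.0 - cpus)


# Example 16-core machines with different numbers of free cores.
for free in (0, 4, 7, 8):
    print(free, defrag_rank(free, 16))
```

With these example values, a machine that is only one core short of 8 free cores ranks highest, and a machine that already has 8 free cores ranks lowest.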



--
--------------------------------------------------------------
Edith Knoops
CPPM/CNRS    	                  Mail: knoops@xxxxxxxxxxxxx
163 Av de Luminy case 902         Tel : (+33) (0)4 91 82 72 02
13288 Marseille Cedex 9 France
--------------------------------------------------------------

Follow-Ups:
- Re: [HTCondor-users] Jobs massively killed with PeriodicRemove
  - From: Beyer, Christoph
