
Re: [HTCondor-users] Jobs massively killed with PeriodicRemove



Hi,

where did you define the system periodic remove expression? It actually says: remove jobs that are idle but have already been started at least once - which is exactly the state a job ends up in after being evicted ;)

This might make sense if you want to keep the idle job queue near zero and only accept jobs that start more or less immediately - still, this would be a weird approach :)
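For reference, the expression '(JobStatus == 1 && NumJobStarts > 0)' matches jobs that are back in the idle state (JobStatus == 1) after having run at least once (NumJobStarts > 0) - exactly where a graceful drain's eviction leaves them. A sketch of a gentler variant (the 24-hour grace period and the use of EnteredCurrentStatus are assumptions, not something from this thread):

```
# Sketch only (assumed policy): still remove long-idle restarted
# jobs, but give evicted jobs 24 h to be rematched before removal.
# EnteredCurrentStatus is the time the job last changed state.
SYSTEM_PERIODIC_REMOVE = JobStatus == 1 && NumJobStarts > 0 && \
    (time() - EnteredCurrentStatus) > 24 * 60 * 60
```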

Best
christoph

-- 
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

----- Original Message -----
From: "Edith Knoops" <knoops@xxxxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, 24 March 2022 10:15:00
Subject: [HTCondor-users] Jobs massively killed with PeriodicRemove

Hi,

We have a problem at our WLCG T2 site, IN2P3-CPPM. After activating defrag,
jobs are killed with: "The job attribute PeriodicRemove expression
'(JobStatus == 1 && NumJobStarts > 0)' evaluated to TRUE"

We use DEFRAG_SCHEDULE = graceful. The full config is below, under 1).

The Condor version was 8.8 until yesterday, when we upgraded to 9.0.11.
The problem is the same with both versions.

If we deactivate defrag, the problem disappears.

NB: the condor_drain command also killed all the running jobs.

Do you have an idea how we can get defrag working with a "real" graceful
drain?
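One knob worth checking (an assumption about this setup, not something visible in the config below): a graceful drain only protects running jobs for up to the startd's MAXJOBRETIREMENTTIME, which defaults to 0 seconds - so even a "graceful" drain can evict everything immediately. A sketch:

```
# On the worker nodes: let running jobs retire instead of being
# evicted during a graceful drain. 172800 s = 48 h; size this to
# the longest expected payload (the value here is an assumption).
MAXJOBRETIREMENTTIME = 172800
```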

We need defrag in order to run enough 8-core ATLAS jobs alongside the
LHCb jobs.

Thanks


Edith & Carlos


1)

DAEMON_LIST = $(DAEMON_LIST) DEFRAG

DEFRAG_INTERVAL = 300
#DEFRAG_DRAINING_MACHINES_PER_HOUR = 30.0
DEFRAG_DRAINING_MACHINES_PER_HOUR = 30.0
#DEFRAG_MAX_CONCURRENT_DRAINING = 60
DEFRAG_MAX_CONCURRENT_DRAINING = 40
#DEFRAG_MAX_WHOLE_MACHINES = 300
DEFRAG_MAX_WHOLE_MACHINES = 90
DEFRAG_SCHEDULE = graceful

## Allow some defrag configuration to be settable
DEFRAG.SETTABLE_ATTRS_ADMINISTRATOR = DEFRAG_MAX_CONCURRENT_DRAINING,DEFRAG_DRAINING_MACHINES_PER_HOUR,DEFRAG_MAX_WHOLE_MACHINES
ENABLE_RUNTIME_CONFIG = TRUE

## Which machines are more desirable to drain
DEFRAG_RANK = ifThenElse(Cpus >= 8, -10, (TotalCpus - Cpus)/(8.0 - Cpus))
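As a worked illustration of the rank expression above (the node size is hypothetical):

```
# Hypothetical 32-core node (TotalCpus = 32):
#   3 free cores:  rank = (32 - 3) / (8.0 - 3) =  5.8
#   7 free cores:  rank = (32 - 7) / (8.0 - 7) = 25.0
# Machines closer to freeing an 8-core slot rank higher and drain
# first; machines already offering >= 8 free cores get rank -10
# and are effectively skipped.
```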

# Definition of a "whole" machine:
# - anything with 8 cores (since multicore jobs only need 8 cores, don't
#   need to drain whole machines with > 8 cores)
# - must be configured to actually start new jobs (otherwise machines
#   which are deliberately being drained will be included)

#DEFRAG_WHOLE_MACHINE_EXPR = ((Cpus == TotalCpus) || (Cpus >= 8))
#DEFRAG_WHOLE_MACHINE_EXPR = ((Cpus == TotalCpus) || (Cpus >= 8)) && StartJobs =?= True && CPPMNodeOnline =?= True

DEFRAG_WHOLE_MACHINE_EXPR = ((Cpus == TotalCpus) || (Cpus >= 8)) && StartJobs =!= True && CPPMNodeOnline =!= True
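A note on the meta-operators in the two variants above: =?= is "is identically equal to" and =!= is "is not identically equal to"; unlike == and !=, neither ever evaluates to UNDEFINED. Sketch of the difference:

```
# StartJobs =?= True  ->  TRUE only if StartJobs is defined and True
# StartJobs =!= True  ->  TRUE if StartJobs is undefined or not True
# The active expression above therefore counts machines where
# StartJobs/CPPMNodeOnline are NOT True as "whole" - the opposite
# selection from the commented-out =?= variant.
```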

# Decide which machines to drain
# - must not be cloud machines
# - must be healthy
# - must be configured to actually start new jobs
DEFRAG_REQUIREMENTS = PartitionableSlot && StartJobs =?= True && CPPMNodeOnline =?= True

## Logs
MAX_DEFRAG_LOG = 104857600
MAX_NUM_DEFRAG_LOG = 10

#The following command may be used to view the condor_defrag daemon ClassAd:

#condor_status -l -any -constraint 'MyType == "Defrag"'



-- 
--------------------------------------------------------------
Edith Knoops
CPPM/CNRS    	                  Mail: knoops@xxxxxxxxxxxxx
163 Av de Luminy case 902         Tel : (+33) (0)4 91 82 72 02
13288 Marseille Cedex 9 France
--------------------------------------------------------------

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/