
Re: [HTCondor-users] peaceful DEFRAG_SCHEDULE?



On 9/2/21 2:07 AM, Carsten Aulbert wrote:
which basically means I could achieve a "peaceful" defrag if I simply set MaxJobRetirementTime to near infinite, right?


That's correct.
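
A minimal sketch of that on the execute nodes, with an arbitrarily large value standing in for "near infinite":

# Let running jobs retire undisturbed for up to a year, i.e.
# effectively never kick them off while the machine drains.
MaxJobRetirementTime = 365 * 24 * 60 * 60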


In the end, I am still torn between users with extremely long-running jobs (many cores for 10-20 days) and users wanting condor to finally match their few-hour 80+ core jobs, for which I presumably need condor_defrag to free up slots that large.


Then just four quick questions (as these are deviating more and more from the original question asked, I can write up additional emails for the list/archives if wanted):

(1) The only middle ground with condor_defrag I currently see is that we take a number of large-core-count machines, configure MaxJobRetirementTime to something we consider reasonable (along with MaxVacateTime for the few jobs which would react to that), let condor_defrag act only on these machines, and add a START expression so that they only take jobs which set a certain flag in the submit file - just to prevent very long user jobs from matching there.


Note that MaxJobRetirementTime is an expression that can look at attributes of the running job on the machine. Here in the CHTC at Wisconsin, we allow users to declare (with a +LongJob=true custom attribute in their job submit file) that their job may need longer-than-usual walltime, and MaxJobRetirementTime honors that. As a tradeoff, there are a lot of machines these jobs won't match with.
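
A minimal sketch of that idea in the startd configuration (the numbers here are illustrative, not CHTC's actual values):

# Jobs that declared +LongJob = true in their submit file may keep
# running for up to 30 days while the machine drains; everything
# else gets 24 hours of retirement.
MaxJobRetirementTime = ifThenElse(TARGET.LongJob =?= true, \
                                  30 * 24 * 60 * 60, \
                                  24 * 60 * 60)

and, on the machines that should not host long jobs at all, something like

# Refuse jobs that declared themselves long-running.
START = ($(START)) && (TARGET.LongJob =!= true)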


(2) Is there a way for condor_defrag to discover if a parallel universe job is running on a partitionable slot and then NOT consider it suitable for defrag? As DEFRAG_REQUIREMENTS "only" matches against the startd ad, I don't think it can look into the slots partitioned off - or can it?

I haven't tested this, but out of the box, the JobUniverse of a running job is advertised in the dynamic slot, but not the partitionable slot today. Some attributes of the dynamic slots are "rolled up" into the partitionable slot as a ClassAd array named childXXX. There is a startd knob, STARTD_PARTITIONABLE_SLOT_ATTRS, which adds attributes to this set. I think you could add

STARTD_PARTITIONABLE_SLOT_ATTRS = JobUniverse

and then the partitionable slot would get a ClassAd array named childJobUniverse, containing the JobUniverse value of each of the dynamic slots. You could then use this attribute in the defrag requirements expression.
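
Untested, but keeping the stock partitionable-slot clauses, a sketch of the defrag side might then look like (11 is the parallel universe's JobUniverse number):

# Never pick machines for draining while any of their dynamic
# slots is running a parallel universe (JobUniverse 11) job.
DEFRAG_REQUIREMENTS = PartitionableSlot && Offline =!= true && \
                      member(11, childJobUniverse) =!= true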



(4) Is there a downside to using "use feature : GPUs" on non-GPU nodes? As we have a mix of GPU and non-GPU hosts, writing condor_status constraints is currently much more cumbersome because you need to allow for nodes that do not have TotalGpus set. Doing this on a test node has not really shown much of a downside, but maybe there is something hidden which could be a roadblock later on?
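
For example, to select just the non-GPU nodes, one currently has to guard against the attribute being undefined, roughly:

condor_status -constraint '(TotalGpus =?= undefined) || (TotalGpus == 0)'

whereas with the GPUs feature enabled everywhere this collapses to

condor_status -constraint 'TotalGpus == 0'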


The only downside I've seen to having "use feature: GPUs" enabled everywhere is on machines with GPUs that you don't really want to use, or can't use (like desktops with onboard GPUs).


-greg