
Re: [HTCondor-users] peaceful DEFRAG_SCHEDULE?



Hi,

you might come up with a rank expression for the longer-running jobs that prefers hosts with the longest time to live; that should presumably cluster these jobs on a couple of machines?
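A minimal sketch of that idea, assuming the execute nodes advertise a custom attribute (here called `TimeToLive`, a made-up name) via STARTD_ATTRS; both the attribute name and its value are illustrative only:

```
# startd config on the execute nodes: publish the planned remaining
# lifetime (TimeToLive is a hypothetical custom attribute, in seconds)
TimeToLive = 1209600
STARTD_ATTRS = $(STARTD_ATTRS) TimeToLive

# submit file of the long-running jobs: prefer machines that
# will stay up the longest
rank = TARGET.TimeToLive
```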

Best
christoph

-- 
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

----- Original Message -----
From: "Carsten Aulbert" <carsten.aulbert@xxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, 2 September 2021 09:07:52
Subject: Re: [HTCondor-users] peaceful DEFRAG_SCHEDULE?

Hi Greg, all

(embarrassingly only replying now after forgetting about it for too long)

On 16.08.21 17:30, Greg Thain wrote:
> SIGTERM won't be sent to any job whose runtime is less than 
> MaxJobRetirementTime with a "graceful" shutdown/drain.
which basically means I could achieve a "peaceful" defrag if I simply 
set MaxJobRetirementTime to near infinite, right?
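For reference, a minimal sketch of what "near infinite" could look like in the startd config; the 30-day value is just an example:

```
# allow running jobs up to 30 days (in seconds) to finish
# before a graceful drain sends SIGTERM
MAXJOBRETIREMENTTIME = 2592000
```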

In the end, I am still torn between users with extremely long-running 
jobs (many cores for 10-20 days) and users wanting condor to finally 
match their few-hour 80+ core jobs, for which I presumably need 
condor_defrag to free up slots that large.

Then just four quick questions (as these deviate more and more from the 
original question, I can write up additional emails for the 
list/archives if wanted):

(1) The only middle ground with condor_defrag I currently see is that we 
take a number of large-core-count machines, configure 
MaxJobRetirementTime to something we consider reasonable (along with 
MaxVacateTime for the few jobs which would react to that), let 
condor_defrag act only on these machines, and add a START expression so 
that they only accept jobs which set a certain flag in the submit file - 
just to prevent very long user jobs from matching there.
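A sketch of such an opt-in flag, with a made-up attribute name (`RequestDefragNode`):

```
# startd config on the defrag-eligible machines: only accept jobs
# that explicitly opt in (attribute name is hypothetical)
START = ($(START)) && (TARGET.RequestDefragNode =?= True)

# submit file of the short, wide jobs:
+RequestDefragNode = True
```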

(2) Is there a way for condor_defrag to discover that a parallel 
universe job is running on a partitionable slot and then NOT consider it 
suitable for defrag? As DEFRAG_REQUIREMENTS "only" matches against the 
startd ad, I don't think it can look into the slots partitioned off - or 
can it?
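One untested idea: if the pool has p-slot rollup enabled (ADVERTISE_PSLOT_ROLLUP_INFORMATION), the partitionable slot's ad carries Child* attribute lists summarizing the dynamic slots. Whether a per-child universe attribute (guessed below as `ChildJobUniverse`) is actually among them should be verified with `condor_status -l` on a p-slot first:

```
# defrag config sketch - ChildJobUniverse is a guess, verify it exists
# in your p-slot ads before relying on this; universe 11 = parallel
DEFRAG_REQUIREMENTS = PartitionableSlot && Offline =!= True && \
    !member(11, ChildJobUniverse)
```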

(3) Only slightly related: when performing some changes to the 
node/condor configuration, we usually let the node run dry via 
condor_off -peaceful -startd. That works; however, given that some jobs 
run for a really long time, we sometimes simply lose track of these 
nodes, and they vanish from condor_status once they are no longer 
available. Is there already a mechanism by which these nodes could 
somehow send a notification that they are empty?

   Or would it be better to set START=False via condor_config_val 
instead and then monitor nodes where TotalCpus==Cpus?
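That alternative could look roughly like this (a sketch; assumes the pool permits remote config changes via condor_config_val -set, and the hostname is made up):

```
# stop accepting new jobs on the node, without killing running ones
condor_config_val -name node042.example.org -startd -set "START = False"
condor_reconfig node042.example.org

# later: list drained nodes, i.e. p-slots with all cores free again
condor_status -const 'PartitionableSlot && TotalCpus == Cpus' -af Machine
```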

(4) Is there a downside to using "use feature : GPUs" on non-GPU nodes? 
As we have a mix of GPU and non-GPU hosts, writing condor_status 
constraints is currently much more cumbersome because you need to allow 
for nodes that have not set TotalGpus. Doing this on a test node has not 
really shown much of a downside, but maybe there is something hidden 
which could be a roadblock later on?

The goal is to get rid of the "undefined" in something as simple as
   condor_status -const PartitionableSlot -af 'TotalGpus == 0' | sort | uniq
false
true
undefined
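Until then, a workaround that avoids the undefined without touching the node config is to coerce it in the expression itself (isUndefined and ifThenElse are standard ClassAd functions):

```
condor_status -const PartitionableSlot \
    -af 'ifThenElse(isUndefined(TotalGpus), 0, TotalGpus) == 0' | sort | uniq
```

which should then yield only false/true.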

Cheers

Carsten

-- 
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany, Phone +49 511 762 17185



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/