
Re: [HTCondor-users] Machine ClassAd inside SYSTEM_PERIODIC_HOLD?



There is no easy way to add a TARGET ad to the evaluation of SYSTEM_PERIODIC_HOLD. Doing that would require us to change the code in the Schedd, and even then the TARGET would not always be available when SYSTEM_PERIODIC_HOLD was evaluated and would not be updated during the course of the job. The Schedd only sees the Machine ad once, when it gets a match from the Negotiator.

 

That does not mean that you can’t do something like this; you will just have to go about it a different way. If you control the job, you can use $$() expansion or job_machine_attrs to inject an attribute into the job whose value comes from the machine, then look at that attribute from the SYSTEM_PERIODIC_HOLD expression (see the sketch below).
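
For example, a minimal sketch of the submit-file side (assuming you control the submit file; the attribute names simply mirror the ones from the question):

   # submit file: record these machine attributes in the job ad at match time;
   # each recorded attribute X appears in the job ad as MachineAttrX0
   job_machine_attrs = TotalDisk, TotalCpus

SYSTEM_PERIODIC_HOLD could then reference MachineAttrTotalDisk0 and MachineAttrTotalCpus0 instead of TARGET.TotalDisk and TARGET.TotalCpus.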

 

Unfortunately, the Disk attribute of a Machine ClassAd is dynamic and not necessarily updated as frequently as you would need, so you are better off just tagging the machines that have small disks with a special attribute at config time, something like the example below. That attribute will be more reliably available for $$() expansion.

 

   # add to the config of machines that have small disks
   STARTD_ATTRS = $(STARTD_ATTRS) SmallDiskMachine
   SmallDiskMachine = true
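
Putting the pieces together, a rough sketch of what the submit file and the schedd config could look like (the thresholds are placeholders, not recommendations, and JobMemoryLimit just follows the expression from the original question):

   # submit file: copy the tag into the job ad (as MachineAttrSmallDiskMachine0)
   job_machine_attrs = SmallDiskMachine

   # schedd config: apply a tighter per-job disk limit only to jobs that matched
   # a tagged machine (DiskUsage is in KB, so 25000000 is roughly 25 GB)
   SYSTEM_PERIODIC_HOLD = \
      (JobStatus == 1 || JobStatus == 2) && \
      ( DiskUsage > ifThenElse(MachineAttrSmallDiskMachine0 =?= true, 10000000, 25000000) \
        || ResidentSetSize > JobMemoryLimit * 2 )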

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of SCHAER Frederic
Sent: Wednesday, June 24, 2020 7:46 AM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Machine ClassAd inside SYSTEM_PERIODIC_HOLD?

 

Hi,

 

We have startd nodes with differing hardware specs, especially disks (different each time we purchase them, that is to say: each year).

On some old nodes we have 2 x 250GB disks; on the latest ones we have 3 x 1TB disks.

 

In order to protect machines from jobs, we set a SYSTEM_PERIODIC_HOLD that takes the job disk usage into account, and it’s working well… but it’s very restrictive because of those 250GB nodes.

 

I am therefore trying to set a periodic hold that is more generous where machines have more disk, and more restrictive where they don’t.

The jobs we receive are grid jobs, and no disk requirement is sent with them.

 

I tried to define the hold expression so that it would use the machine’s disk size to compute a reasonable value that could exceed the per-core limit we usually define (currently 25 GB per core), but it seems Machine ClassAds are not defined while evaluating the hold expressions.

I.e., the shadow debug logs show:

 

# for: MAX_DISK_KB = debug( TARGET.TotalDisk / TARGET.TotalCpus )
# and then
SYSTEM_PERIODIC_HOLD = \
   (JobStatus == 1 || JobStatus == 2) && ((DiskUsage > $(MAX_DISK_KB) || ResidentSetSize > JobMemoryLimit * 2 ))

 

The shadow logs give:

 

1592996884 2020/06/24 13:08:04 (83.0) (1308254): Classad debug: [0.00215ms] TARGET --> UNDEFINED

1592996884 2020/06/24 13:08:04 (83.0) (1308254): Classad debug: [0.10014ms] TARGET.TotalCpus --> UNDEFINED

1592996884 2020/06/24 13:08:04 (83.0) (1308254): Classad debug: [0.55599ms] TARGET.TotalDisk / TARGET.TotalCpus --> UNDEFINED

 

  • Question: would there be a “simple” way to get this working? Any way to get Machine ClassAds available while evaluating SYSTEM_PERIODIC_HOLD?

 

Of course we could get rid of those 250GB disks and nodes, but for now we are still coping with them, and we’ll face the same issue over and over again when new machines have, say, 3 x 3TB NVMe disks or whatever…

 

Thanks && regards

Frederic