
[HTCondor-users] Issues with "WithinResourceLimits" on ARM nodes



Dear all,

We are currently adding some ARM machines (HTCondor 10.0.9) to our cluster (HTCondor 9.0.17).
We noticed strange behavior when we submit ARM jobs.
They fill the machine only up to ~50% of the available cores, even though the partitionable slot on that machine has plenty of resources to accept new jobs.
After some troubleshooting we found what seems to be the cause: idle jobs don't match the WithinResourceLimits expression.

Running a reverse analysis of an idle job against the slot, we get the following output:
$ sudo condor_q -n ce02-htc 13177276 -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxxx


-- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:9618?...

-- Slot: slot1@xxxxxxxxxxxxxxxxxxxxxxxxx : Analyzing matches for 1 Jobs in 1 autoclusters

The Requirements expression for this slot is

    START && (WithinResourceLimits)

  START is
    (StartJobs is true) &&
        (TotalLoadAvg < 85) &&
    TARGET.WantARM is true

  WithinResourceLimits is
    (ifThenElse(TARGET._cp_orig_RequestCpus isnt undefined,TARGET.RequestCpus <= MY.Cpus,MY.ConsumptionCpus <= MY.Cpus) &&
      ifThenElse(TARGET._cp_orig_RequestDisk isnt undefined,TARGET.RequestDisk <= MY.Disk,MY.ConsumptionDisk <= MY.Disk) &&
      ifThenElse(TARGET._cp_orig_RequestMemory isnt undefined,TARGET.RequestMemory <= MY.Memory,MY.ConsumptionMemory <= MY.Memory))

This slot defines the following attributes:

    ConsumptionCpus = TARGET.RequestCpus
    ConsumptionDisk = quantize(target.RequestDisk,{ 1024 })
    ConsumptionMemory = quantize(target.RequestMemory,{ 128 })
    Cpus = 120
    Disk = 2020164004
    Memory = 19760
    StartJobs = true && ( !t1_overheat) && (t1_mc_grace)
    t1_mc_grace = ((TARGET.RequestCpus > 1) || ((TARGET.RequestCpus == 1) &&  !(MC_GRACE ?: false)))
    t1_overheat = ((t1_Ambient_T ?: 20) > 30 || ((t1_Inlet_T ?: 40) > 50) || max({ t1_CPU1_T ?: 40,t1_CPU2_T ?: 40 }) > 85) ?: false
    TotalLoadAvg = 0.75

Job 13177276.0 has the following attributes:

    TARGET.RequestCpus = 8
    TARGET.RequestDisk = 100
    TARGET.RequestMemory = 32000
    TARGET.WantARM = true

The Requirements expression for this slot reduces to these conditions:

       Clusters
Step    Matched  Condition
-----  --------  ---------
[0]           1  StartJobs is true
[3]           1  TARGET.WantARM is true
[5]           0  WithinResourceLimits

slot1@xxxxxxxxxxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
    0 (0.00 %) match both slot and job requirements.
    0 match the requirements of this slot.
    1 have job requirements that match this slot.

Checking the expression against the job and slot attributes, we don't understand why it evaluates to false.
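
To make the check concrete, here is a minimal Python sketch that re-evaluates the three clauses of WithinResourceLimits using only the values printed in the analysis above. The helper names (`quantize`, `within_resource_limits`) are hypothetical, and `quantize` only approximates the list form of the ClassAd function. The sketch also assumes the attributes shown by condor_q are the ones actually used at match time, which may not hold for a partitionable slot.

```python
import math


def quantize(value, steps):
    """Approximate ClassAd quantize(value, {list}): return the first list
    entry >= value, else round up to a multiple of the last entry."""
    for step in steps:
        if step >= value:
            return step
    last = steps[-1]
    return math.ceil(value / last) * last


# Values copied from the condor_q -reverse output.
slot = {"Cpus": 120, "Disk": 2020164004, "Memory": 19760}
job = {"RequestCpus": 8, "RequestDisk": 100, "RequestMemory": 32000}


def within_resource_limits(slot, job, cp_orig_defined=False):
    """Evaluate each clause of WithinResourceLimits separately.

    cp_orig_defined selects the ifThenElse branch: True compares the raw
    Request* values, False compares the Consumption* values.
    """
    if cp_orig_defined:
        need = {
            "Cpus": job["RequestCpus"],
            "Disk": job["RequestDisk"],
            "Memory": job["RequestMemory"],
        }
    else:
        need = {
            "Cpus": job["RequestCpus"],                      # ConsumptionCpus
            "Disk": quantize(job["RequestDisk"], [1024]),    # ConsumptionDisk
            "Memory": quantize(job["RequestMemory"], [128]), # ConsumptionMemory
        }
    return {name: need[name] <= slot[name] for name in need}


checks = within_resource_limits(slot, job)
for name, ok in checks.items():
    print(f"{name}: {'ok' if ok else 'FAILS'}")
```

With the values printed, the Cpus and Disk clauses hold in this sketch, while the Memory clause fails on either branch because the request (32000) exceeds the slot's Memory (19760).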

Do you have any advice?

Cheers,
Alessandro
