
Re: [HTCondor-users] Why jobs are not being picked anymore?



I'm guessing that node health is your problem. Look at the bottom of the output:

[0]           0  NODE_IS_HEALTHY is true

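It says that 0 job clusters satisfy this clause of the slot's requirements. With -reverse the "Matched" column counts jobs checked against the slot, not nodes, so your job is being rejected by the slot's NODE_IS_HEALTHY test.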
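
The likely reason is visible further up in the same output: the slot defines NODE_IS_HEALTHY as "true && true && ( WantEchoXrootd =?= false || WantEchoXrootd =?= undefined )", while the job carries WantEchoXrootd = true, so the expression comes out false for this job. As a rough illustration only, here is a plain-Python model of that one expression. It is not HTCondor code, and the names UNDEFINED, meta_equals and node_is_healthy are made up for the sketch:

```python
# Hypothetical stand-in for a ClassAd attribute that is not set in the job ad.
UNDEFINED = object()


def meta_equals(a, b):
    """Rough analogue of the ClassAd =?= operator.

    It always yields True or False (never undefined): the two sides must be
    the same kind of value and equal, and undefined only matches undefined.
    """
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    return type(a) is type(b) and a == b


def node_is_healthy(job_ad):
    """Model of: true && true && (WantEchoXrootd =?= false ||
    WantEchoXrootd =?= undefined), evaluated against a job ad (a dict here)."""
    want = job_ad.get("WantEchoXrootd", UNDEFINED)
    return True and True and (
        meta_equals(want, False) or meta_equals(want, UNDEFINED)
    )


# Attributes taken from the job shown in the -reverse analysis.
job = {"WantEchoXrootd": True, "x509UserProxyVOName": "lhcb"}
print(node_is_healthy(job))                              # job as submitted
print(node_is_healthy({"x509UserProxyVOName": "lhcb"}))  # attribute absent
```

If that reading is right, the things to check are why these jobs now have WantEchoXrootd = true, or whether NODE_IS_HEALTHY on those startds was changed recently to depend on it.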

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of jcaballero.hep@xxxxxxxxx
Sent: Tuesday, June 9, 2020 10:03 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Why jobs are not being picked anymore?

Hi,

I have another one of my weird questions :)
Here is the situation:

* I want certain jobs on a given schedd to run on a given set of
machines, and I don't want those machines to run anything else.
In order to achieve that, I have this configuration on the startds [1]
and this on the schedd [2].

* Indeed, I can see the requirement in the IDLE jobs [3]

* It has been working fine for a while. However, since a few days ago,
jobs stay IDLE forever.
And better-analyze claims there are no hosts that could run the jobs, as
far as I understand [4].

Do you see any clue in the output of [5] that could help me to
understand why they don't run?
Is there anything I could check in the Central Manager to troubleshoot?

Thanks a lot in advance.
Cheers,
Jose



[1]
[root @ startd ~]# rpm -qa | grep condor
condor-8.6.13-1.el7.x86_64
condor-classads-8.6.13-1.el7.x86_64
condor-procd-8.6.13-1.el7.x86_64
condor-external-libs-8.6.13-1.el7.x86_64
condor-python-8.6.13-1.el7.x86_64
tier1-condor-wn-healthcheck-1.10-1.x86_64
mjf-htcondor-00.14-1.noarch
tier1-condor-docker-1.6.4-1.noarch

[root @ startd ~]# condor_config_val STARTD_ATTRS
 RalCluster, RalSnapshot, RalBranchName, RalBranchType, ScalingFactor,
StartJobs, ShouldHibernate, PREEMPTABLE_ONLY, StartJobs,
EFFICIENT_DRAIN, KILL_SIGNAL, ONLY_LHCB

[root @ startd ~]# condor_config_val ONLY_LHCB
True

[2]
[root@schedd ~]# rpm -qa | grep condor
condor-procd-8.6.13-1.el6.x86_64
condor-classads-8.6.13-1.el6.x86_64
condor-python-8.6.13-1.el6.x86_64
condor-external-libs-8.6.13-1.el6.x86_64
condor-8.6.13-1.el6.x86_64

[root@schedd ~]# condor_config_val JOB_TRANSFORM_NAMES
, forcelhcb, DefaultDocker

[root@schedd ~]# condor_config_val JOB_TRANSFORM_forcelhcb
[
Requirements = JobUniverse == 5 && DockerImage =?= undefined && Owner
=!= "nagios" && x509UserProxyVOName == "lhcb";
set_Transformed = "forcelhcb";
set_WantDocker = true;
eval_set_DockerImage = ifThenElse(NordugridQueue =?= "EL7",
"stfc/grid-workernode-c7:2019-07-02.1",
"stfc/grid-workernode-c6:2019-07-02.1");
set_Requirements = TARGET.ONLY_LHCB;
copy_TransferInput = "OriginalTransferInput";
eval_set_TransferInput = strcat(OriginalTransferInput, ",", Cmd);
]

[3]
[root@schedd ~]# condor_q 6184050.0 -format '%s\n' Requirements
TARGET.ONLY_LHCB

[4]
[root@schedd ~]# condor_q -better-analyze 6184050.0


-- Schedd: xxxxx : <130.246.182.180:17549>
The Requirements expression for job 6184050.000 is

    TARGET.ONLY_LHCB

Job 6184050.000 defines the following attributes:


The Requirements expression for job 6184050.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           6  TARGET.ONLY_LHCB


6184050.000:  Run analysis summary ignoring user priority.  Of 13097 machines,
  12529 are rejected by your job's requirements
      6 reject your job because of their own requirements
    562 are exhausted partitionable slots
      0 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job

WARNING:  Be advised:
   Job did not match any machines's constraints
   To see why, pick a machine that you think should match and add
     -reverse -machine <name>
   to your query.

[5]
[root@schedd ~]# condor_q -better-analyze 6184050.0 -reverse -machine <startd>


-- Schedd: <schedd> : <130.246.182.180:17549>

-- Slot: slot1@<startd> : Analyzing matches for 1 Jobs in 1 autoclusters

The Requirements expression for this slot is

    ( START ) && ( IsValidCheckpointPlatform ) &&
            ( WithinResourceLimits )

  START is
    ( NODE_IS_HEALTHY is true ) &&
            ( StartJobs is true ) && ( RecentJobStarts < 20 ) &&
            ( x509UserProxyVOName is "lhcb" ) &&
            ( ScheddHostName is "<schedd>" ) &&
            ( ( UtsnameRelease is "5.3.1-1.el7.elrepo.x86_64" ) ||
                  ( x509UserProxyVOName is "lsst" ) ) &&
            ifThenElse(Offline is undefined,true,( ( CurrentTime -
QDate ) >= 900 )) &&
            ifThenElse(false,isPreemptable is true,true) &&
        ( false == false )

  IsValidCheckpointPlatform is
    ( TARGET.JobUniverse isnt 1 ||
            ( ( MY.CheckpointPlatform isnt undefined ) &&
                ( ( TARGET.LastCheckpointPlatform is MY.CheckpointPlatform ) ||
                    ( TARGET.NumCkpts == 0 ) ) ) )

  WithinResourceLimits is
    ( ifThenElse(TARGET._condor_RequestCpus isnt undefined,MY.Cpus > 0 &&
        TARGET._condor_RequestCpus <=
MY.Cpus,ifThenElse(TARGET.RequestCpus isnt undefined,MY.Cpus > 0 &&
          TARGET.RequestCpus <= MY.Cpus,1 <= MY.Cpus)) &&
      ifThenElse(TARGET._condor_RequestMemory isnt undefined,MY.Memory > 0 &&
        TARGET._condor_RequestMemory <=
MY.Memory,ifThenElse(TARGET.RequestMemory isnt undefined,MY.Memory > 0
&&
          TARGET.RequestMemory <= MY.Memory,false)) &&
      ifThenElse(TARGET._condor_RequestDisk isnt undefined,MY.Disk > 0 &&
        TARGET._condor_RequestDisk <=
MY.Disk,ifThenElse(TARGET.RequestDisk isnt undefined,MY.Disk > 0 &&
          TARGET.RequestDisk <= MY.Disk,false)) )

This slot defines the following attributes:

    CheckpointPlatform = "LINUX X86_64 5.3.1-1.el7.elrepo.x86_64
normal N/A avx avx2 ssse3 sse4_1 sse4_2"
    Cpus = 128
    Disk = 352654212
    Memory = 696600
    NODE_IS_HEALTHY = true && true && ( WantEchoXrootd =?= false ||
WantEchoXrootd =?= undefined )
    RecentJobStarts = 0
    StartJobs = true
    UtsnameRelease = "5.3.1-1.el7.elrepo.x86_64"

Job 6184050.0 has the following attributes:

    TARGET.QDate = 1591704386
    TARGET.ScheddHostName = "<schedd>"
    TARGET.JobUniverse = 5
    TARGET.NumCkpts = 0
    TARGET.RequestCpus = 1
    TARGET.RequestDisk = 75
    TARGET.RequestMemory = 4000
    TARGET.WantEchoXrootd = true
    TARGET.x509UserProxyVOName = "lhcb"

The Requirements expression for this slot reduces to these conditions:

       Clusters
Step    Matched  Condition
-----  --------  ---------
[0]           0  NODE_IS_HEALTHY is true

slot1@<startd>: Run analysis summary of 1 jobs.
    0 (0.00 %) match both slot and job requirements.
    0 match the requirements of this slot.
    1 have job requirements that match this slot.
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/