
Re: [HTCondor-users] Why jobs are not being picked anymore?



Hi,

that seems to be the problem.
Thanks a lot!!

Cheers,
Jose

On Thu, Jun 11, 2020 at 21:54, John M Knoeller
(<johnkn@xxxxxxxxxxx>) wrote:
>
> I'm guessing that node health is your problem.  Look at the bottom of the output.
>
> [0]           0  NODE_IS_HEALTHY is true
>
> It says that 0 of the nodes that matched the earlier clauses satisfy this requirement clause.
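>
> Looking at the slot attributes in [5], NODE_IS_HEALTHY expands to
>
>     true && true && ( WantEchoXrootd =?= false || WantEchoXrootd =?= undefined )
>
> and the job ad shows TARGET.WantEchoXrootd = true, so that expression can
> never evaluate to true for this job.  A quick way to confirm it from the
> schedd, using the job id from your query:
>
>     condor_q 6184050.0 -af WantEchoXrootd
>
> If that prints true, either the job has to stop setting WantEchoXrootd or
> the health-check expression on the startds has to tolerate it.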
>
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of jcaballero.hep@xxxxxxxxx
> Sent: Tuesday, June 9, 2020 10:03 AM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: [HTCondor-users] Why jobs are not being picked anymore?
>
> Hi,
>
> I have another one of my weird questions :)
> Here is the situation:
>
> * I want certain jobs on a given schedd to run on a given set of
> machines, and I don't want those machines to run anything else.
> To achieve that, I have this configuration on the startds [1]
> and this on the schedd [2] (a minimal sketch of the pattern follows
> this list).
>
> * indeed, I can see the requirement in the IDLE jobs [3]
>
> * it has been working fine for a while. However, since a few days ago,
> jobs stay IDLE forever, and better-analyze claims there are no hosts
> that could run them, as far as I understand [4].
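>
> For reference, a minimal sketch of the startd side of this pinning
> pattern (names taken from [1] and [5]; the real START on the nodes is
> the fuller expression shown in [5]):
>
>     # advertise the flag and only start jobs from the lhcb VO
>     ONLY_LHCB = True
>     STARTD_ATTRS = $(STARTD_ATTRS) ONLY_LHCB
>     START = $(START) && (TARGET.x509UserProxyVOName =?= "lhcb")
>
> On the schedd side, the forcelhcb transform in [2] rewrites the job's
> Requirements to TARGET.ONLY_LHCB, so the pinning is enforced from both
> directions.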
>
> Do you see any clue in the output of [5] that could help me understand
> why they don't run?
> Is there anything I could check on the Central Manager to troubleshoot?
>
> Thanks a lot in advance.
> Cheers,
> Jose
>
>
>
> [1]
> [root @ startd ~]# rpm -qa | grep condor
> condor-8.6.13-1.el7.x86_64
> condor-classads-8.6.13-1.el7.x86_64
> condor-procd-8.6.13-1.el7.x86_64
> condor-external-libs-8.6.13-1.el7.x86_64
> condor-python-8.6.13-1.el7.x86_64
> tier1-condor-wn-healthcheck-1.10-1.x86_64
> mjf-htcondor-00.14-1.noarch
> tier1-condor-docker-1.6.4-1.noarch
>
> [root @ startd ~]# condor_config_val STARTD_ATTRS
>  RalCluster, RalSnapshot, RalBranchName, RalBranchType, ScalingFactor,
> StartJobs, ShouldHibernate, PREEMPTABLE_ONLY, StartJobs,
> EFFICIENT_DRAIN, KILL_SIGNAL, ONLY_LHCB
>
> [root @ startd ~]# condor_config_val ONLY_LHCB
> True
>
> [2]
> [root@schedd ~]# rpm -qa | grep condor
> condor-procd-8.6.13-1.el6.x86_64
> condor-classads-8.6.13-1.el6.x86_64
> condor-python-8.6.13-1.el6.x86_64
> condor-external-libs-8.6.13-1.el6.x86_64
> condor-8.6.13-1.el6.x86_64
>
> [root@schedd ~]# condor_config_val JOB_TRANSFORM_NAMES
> , forcelhcb, DefaultDocker
>
> [root@schedd ~]# condor_config_val JOB_TRANSFORM_forcelhcb
> [
>   Requirements = JobUniverse == 5 && DockerImage =?= undefined &&
>       Owner =!= "nagios" && x509UserProxyVOName == "lhcb";
>   set_Transformed = "forcelhcb";
>   set_WantDocker = true;
>   eval_set_DockerImage = ifThenElse(NordugridQueue =?= "EL7",
>       "stfc/grid-workernode-c7:2019-07-02.1",
>       "stfc/grid-workernode-c6:2019-07-02.1");
>   set_Requirements = TARGET.ONLY_LHCB;
>   copy_TransferInput = "OriginalTransferInput";
>   eval_set_TransferInput = strcat(OriginalTransferInput, ",", Cmd);
> ]
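>
> (A quick way to check that the transform actually fired on a job is to
> print the attributes it sets:
>
>     condor_q 6184050.0 -af Transformed WantDocker DockerImage
>
> which should show forcelhcb, true and the chosen image.)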
>
> [3]
> [root@schedd ~]# condor_q 6184050.0 -format '%s\n' Requirements
> TARGET.ONLY_LHCB
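>
> The machine side can be cross-checked the same way, by counting the
> slots that advertise the flag from [1]:
>
>     condor_status -constraint 'ONLY_LHCB =?= True' -af Name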
>
> [4]
> [root@schedd ~]# condor_q -better-analyze 6184050.0
>
>
> -- Schedd: xxxxx : <130.246.182.180:17549>
> The Requirements expression for job 6184050.000 is
>
>     TARGET.ONLY_LHCB
>
> Job 6184050.000 defines the following attributes:
>
>
> The Requirements expression for job 6184050.000 reduces to these conditions:
>
>          Slots
> Step    Matched  Condition
> -----  --------  ---------
> [0]           6  TARGET.ONLY_LHCB
>
>
> 6184050.000:  Run analysis summary ignoring user priority.  Of 13097 machines,
>   12529 are rejected by your job's requirements
>       6 reject your job because of their own requirements
>     562 are exhausted partitionable slots
>       0 match and are already running your jobs
>       0 match but are serving other users
>       0 are available to run your job
>
> WARNING:  Be advised:
>    Job did not match any machines's constraints
>    To see why, pick a machine that you think should match and add
>      -reverse -machine <name>
>    to your query.
>
> [5]
> [root@schedd ~]# condor_q -better-analyze 6184050.0 -reverse -machine <startd>
>
>
> -- Schedd: <schedd> : <130.246.182.180:17549>
>
> -- Slot: slot1@<startd> : Analyzing matches for 1 Jobs in 1 autoclusters
>
> The Requirements expression for this slot is
>
>     ( START ) && ( IsValidCheckpointPlatform ) &&
>             ( WithinResourceLimits )
>
>   START is
>     ( NODE_IS_HEALTHY is true ) &&
>             ( StartJobs is true ) && ( RecentJobStarts < 20 ) &&
>             ( x509UserProxyVOName is "lhcb" ) &&
>             ( ScheddHostName is "<schedd>" ) &&
>             ( ( UtsnameRelease is "5.3.1-1.el7.elrepo.x86_64" ) ||
>                   ( x509UserProxyVOName is "lsst" ) ) &&
>             ifThenElse(Offline is undefined, true, ( ( CurrentTime - QDate ) >= 900 )) &&
>             ifThenElse(false,isPreemptable is true,true) &&
>         ( false == false )
>
>   IsValidCheckpointPlatform is
>     ( TARGET.JobUniverse isnt 1 ||
>             ( ( MY.CheckpointPlatform isnt undefined ) &&
>                 ( ( TARGET.LastCheckpointPlatform is MY.CheckpointPlatform ) ||
>                     ( TARGET.NumCkpts == 0 ) ) ) )
>
>   WithinResourceLimits is
>     ( ifThenElse(TARGET._condor_RequestCpus isnt undefined,
>           MY.Cpus > 0 && TARGET._condor_RequestCpus <= MY.Cpus,
>           ifThenElse(TARGET.RequestCpus isnt undefined,
>               MY.Cpus > 0 && TARGET.RequestCpus <= MY.Cpus,
>               1 <= MY.Cpus)) &&
>       ifThenElse(TARGET._condor_RequestMemory isnt undefined,
>           MY.Memory > 0 && TARGET._condor_RequestMemory <= MY.Memory,
>           ifThenElse(TARGET.RequestMemory isnt undefined,
>               MY.Memory > 0 && TARGET.RequestMemory <= MY.Memory,
>               false)) &&
>       ifThenElse(TARGET._condor_RequestDisk isnt undefined,
>           MY.Disk > 0 && TARGET._condor_RequestDisk <= MY.Disk,
>           ifThenElse(TARGET.RequestDisk isnt undefined,
>               MY.Disk > 0 && TARGET.RequestDisk <= MY.Disk,
>               false)) )
>
> This slot defines the following attributes:
>
>     CheckpointPlatform = "LINUX X86_64 5.3.1-1.el7.elrepo.x86_64 normal N/A avx avx2 ssse3 sse4_1 sse4_2"
>     Cpus = 128
>     Disk = 352654212
>     Memory = 696600
>     NODE_IS_HEALTHY = true && true && ( WantEchoXrootd =?= false || WantEchoXrootd =?= undefined )
>     RecentJobStarts = 0
>     StartJobs = true
>     UtsnameRelease = "5.3.1-1.el7.elrepo.x86_64"
>
> Job 6184050.0 has the following attributes:
>
>     TARGET.QDate = 1591704386
>     TARGET.ScheddHostName = "<schedd>"
>     TARGET.JobUniverse = 5
>     TARGET.NumCkpts = 0
>     TARGET.RequestCpus = 1
>     TARGET.RequestDisk = 75
>     TARGET.RequestMemory = 4000
>     TARGET.WantEchoXrootd = true
>     TARGET.x509UserProxyVOName = "lhcb"
>
> The Requirements expression for this slot reduces to these conditions:
>
>        Clusters
> Step    Matched  Condition
> -----  --------  ---------
> [0]           0  NODE_IS_HEALTHY is true
>
> slot1@<startd>: Run analysis summary of 1 jobs.
>     0 (0.00 %) match both slot and job requirements.
>     0 match the requirements of this slot.
>     1 have job requirements that match this slot.
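>
> (For reference, the job-dependent part of NODE_IS_HEALTHY above can be
> evaluated directly inside the job ad:
>
>     condor_q 6184050.0 -af 'WantEchoXrootd =?= false || WantEchoXrootd =?= undefined'
>
> which prints false for this job, consistent with the 0 in the reduction
> table.)
>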
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
>