
Re: [HTCondor-users] after setting JOB_TRANSFORM in the schedd, jobs are not picked up by hosts



Hi TJ,

Thanks a lot. Quite useful, indeed.

Cheers,
Jose

On Tue, 24 Nov 2020 at 17:25, John M Knoeller
(<johnkn@xxxxxxxxxxx>) wrote:
>
> It's hard to know for sure, but it looks like your ONLY_LHCB slots are just busy running other jobs.
>
> You can see that here
>
>         2399.000:  Run analysis summary ignoring user priority.  Of 15795 machines,
>           15126 are rejected by your job's requirements
>             104 reject your job because of their own requirements
> -->>>    565 are exhausted partitionable slots
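
A quick arithmetic check on that summary, as a minimal Python sketch (the variable names are mine; the numbers are copied from the output quoted above). The three categories add up to the whole pool, which is consistent with nothing being left over as available:

```python
# Counts copied from the condor_q -better-analyze summary for job 2399.0.
total_machines = 15795
rejected_by_job_requirements = 15126
rejected_by_slot_requirements = 104
exhausted_partitionable_slots = 565

accounted_for = (rejected_by_job_requirements
                 + rejected_by_slot_requirements
                 + exhausted_partitionable_slots)

# Anything not accounted for would have to show up in one of the
# "match" or "available" lines of the summary.
remaining = total_machines - accounted_for
print(accounted_for, remaining)  # prints: 15795 0
```
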
>
> And also here
>
>         -- Schedd: arc-ce-test02.gridpp.rl.ac.uk : <130.246.182.100:12237>
>         2399.0: Analyzing matches for 1 job
>                               Slot    Slot's Req      Job's Req     Both
>         Name                  Type    Matches Job   Matches Slot    Match %
>         -------------------   -----   ------------   ------------ ----------
>         slot1@machine01       Part               0              1       0.00
>         slot1_10@machine01    Dyn                0              1       0.00
>         .....
>
> It looks like the Dynamic slots are too small to match, and the p-slot is out of resources so it doesn't match either.
>
> -tj
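
To make the explanation above concrete, here is a minimal Python sketch of the idea, not of HTCondor's real matchmaking. It assumes the slot-side check boils down to "the job's request must fit in what the slot currently advertises"; real slot Requirements also involve the START expression and other policy. The `Slot`, `Job` and `slot_accepts` names and all resource numbers are hypothetical, since the thread does not show the job's RequestCpus or RequestMemory.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    name: str
    cpus: int        # CPUs the slot currently advertises
    memory_mb: int   # memory (MB) the slot currently advertises


@dataclass
class Job:
    request_cpus: int
    request_memory_mb: int


def slot_accepts(slot: Slot, job: Job) -> bool:
    """Simplified stand-in for the slot-side check: the job must fit."""
    return (job.request_cpus <= slot.cpus
            and job.request_memory_mb <= slot.memory_mb)


# Hypothetical numbers, not taken from the thread.
job = Job(request_cpus=8, request_memory_mb=16000)
slots = [
    # Partitionable slot with almost nothing left to hand out.
    Slot("slot1@machine01", cpus=0, memory_mb=500),
    # Dynamic slot carved out for a smaller job that is still running.
    Slot("slot1_10@machine01", cpus=1, memory_mb=2000),
]

results = {s.name: slot_accepts(s, job) for s in slots}
for name, ok in results.items():
    print(name, "accepts job" if ok else "rejects job")
```

Under these assumptions both slots reject the job even though the job's own Requirements (TARGET.ONLY_LHCB) would be satisfied, which mirrors the "0 / 1" pattern in the -reverse output.
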
>
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of jcaballero.hep@xxxxxxxxx
> Sent: Monday, November 23, 2020 3:57 AM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: [HTCondor-users] after setting JOB_TRANSFORM in the schedd, jobs are not picked up by hosts
>
> Hi,
>
> This is a follow-up to the previous thread "Multiple
> JOB_TRANSFORMATION blocks not working on a schedd".
> I think I fixed the configuration at the schedd, but matchmaking
> is still not working the way I need.
>
> I am trying to make the matchmaker force a certain type of job
> (those from the lhcb VO) to run only on a selected set of machines.
> To do this, I have added a special ClassAd attribute to those
> machines, as in [1],
> and I am trying to require that attribute in the Requirements expression of
> these jobs, via JOB_TRANSFORM [2], as we discussed in the other
> thread.
> Jobs look like this [3].
>
> However, those jobs are not being picked up.
>
> Those machines are currently busy running other jobs,
> but I thought that would be indicated in the output of condor_q
> -analyze, something like
>
>             N match but are serving other users
>
> That is not the case [4].
> It says that 106 slots matched, and yet there is no successful matching.
>
> Adding -reverse -machine options seems to indicate that the issue is
> that the jobs don't meet some requirements from the machine [5].
> That surprises me a little bit, since I do not remove or overwrite any
> job attribute in the second JOB_TRANSFORM block, except Requirements.
> Indeed, the same command on a production schedd, against a job that is
> currently running, also gives me zeros in the "Slot's Req Matches Job"
> column.
> So I may not be understanding what it means....
>
> Any tip on how to troubleshoot this lack of matching is more than welcome.
>
> Thanks a lot in advance.
> Cheers,
> Jose
>
> ======================================================================
>
> [1]
>
>         [root@machine01 ~]# condor_config_val -dump | grep STARTD_ATTR
>         STARTD_ATTRS =  <... other attributes...>, ShouldHibernate, PREEMPTABLE_ONLY, StartJobs, EFFICIENT_DRAIN, KILL_SIGNAL, ONLY_LHCB
>         SYSTEM_STARTD_ATTRS = COLLECTOR_HOST_STRING DedicatedScheduler
>
>         [root@lcg1863 config.d]# condor_config_val ONLY_LHCB
>         True
>
> ======================================================================
>
> [2]
>
>         JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES), DefaultDocker, forcelhcb
>
>         JOB_TRANSFORM_DefaultDocker @=end
>         [
>            Requirements = JobUniverse == 5 && DockerImage =?= undefined && Owner =!= "nagios";
>            set_WantDocker = true;
>            set_Requirements = ( TARGET.HasDocker ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.Cpus >= RequestCpus ) && ( TARGET.HasFileTransfer ) && ( x509UserProxyVOName =?= "atlas" && NumJobStarts == 0 || x509UserProxyVOName =!= "atlas");
>            copy_TransferInput = "OriginalTransferInput";
>            eval_set_TransferInput = strcat(OriginalTransferInput, ",", Cmd);
>            set_PeriodicRemove = ( (RemoteUserCpu + RemoteSysCpu > JobCpuLimit) ?: False ) || ( (RemoteWallClockTime > JobTimeLimit) ?: False )
>         ]
>         @end
>
>
>         JOB_TRANSFORM_forcelhcb @=end
>         [
>            Requirements = JobUniverse == 5 && x509UserProxyVOName == "lhcb" && ScheddHostName == "ce-test";
>            set_Requirements = TARGET.ONLY_LHCB;
>         ]
>         @end
>
>
> ======================================================================
>
> [3]
>
>         [root@ce-test ~]# condor_q -l 2399.0 | grep ^Requirements
>         Requirements = TARGET.ONLY_LHCB
>
> ======================================================================
>
> [4]
>
>
>         [root@ce-test ~]# condor_q -better-analyze 2399.0
>
>         -- Schedd: arc-ce-test02.gridpp.rl.ac.uk : <130.246.182.100:12237>
>         The Requirements expression for job 2399.000 is
>
>             TARGET.ONLY_LHCB
>
>         Job 2399.000 defines the following attributes:
>
>
>         The Requirements expression for job 2399.000 reduces to these conditions:
>
>                  Slots
>         Step    Matched  Condition
>         -----  --------  ---------
>         [0]         106  TARGET.ONLY_LHCB
>
>         No successful match recorded.
>         Last failed match: Mon Nov 23 08:56:10 2020
>
>         Reason for last match failure: no match found
>
>         2399.000:  Run analysis summary ignoring user priority.  Of 15795 machines,
>           15126 are rejected by your job's requirements
>             104 reject your job because of their own requirements
>             565 are exhausted partitionable slots
>               0 match and are already running your jobs
>               0 match but are serving other users
>               0 are available to run your job
>
>         WARNING:  Be advised:
>            Job did not match any machines's constraints
>            To see why, pick a machine that you think should match and add
>              -reverse -machine <name>
>            to your query.
>
>
> ======================================================================
>
> [5]
>
>         [root@ce-test ~]# condor_q -analyze -reverse -machine machine01 2399.0
>
>
>         -- Schedd: arc-ce-test02.gridpp.rl.ac.uk : <130.246.182.100:12237>
>         2399.0: Analyzing matches for 1 job
>                               Slot    Slot's Req      Job's Req     Both
>         Name                  Type    Matches Job   Matches Slot    Match %
>         -------------------   -----   ------------   ------------ ----------
>         slot1@machine01       Part               0              1       0.00
>         slot1_10@machine01    Dyn                0              1       0.00
>         slot1_11@machine01    Dyn                0              1       0.00
>         slot1_12@machine01    Dyn                0              1       0.00
>         slot1_14@machine01    Dyn                0              1       0.00
>         slot1_15@machine01    Dyn                0              1       0.00
>         slot1_16@machine01    Dyn                0              1       0.00
>         slot1_18@machine01    Dyn                0              1       0.00
>         slot1_19@machine01    Dyn                0              1       0.00
>         slot1_1@machine01     Dyn                0              1       0.00
>         slot1_20@machine01    Dyn                0              1       0.00
>         slot1_21@machine01    Dyn                0              1       0.00
>         slot1_22@machine01    Dyn                0              1       0.00
>         slot1_24@machine01    Dyn                0              1       0.00
>         slot1_26@machine01    Dyn                0              1       0.00
>         slot1_29@machine01    Dyn                0              1       0.00
>         slot1_2@machine01     Dyn                0              1       0.00
>         slot1_30@machine01    Dyn                0              1       0.00
>         slot1_32@machine01    Dyn                0              1       0.00
>         slot1_3@machine01     Dyn                0              1       0.00
>         slot1_5@machine01     Dyn                0              1       0.00
>         slot1_9@machine01     Dyn                0              1       0.00
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/