
Re: [HTCondor-users] after setting JOB_TRANSFORM in the schedd, jobs are not picked up by hosts



It's hard to know for sure, but it looks like your ONLY_LHCB slots are just busy running other jobs.

You can see that here

        2399.000:  Run analysis summary ignoring user priority.  Of 15795 machines,
          15126 are rejected by your job's requirements
            104 reject your job because of their own requirements
-->>>    565 are exhausted partitionable slots

And also here

        -- Schedd: arc-ce-test02.gridpp.rl.ac.uk : <130.246.182.100:12237>
        2399.0: Analyzing matches for 1 job
                              Slot    Slot's Req      Job's Req     Both
        Name                  Type    Matches Job   Matches Slot    Match %
        -------------------   -----   ------------   ------------ ----------
        slot1@machine01       Part               0              1       0.00
        slot1_10@machine01    Dyn                0              1       0.00
        .....

It looks like the Dynamic slots are too small to match, and the p-slot is out of resources so it doesn't match either.
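
To illustrate the two columns above: a match needs both sides to say yes. What follows is a minimal Python sketch of that idea, not how HTCondor actually evaluates ClassAds. The function names, the resource numbers, the job's request sizes, and the assumption that the slot side simply checks whether the request fits are all made up for illustration.

```python
# Simplified model of two-sided matchmaking. This is NOT HTCondor code.
# All attribute values below are hypothetical, not taken from the real pool.

def job_accepts(job, slot):
    # Job side: Requirements = TARGET.ONLY_LHCB
    return slot.get("ONLY_LHCB") is True

def slot_accepts(job, slot):
    # Slot side (assumed): the request must fit in what the slot has left.
    return (job["RequestCpus"] <= slot["Cpus"]
            and job["RequestMemory"] <= slot["Memory"])

def matches(job, slot):
    # A match requires BOTH requirements expressions to be satisfied.
    return job_accepts(job, slot) and slot_accepts(job, slot)

# Hypothetical job asking for more than any leftover slot can offer.
job = {"RequestCpus": 8, "RequestMemory": 16000}

slots = {
    # Partitionable slot whose resources are all handed out to dynamic slots.
    "slot1@machine01": {"ONLY_LHCB": True, "Cpus": 0, "Memory": 512},
    # Dynamic slot carved out for a smaller job.
    "slot1_10@machine01": {"ONLY_LHCB": True, "Cpus": 1, "Memory": 2048},
}

for name, slot in slots.items():
    # Columns: slot's req matches job, job's req matches slot, both
    print(name, int(slot_accepts(job, slot)), int(job_accepts(job, slot)),
          matches(job, slot))
```

In this toy model both slots satisfy the job's requirement but reject the job on their own side, which is the same 0 / 1 pattern as in the table.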

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of jcaballero.hep@xxxxxxxxx
Sent: Monday, November 23, 2020 3:57 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] after setting JOB_TRANSFORM in the schedd, jobs are not picked up by hosts

Hi,

This is a follow-up to the previous thread, "Multiple
JOB_TRANSFORMATION blocks not working on a schedd".
I think I fixed the configuration at the schedd, but matchmaking is
still not working the way I need.

I am trying to force a certain type of job (those from the lhcb VO)
to run only on a selected set of machines.
To do this, I have added a custom ClassAd attribute to those
machines, as in [1],
and I am trying to reference that attribute in the Requirements
expression of these jobs via JOB_TRANSFORM [2], as we discussed in
the other thread.
The jobs look like this [3].

However, those jobs are not being picked up.

Those machines are currently busy running other jobs,
but I thought that would be indicated in the output of condor_q
-analyze, something like

            N match but are serving other users

That is not the case [4].
It says that 106 slots matched, and yet there is no successful matching.
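
As a sanity check on the summary in [4], the buckets can be tallied with a few lines of Python. This is only a sketch: the summary lines are pasted in by hand from [4], and the helper name `tally` is hypothetical.

```python
import re

# Summary lines copied from the condor_q -better-analyze output in [4].
summary = """\
15126 are rejected by your job's requirements
104 reject your job because of their own requirements
565 are exhausted partitionable slots
0 match and are already running your jobs
0 match but are serving other users
0 are available to run your job
"""
total_machines = 15795  # "Of 15795 machines," in the same output

def tally(text):
    """Return {description: count} for each 'N description' line."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s+(.*\S)\s*$", line)
        if m:
            counts[m.group(2)] = int(m.group(1))
    return counts

counts = tally(summary)

# Do the buckets account for every machine in the pool?
print(sum(counts.values()) == total_machines)
# How many slots are reported as busy with other users' jobs?
print(counts["match but are serving other users"])
```

In this output the three non-zero buckets add up to all 15795 machines, so every slot is reported as rejecting the job or as exhausted, and none is counted as "serving other users".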

Adding the -reverse -machine options seems to indicate that the issue
is that the jobs don't meet some requirement of the machine [5].
That surprises me a little bit, since I do not remove or overwrite any
job attribute in the second JOB_TRANSFORM block, except Requirements.
Indeed, the same command on a production schedd, against a job that is
currently running, also gives me zeros in the "Slot's Req Matches Job"
column.
So I may not be understanding what it means....

Any tip on how to troubleshoot this lack of matching is more than welcome.

Thanks a lot in advance.
Cheers,
Jose

======================================================================

[1]

        [root@machine01 ~]# condor_config_val -dump | grep STARTD_ATTR
        STARTD_ATTRS =  <... other attributes...>, ShouldHibernate, PREEMPTABLE_ONLY, StartJobs, EFFICIENT_DRAIN, KILL_SIGNAL, ONLY_LHCB
        SYSTEM_STARTD_ATTRS = COLLECTOR_HOST_STRING DedicatedScheduler

        [root@lcg1863 config.d]# condor_config_val ONLY_LHCB
        True

======================================================================

[2]

        JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES), DefaultDocker, forcelhcb

        JOB_TRANSFORM_DefaultDocker @=end
        [
           Requirements = JobUniverse == 5 && DockerImage =?= undefined && Owner =!= "nagios";
           set_WantDocker = true;
           set_Requirements = ( TARGET.HasDocker ) && ( TARGET.Disk >= RequestDisk )
               && ( TARGET.Memory >= RequestMemory ) && ( TARGET.Cpus >= RequestCpus )
               && ( TARGET.HasFileTransfer )
               && ( x509UserProxyVOName =?= "atlas" && NumJobStarts == 0 || x509UserProxyVOName =!= "atlas");
           copy_TransferInput = "OriginalTransferInput";
           eval_set_TransferInput = strcat(OriginalTransferInput, ",", Cmd);
           set_PeriodicRemove = ( (RemoteUserCpu + RemoteSysCpu > JobCpuLimit) ?: False )
               || ( (RemoteWallClockTime > JobTimeLimit) ?: False )
        ]
        @end


        JOB_TRANSFORM_forcelhcb @=end
        [
           Requirements = JobUniverse == 5 && x509UserProxyVOName == "lhcb" && ScheddHostName == "ce-test";
           set_Requirements = TARGET.ONLY_LHCB;
        ]
        @end


======================================================================

[3]

        [root@ce-test ~]# condor_q -l 2399.0 | grep ^Requirements
        Requirements = TARGET.ONLY_LHCB

======================================================================

[4]


        [root@ce-test ~]# condor_q -better-analyze 2399.0

        -- Schedd: arc-ce-test02.gridpp.rl.ac.uk : <130.246.182.100:12237>
        The Requirements expression for job 2399.000 is

            TARGET.ONLY_LHCB

        Job 2399.000 defines the following attributes:


        The Requirements expression for job 2399.000 reduces to these conditions:

                 Slots
        Step    Matched  Condition
        -----  --------  ---------
        [0]         106  TARGET.ONLY_LHCB

        No successful match recorded.
        Last failed match: Mon Nov 23 08:56:10 2020

        Reason for last match failure: no match found

        2399.000:  Run analysis summary ignoring user priority.  Of 15795 machines,
          15126 are rejected by your job's requirements
            104 reject your job because of their own requirements
            565 are exhausted partitionable slots
              0 match and are already running your jobs
              0 match but are serving other users
              0 are available to run your job

        WARNING:  Be advised:
           Job did not match any machines's constraints
           To see why, pick a machine that you think should match and add
             -reverse -machine <name>
           to your query.


======================================================================

[5]

        [root@ce-test ~]# condor_q -analyze -reverse -machine machine01 2399.0


        -- Schedd: arc-ce-test02.gridpp.rl.ac.uk : <130.246.182.100:12237>
        2399.0: Analyzing matches for 1 job
                              Slot    Slot's Req      Job's Req     Both
        Name                  Type    Matches Job   Matches Slot    Match %
        -------------------   -----   ------------   ------------ ----------
        slot1@machine01       Part               0              1       0.00
        slot1_10@machine01    Dyn                0              1       0.00
        slot1_11@machine01    Dyn                0              1       0.00
        slot1_12@machine01    Dyn                0              1       0.00
        slot1_14@machine01    Dyn                0              1       0.00
        slot1_15@machine01    Dyn                0              1       0.00
        slot1_16@machine01    Dyn                0              1       0.00
        slot1_18@machine01    Dyn                0              1       0.00
        slot1_19@machine01    Dyn                0              1       0.00
        slot1_1@machine01     Dyn                0              1       0.00
        slot1_20@machine01    Dyn                0              1       0.00
        slot1_21@machine01    Dyn                0              1       0.00
        slot1_22@machine01    Dyn                0              1       0.00
        slot1_24@machine01    Dyn                0              1       0.00
        slot1_26@machine01    Dyn                0              1       0.00
        slot1_29@machine01    Dyn                0              1       0.00
        slot1_2@machine01     Dyn                0              1       0.00
        slot1_30@machine01    Dyn                0              1       0.00
        slot1_32@machine01    Dyn                0              1       0.00
        slot1_3@machine01     Dyn                0              1       0.00
        slot1_5@machine01     Dyn                0              1       0.00
        slot1_9@machine01     Dyn                0              1       0.00