
[HTCondor-users] Nodes not taking jobs although both have matching resources/requests



Hi,

I am looking for ideas as to why two nodes have trouble accepting/starting
jobs. Both nodes were spawned recently and are defined with
  DEV_RESOURCE = true
- an attribute on which no other node in the cluster matches.

But none of my jobs with 'DEV_RESOURCE' as their only requirement start,
although the job's request [1.a] and the nodes' resources [1.b] match 'in
principle', with a matching slot and a matching job being found.
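
As a cross-check of the 'in principle' match, here is a minimal Python
sketch that re-evaluates the resource-related clauses by hand, using only
the attribute values printed in [1.a] and [1.b]. It is not HTCondor's
ClassAd evaluator: the function names are made up for illustration, the
_condor_Request* overrides are ignored, and the Arch, OpSys and
HasFileTransfer clauses are left out because their slot values are not
shown in the output.

```python
# Slot attributes as printed by the -reverse analysis in [1.b].
slot = {"Cpus": 16, "Disk": 68089928, "Memory": 48124,
        "NODE_IS_HEALTHY": True, "StartJobs": True}

# Job 55.0 attributes as printed in [1.a] and [1.b].
job = {"JobUniverse": 5, "NumCkpts": 0,
       "RequestCpus": 1, "RequestDisk": 3, "RequestMemory": 2500}


def within_resource_limits(slot, job):
    """Simplified WithinResourceLimits (no _condor_Request* overrides)."""
    return all(
        slot[res] > 0 and job["Request" + res] <= slot[res]
        for res in ("Cpus", "Memory", "Disk")
    )


def slot_accepts(slot, job):
    """Slot side: START && IsValidCheckpointPlatform && WithinResourceLimits."""
    start = slot["NODE_IS_HEALTHY"] and slot["StartJobs"]
    # Simplified: the slot defines CheckpointPlatform, so for a standard
    # universe job (JobUniverse 1) only the NumCkpts == 0 branch is modelled.
    valid_ckpt = job["JobUniverse"] != 1 or job["NumCkpts"] == 0
    return start and valid_ckpt and within_resource_limits(slot, job)


def job_resource_clauses(slot, job):
    """Job side: only the Disk and Memory clauses of the Requirements."""
    return (slot["Disk"] >= job["RequestDisk"]
            and slot["Memory"] >= job["RequestMemory"])


print("slot accepts job:", slot_accepts(slot, job))
print("job resource clauses hold:", job_resource_clauses(slot, job))
```

Both checks pass with the values above, which agrees with what
condor_q -better-analyze reports in both directions.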

The nominal group's share (aka OTHER) should be sufficient, and no other
user's or group's jobs match the nodes' resources, i.e., the nodes are
idling.
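
For a rough idea of what that share amounts to, the following Python
sketch redoes the arithmetic from the "group quotas" warnings in [2]. It
assumes that the rescaling divides each dynamic quota by the sum of the
sibling groups' dynamic quotas, and that the "slots= 10911" figure in the
same log is the pool size the quota applies to; both are my reading of the
log, not something I have verified in the negotiator code.

```python
# Numbers copied from the NegotiatorLog excerpt in [2].
configured = {"group_OPS": 0.9, "group_OTHER": 0.1}
rescaled = {"group_OPS": 0.321429, "group_OTHER": 0.0357143}
total_slots = 10911  # "slots= 10911" in the allocation round line

# Rescale factor per group; identical factors suggest a common normalisation.
factors = {g: rescaled[g] / configured[g] for g in configured}

# Assumption: factor = 1 / (sum of all sibling dynamic quotas).
implied_quota_sum = 1.0 / factors["group_OTHER"]

# Approximate number of slots covered by the rescaled group_OTHER quota.
other_slots = rescaled["group_OTHER"] * total_slots

for group, factor in factors.items():
    print(f"{group}: rescale factor {factor:.6f}")
print(f"implied sum of dynamic quotas: {implied_quota_sum:.3f}")
print(f"group_OTHER share in slots:    {other_slots:.1f}")
```

Under these assumptions the dynamic quotas of all groups add up to about
2.8, and group_OTHER is still left with a share of roughly 390 slots, far
more than the handful of idle jobs need.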

The negotiator rejects the jobs because it cannot find a match [2],
although I am convinced that it should match (comparing the nodes' ads
with the request, it should(?) fit). The same information ends up at the
scheduler [3].

Maybe somebody has a hint for me as to why the matchmaking might be
failing here?

Cheers,
  Thomas

[1.a]
> condor_q -better-analyze 55.0


-- Schedd: grid-vm08.desy.de : <131.169.223.234:9620?...
The Requirements expression for job 55.000 is

    (TARGET.DEV_RESOURCE) && (TARGET.Arch == "X86_64") &&
    (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) &&
    (TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer)

Job 55.000 defines the following attributes:

    DiskUsage = 3
    RequestDisk = DiskUsage
    RequestMemory = 2500

The Requirements expression for job 55.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           2  TARGET.DEV_RESOURCE

No successful match recorded.
Last failed match: Wed Apr 11 11:39:52 2018

Reason for last match failure: no match found

055.000:  Run analysis summary ignoring user priority.  Of 353 machines,
    351 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      2 are able to run your job

[1.b]
>  condor_q -better-analyze 55.0  -reverse -machine wn12-test.desy.de


-- Schedd: grid-vm08.desy.de : <131.169.223.234:9620?...

-- Slot: slot1@xxxxxxxxxxxxxxxxx : Analyzing matches for 1 Jobs in 1
autoclusters

The Requirements expression for this slot is

    (START) && (IsValidCheckpointPlatform) &&
            (WithinResourceLimits)

  START is
    (NODE_IS_HEALTHY is true) &&
            (StartJobs is true)

  IsValidCheckpointPlatform is
    (TARGET.JobUniverse isnt 1 ||
            ((MY.CheckpointPlatform isnt undefined) &&
                ((TARGET.LastCheckpointPlatform is MY.CheckpointPlatform) ||
                    (TARGET.NumCkpts == 0))))

  WithinResourceLimits is
    (ifThenElse(TARGET._condor_RequestCpus isnt undefined,
          MY.Cpus > 0 && TARGET._condor_RequestCpus <= MY.Cpus,
          ifThenElse(TARGET.RequestCpus isnt undefined,
            MY.Cpus > 0 && TARGET.RequestCpus <= MY.Cpus,
            1 <= MY.Cpus)) &&
      ifThenElse(TARGET._condor_RequestMemory isnt undefined,
          MY.Memory > 0 && TARGET._condor_RequestMemory <= MY.Memory,
          ifThenElse(TARGET.RequestMemory isnt undefined,
            MY.Memory > 0 && TARGET.RequestMemory <= MY.Memory,
            false)) &&
      ifThenElse(TARGET._condor_RequestDisk isnt undefined,
          MY.Disk > 0 && TARGET._condor_RequestDisk <= MY.Disk,
          ifThenElse(TARGET.RequestDisk isnt undefined,
            MY.Disk > 0 && TARGET.RequestDisk <= MY.Disk,
            false)))

This slot defines the following attributes:

    CheckpointPlatform = "LINUX X86_64 3.10.0-693.21.1.el7.x86_64 normal N/A ssse3 sse4_1 sse4_2"
    Cpus = 16
    Disk = 68089928
    Memory = 48124
    NODE_IS_HEALTHY = true
    StartJobs = true

Job 55.0 has the following attributes:

    TARGET.JobUniverse = 5
    TARGET.NumCkpts = 0
    TARGET.RequestCpus = 1
    TARGET.RequestDisk = 3
    TARGET.RequestMemory = 2500

The Requirements expression for this slot reduces to these conditions:

       Clusters
Step    Matched  Condition
-----  --------  ---------
[3]           1  IsValidCheckpointPlatform
[5]           1  WithinResourceLimits

slot1@xxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
    1 (100.00 %) match both slot and job requirements.
    1 match the requirements of this slot.
    1 have job requirements that match this slot.


[2]
> NegotiatorLog
...
04/11/18 12:18:44 ---------- Started Negotiation Cycle ----------
04/11/18 12:18:44 Phase 1:  Obtaining ads from collector ...
04/11/18 12:18:44   Getting startd private ads ...
04/11/18 12:18:45   Getting Scheduler, Submitter and Machine ads ...
04/11/18 12:18:50   Sorting 11071 ads ...
04/11/18 12:18:50 Got ads: 11071 public and 11021 private
04/11/18 12:18:50 Public ads include 25 submitter, 11021 startd
04/11/18 12:18:51 Phase 2:  Performing accounting ...
...
04/11/18 12:18:53 group quotas: WARNING: dynamic quota for group
group_OPS rescaled from 0.9 to 0.321429
04/11/18 12:18:53 group quotas: WARNING: dynamic quota for group
group_OTHER rescaled from 0.1 to 0.0357143
04/11/18 12:18:53 group quotas: allocation round 1
04/11/18 12:18:53 group quotas: groups= 9  requesting= 5  served= 5
unserved= 0  slots= 10911  requested= 25736  allocated= 25736  surplus=
3422  maxdelta= 6842
04/11/18 12:18:53 group quotas: entering RR iteration n= 6842
...
04/11/18 12:18:53 Group group_OPS - skipping, zero slots allocated
04/11/18 12:18:53 Group group_OTHER - BEGIN NEGOTIATION
04/11/18 12:18:53 Phase 3:  Sorting submitter ads by priority ...
04/11/18 12:18:53 Phase 4.1:  Negotiating with schedds ...
04/11/18 12:18:53   Negotiating with group_OTHER.other.grid@xxxxxxx at
<131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>
04/11/18 12:18:53 0 seconds so far for this submitter
04/11/18 12:18:53 0 seconds so far for this schedd
04/11/18 12:18:53     Got NO_MORE_JOBS;  schedd has no more requests
04/11/18 12:18:53     Request 00055.00000: autocluster 3 (request count
1 of 1)
04/11/18 12:18:53       Rejected 55.0 group_OTHER.other.grid@xxxxxxx
<131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>:
no match found
04/11/18 12:18:53     Request 00056.00000: autocluster 8 (request count
1 of 11)
04/11/18 12:18:53       Rejected 56.0 group_OTHER.other.grid@xxxxxxx
<131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>:
no match found
04/11/18 12:18:53     Request 00059.00000: autocluster 9 (request count
1 of 1)
04/11/18 12:18:53       Rejected 59.0 group_OTHER.other.grid@xxxxxxx
<131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>:
no match found
04/11/18 12:18:53     Got NO_MORE_JOBS;  schedd has no more requests
04/11/18 12:18:53   Negotiating with group_OTHER.other.chbeyer@xxxxxxx
at
<131.169.56.33:9620?addrs=131.169.56.33-9620+[--1]-9620&noUDP&sock=2496730_b07b_6>
04/11/18 12:18:53 0 seconds so far for this submitter
04/11/18 12:18:53 0 seconds so far for this schedd
04/11/18 12:18:53     Got NO_MORE_JOBS;  schedd has no more requests
04/11/18 12:18:53     Request 00118.00000: autocluster 1 (request count
1 of 1)
04/11/18 12:18:53       Rejected 118.0 group_OTHER.other.chbeyer@xxxxxxx
<131.169.56.33:9620?addrs=131.169.56.33-9620+[--1]-9620&noUDP&sock=2496730_b07b_6>:
no match found
04/11/18 12:18:53     Got NO_MORE_JOBS;  schedd has no more requests
04/11/18 12:18:53  negotiateWithGroup resources used scheddAds length 2


[3]
> SchedLog
04/11/18 12:18:01 (pid:22384) Number of Active Workers 0
04/11/18 12:18:03 (pid:22384) Number of Active Workers 0
04/11/18 12:18:04 (pid:22384) Number of Active Workers 0
04/11/18 12:18:04 (pid:22384) Number of Active Workers 0
04/11/18 12:18:07 (pid:22384) Number of Active Workers 0
04/11/18 12:18:14 (pid:22384) Number of Active Workers 0
04/11/18 12:18:23 (pid:22384) Activity on stashed negotiator socket:
<131.169.56.33:28841>
04/11/18 12:18:23 (pid:22384) Using negotiation protocol: NEGOTIATE
04/11/18 12:18:23 (pid:22384) Negotiating for owner:
group_OTHER.other.grid@xxxxxxx
04/11/18 12:18:23 (pid:22384) Finished negotiating for
group_OTHER.other.grid in local pool: 0 matched, 3 rejected
04/11/18 12:18:53 (pid:22384) Activity on stashed negotiator socket:
<131.169.56.33:28841>
04/11/18 12:18:53 (pid:22384) Using negotiation protocol: NEGOTIATE
04/11/18 12:18:53 (pid:22384) Negotiating for owner:
group_OTHER.other.grid@xxxxxxx
04/11/18 12:18:53 (pid:22384) Finished negotiating for
group_OTHER.other.grid in local pool: 0 matched, 3 rejected
