Hi,

I am looking for ideas on why two nodes have trouble accepting/starting jobs.

Both nodes were recently spawned and defined with DEV_RESOURCE = true, which no other node in the cluster matches. But none of my jobs with 'DEV_RESOURCE' as the only requirement start, although the job's request [1.a] and the nodes' resources [1.b] match 'in principle', with a slot and a job being found.

The nominal group's share (aka OTHER) should be sufficient (and no other user's/group's job matches the nodes' resources, i.e., the nodes are idling).

The negotiator rejects the jobs because it cannot find a match [2], although I am convinced that it should (comparing the nodes' ads with the request, it should(?) fit). The same information ends up at the scheduler [3].

Maybe somebody has a hint for me as to why the matchmaking might be failing here?

Cheers,
  Thomas


[1.a]
> condor_q -better-analyze 55.0

-- Schedd: grid-vm08.desy.de : <131.169.223.234:9620?...
The Requirements expression for job 55.000 is

    (TARGET.DEV_RESOURCE) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer)

Job 55.000 defines the following attributes:

    DiskUsage = 3
    RequestDisk = DiskUsage
    RequestMemory = 2500

The Requirements expression for job 55.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           2  TARGET.DEV_RESOURCE

No successful match recorded.
Last failed match: Wed Apr 11 11:39:52 2018

Reason for last match failure: no match found

055.000:  Run analysis summary ignoring user priority.  Of 353 machines,
    351 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      2 are able to run your job


[1.b]
> condor_q -better-analyze 55.0 -reverse -machine wn12-test.desy.de

-- Schedd: grid-vm08.desy.de : <131.169.223.234:9620?...
-- Slot: slot1@xxxxxxxxxxxxxxxxx : Analyzing matches for 1 Jobs in 1 autoclusters

The Requirements expression for this slot is

    (START) && (IsValidCheckpointPlatform) && (WithinResourceLimits)

  START is
    (NODE_IS_HEALTHY is true) && (StartJobs is true)

  IsValidCheckpointPlatform is
    (TARGET.JobUniverse isnt 1 || ((MY.CheckpointPlatform isnt undefined) && ((TARGET.LastCheckpointPlatform is MY.CheckpointPlatform) || (TARGET.NumCkpts == 0))))

  WithinResourceLimits is
    (ifThenElse(TARGET._condor_RequestCpus isnt undefined,MY.Cpus > 0 && TARGET._condor_RequestCpus <= MY.Cpus,ifThenElse(TARGET.RequestCpus isnt undefined,MY.Cpus > 0 && TARGET.RequestCpus <= MY.Cpus,1 <= MY.Cpus)) &&
     ifThenElse(TARGET._condor_RequestMemory isnt undefined,MY.Memory > 0 && TARGET._condor_RequestMemory <= MY.Memory,ifThenElse(TARGET.RequestMemory isnt undefined,MY.Memory > 0 && TARGET.RequestMemory <= MY.Memory,false)) &&
     ifThenElse(TARGET._condor_RequestDisk isnt undefined,MY.Disk > 0 && TARGET._condor_RequestDisk <= MY.Disk,ifThenElse(TARGET.RequestDisk isnt undefined,MY.Disk > 0 && TARGET.RequestDisk <= MY.Disk,false)))

This slot defines the following attributes:

    CheckpointPlatform = "LINUX X86_64 3.10.0-693.21.1.el7.x86_64 normal N/A ssse3 sse4_1 sse4_2"
    Cpus = 16
    Disk = 68089928
    Memory = 48124
    NODE_IS_HEALTHY = true
    StartJobs = true

Job 55.0 has the following attributes:

    TARGET.JobUniverse = 5
    TARGET.NumCkpts = 0
    TARGET.RequestCpus = 1
    TARGET.RequestDisk = 3
    TARGET.RequestMemory = 2500

The Requirements expression for this slot reduces to these conditions:

       Clusters
Step    Matched  Condition
-----  --------  ---------
[3]           1  IsValidCheckpointPlatform
[5]           1  WithinResourceLimits

slot1@xxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
    1 (100.00 %) match both slot and job requirements.
    1 match the requirements of this slot.
    1 have job requirements that match this slot.


[2]
> NegotiatorLog
...
04/11/18 12:18:44 ---------- Started Negotiation Cycle ----------
04/11/18 12:18:44 Phase 1:  Obtaining ads from collector ...
04/11/18 12:18:44   Getting startd private ads ...
04/11/18 12:18:45   Getting Scheduler, Submitter and Machine ads ...
04/11/18 12:18:50   Sorting 11071 ads ...
04/11/18 12:18:50 Got ads: 11071 public and 11021 private
04/11/18 12:18:50 Public ads include 25 submitter, 11021 startd
04/11/18 12:18:51 Phase 2:  Performing accounting ...
...
04/11/18 12:18:53 group quotas: WARNING: dynamic quota for group group_OPS rescaled from 0.9 to 0.321429
04/11/18 12:18:53 group quotas: WARNING: dynamic quota for group group_OTHER rescaled from 0.1 to 0.0357143
04/11/18 12:18:53 group quotas: allocation round 1
04/11/18 12:18:53 group quotas: groups= 9  requesting= 5  served= 5  unserved= 0  slots= 10911  requested= 25736  allocated= 25736  surplus= 3422  maxdelta= 6842
04/11/18 12:18:53 group quotas: entering RR iteration n= 6842
...
04/11/18 12:18:53 Group group_OPS - skipping, zero slots allocated
04/11/18 12:18:53 Group group_OTHER - BEGIN NEGOTIATION
04/11/18 12:18:53 Phase 3:  Sorting submitter ads by priority ...
04/11/18 12:18:53 Phase 4.1:  Negotiating with schedds ...
04/11/18 12:18:53   Negotiating with group_OTHER.other.grid@xxxxxxx at <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>
04/11/18 12:18:53 0 seconds so far for this submitter
04/11/18 12:18:53 0 seconds so far for this schedd
04/11/18 12:18:53     Got NO_MORE_JOBS;  schedd has no more requests
04/11/18 12:18:53     Request 00055.00000: autocluster 3 (request count 1 of 1)
04/11/18 12:18:53       Rejected 55.0 group_OTHER.other.grid@xxxxxxx <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>: no match found
04/11/18 12:18:53     Request 00056.00000: autocluster 8 (request count 1 of 11)
04/11/18 12:18:53       Rejected 56.0 group_OTHER.other.grid@xxxxxxx <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>: no match found
04/11/18 12:18:53     Request 00059.00000: autocluster 9 (request count 1 of 1)
04/11/18 12:18:53       Rejected 59.0 group_OTHER.other.grid@xxxxxxx <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>: no match found
04/11/18 12:18:53     Got NO_MORE_JOBS;  schedd has no more requests
04/11/18 12:18:53   Negotiating with group_OTHER.other.chbeyer@xxxxxxx at <131.169.56.33:9620?addrs=131.169.56.33-9620+[--1]-9620&noUDP&sock=2496730_b07b_6>
04/11/18 12:18:53 0 seconds so far for this submitter
04/11/18 12:18:53 0 seconds so far for this schedd
04/11/18 12:18:53     Got NO_MORE_JOBS;  schedd has no more requests
04/11/18 12:18:53     Request 00118.00000: autocluster 1 (request count 1 of 1)
04/11/18 12:18:53       Rejected 118.0 group_OTHER.other.chbeyer@xxxxxxx <131.169.56.33:9620?addrs=131.169.56.33-9620+[--1]-9620&noUDP&sock=2496730_b07b_6>: no match found
04/11/18 12:18:53     Got NO_MORE_JOBS;  schedd has no more requests
04/11/18 12:18:53  negotiateWithGroup resources used scheddAds length 2


[3]
> SchedLog
04/11/18 12:18:01 (pid:22384) Number of Active Workers 0
04/11/18 12:18:03 (pid:22384) Number of Active Workers 0
04/11/18 12:18:04 (pid:22384) Number of Active Workers 0
04/11/18 12:18:04 (pid:22384) Number of Active Workers 0
04/11/18 12:18:07 (pid:22384) Number of Active Workers 0
04/11/18 12:18:14 (pid:22384) Number of Active Workers 0
04/11/18 12:18:23 (pid:22384) Activity on stashed negotiator socket: <131.169.56.33:28841>
04/11/18 12:18:23 (pid:22384) Using negotiation protocol: NEGOTIATE
04/11/18 12:18:23 (pid:22384) Negotiating for owner: group_OTHER.other.grid@xxxxxxx
04/11/18 12:18:23 (pid:22384) Finished negotiating for group_OTHER.other.grid in local pool: 0 matched, 3 rejected
04/11/18 12:18:53 (pid:22384) Activity on stashed negotiator socket: <131.169.56.33:28841>
04/11/18 12:18:53 (pid:22384) Using negotiation protocol: NEGOTIATE
04/11/18 12:18:53 (pid:22384) Negotiating for owner: group_OTHER.other.grid@xxxxxxx
04/11/18 12:18:53 (pid:22384) Finished negotiating for group_OTHER.other.grid in local pool: 0 matched, 3 rejected
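
To cross-check the "0 matched, 3 rejected" line in [3] against the negotiator side in [2], the rejections can be tallied per submitter. The following is only a minimal sketch, assuming the "Rejected <job> <submitter> <address>: <reason>" line layout seen in the excerpt above; the regular expression and the function name tally_rejections are my own illustration and not part of HTCondor.

```python
import re
from collections import Counter

# Assumed layout, taken from the NegotiatorLog excerpt [2]:
#   <date> <time> Rejected <cluster.proc> <submitter> <sinful address>: <reason>
REJECTED = re.compile(
    r"Rejected\s+(?P<job>\d+\.\d+)\s+(?P<submitter>\S+)\s+<[^>]*>:\s*(?P<reason>.+)$"
)


def tally_rejections(lines):
    """Count rejected requests per (submitter, reason) pair."""
    counts = Counter()
    for line in lines:
        match = REJECTED.search(line)
        if match:
            counts[(match.group("submitter"), match.group("reason").strip())] += 1
    return counts


# The four "Rejected" lines from [2], copied verbatim.
sample = [
    "04/11/18 12:18:53 Rejected 55.0 group_OTHER.other.grid@xxxxxxx <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>: no match found",
    "04/11/18 12:18:53 Rejected 56.0 group_OTHER.other.grid@xxxxxxx <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>: no match found",
    "04/11/18 12:18:53 Rejected 59.0 group_OTHER.other.grid@xxxxxxx <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>: no match found",
    "04/11/18 12:18:53 Rejected 118.0 group_OTHER.other.chbeyer@xxxxxxx <131.169.56.33:9620?addrs=131.169.56.33-9620+[--1]-9620&noUDP&sock=2496730_b07b_6>: no match found",
]

counts = tally_rejections(sample)
for (submitter, reason), n in sorted(counts.items()):
    print(f"{submitter}: {n} x {reason}")
```

For a complete log, an open file object can be passed in place of sample, since the function only iterates over lines.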