
Re: [HTCondor-users] Nodes not taking jobs although both have matching resources/requests



Hi all,

the issue with my test node seems to be the same one, namely that the
partitionable slots are not getting split.

Christoph Beyer suggested trying Zach's workaround and deploying the
setting
  CONSUMPTION_POLICY = True
on the node, so that the slot is split by the negotiator instead of by
the schedd.

That worked, and my toy jobs are starting :)
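
As a rough illustration of what "splitting" a partitionable slot means here, the sketch below carves one request out of a slot and keeps the leftover available for further matches. This is only a conceptual model, not HTCondor code: `Resources` and `carve` are made-up names, and the numbers are the slot and job values from the analysis quoted in [1.b].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical container for the three resources shown in the ads."""
    cpus: int
    memory: int
    disk: int

    def fits(self, request: "Resources") -> bool:
        # A request fits if every requested quantity is available.
        return (request.cpus <= self.cpus
                and request.memory <= self.memory
                and request.disk <= self.disk)


def carve(pslot: Resources, request: Resources) -> tuple[Resources, Resources]:
    """Split a partitionable slot into (dynamic slot, leftover).

    Raises ValueError when the request does not fit.
    """
    if not pslot.fits(request):
        raise ValueError("request does not fit into the partitionable slot")
    leftover = Resources(
        cpus=pslot.cpus - request.cpus,
        memory=pslot.memory - request.memory,
        disk=pslot.disk - request.disk,
    )
    return request, leftover


# Values taken from the quoted slot ad and from job 55.0.
pslot = Resources(cpus=16, memory=48124, disk=68089928)
job = Resources(cpus=1, memory=2500, disk=3)

dynamic, leftover = carve(pslot, job)
print(dynamic)
print(leftover)
```

If this carving step does not happen, the job never gets a slot of its own even though the partitionable slot has ample resources, which is consistent with the symptoms described in the quoted message.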

Cheers,
  Thomas

On 2018-04-11 12:29, Thomas Hartmann wrote:
> Hi,
> 
> I am looking for ideas as to why two nodes have trouble accepting/starting
> jobs. Both nodes were recently spawned and are defined with
>   DEV_RESOURCE = true
> - an attribute that no other node in the cluster matches.
> 
> But all my jobs with 'DEV_RESOURCE' as their only requirement do not start,
> although the job's request [1.a] and the nodes' resources [1.b] match 'in
> principle', with a slot and a job being found.
> 
> The nominal group's share (aka OTHER) should be sufficient (and also no
> other user/group's job is matching the nodes' resources, i.e., the nodes
> are idling).
> 
> The negotiator rejects the jobs because it cannot find a match [2], although
> I am convinced that it should (comparing the nodes' ads with the
> request, it should(?) fit). The same information ends up at the scheduler
> [3].
> 
> Maybe somebody has a hint for me as to why the matchmaking might be failing
> here?
> 
> Cheers,
>   Thomas
> 
> [1.a]
>> condor_q -better-analyze 55.0
> 
> 
> -- Schedd: grid-vm08.desy.de : <131.169.223.234:9620?...
> The Requirements expression for job 55.000 is
> 
>     (TARGET.DEV_RESOURCE) && (TARGET.Arch == "X86_64") && (TARGET.OpSys
> == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >=
> RequestMemory) && (TARGET.HasFileTransfer)
> 
> Job 55.000 defines the following attributes:
> 
>     DiskUsage = 3
>     RequestDisk = DiskUsage
>     RequestMemory = 2500
> 
> The Requirements expression for job 55.000 reduces to these conditions:
> 
>          Slots
> Step    Matched  Condition
> -----  --------  ---------
> [0]           2  TARGET.DEV_RESOURCE
> 
> No successful match recorded.
> Last failed match: Wed Apr 11 11:39:52 2018
> 
> Reason for last match failure: no match found
> 
> 055.000:  Run analysis summary ignoring user priority.  Of 353 machines,
>     351 are rejected by your job's requirements
>       0 reject your job because of their own requirements
>       0 match and are already running your jobs
>       0 match but are serving other users
>       2 are able to run your job
> 
> [1.b]
>>  condor_q -better-analyze 55.0  -reverse -machine wn12-test.desy.de
> 
> 
> -- Schedd: grid-vm08.desy.de : <131.169.223.234:9620?...
> 
> -- Slot: slot1@xxxxxxxxxxxxxxxxx : Analyzing matches for 1 Jobs in 1
> autoclusters
> 
> The Requirements expression for this slot is
> 
>     (START) && (IsValidCheckpointPlatform) &&
>             (WithinResourceLimits)
> 
>   START is
>     (NODE_IS_HEALTHY is true) &&
>             (StartJobs is true)
> 
>   IsValidCheckpointPlatform is
>     (TARGET.JobUniverse isnt 1 ||
>             ((MY.CheckpointPlatform isnt undefined) &&
>                 ((TARGET.LastCheckpointPlatform is MY.CheckpointPlatform) ||
>                     (TARGET.NumCkpts == 0))))
> 
>   WithinResourceLimits is
>     (ifThenElse(TARGET._condor_RequestCpus isnt undefined,MY.Cpus > 0 &&
>         TARGET._condor_RequestCpus <=
> MY.Cpus,ifThenElse(TARGET.RequestCpus isnt undefined,MY.Cpus > 0 &&
>           TARGET.RequestCpus <= MY.Cpus,1 <= MY.Cpus)) &&
>       ifThenElse(TARGET._condor_RequestMemory isnt undefined,MY.Memory >
> 0 &&
>         TARGET._condor_RequestMemory <=
> MY.Memory,ifThenElse(TARGET.RequestMemory isnt undefined,MY.Memory > 0 &&
>           TARGET.RequestMemory <= MY.Memory,false)) &&
>       ifThenElse(TARGET._condor_RequestDisk isnt undefined,MY.Disk > 0 &&
>         TARGET._condor_RequestDisk <=
> MY.Disk,ifThenElse(TARGET.RequestDisk isnt undefined,MY.Disk > 0 &&
>           TARGET.RequestDisk <= MY.Disk,false)))
> 
> This slot defines the following attributes:
> 
>     CheckpointPlatform = "LINUX X86_64 3.10.0-693.21.1.el7.x86_64 normal
> N/A ssse3 sse4_1 sse4_2"
>     Cpus = 16
>     Disk = 68089928
>     Memory = 48124
>     NODE_IS_HEALTHY = true
>     StartJobs = true
> 
> Job 55.0 has the following attributes:
> 
>     TARGET.JobUniverse = 5
>     TARGET.NumCkpts = 0
>     TARGET.RequestCpus = 1
>     TARGET.RequestDisk = 3
>     TARGET.RequestMemory = 2500
> 
> The Requirements expression for this slot reduces to these conditions:
> 
>        Clusters
> Step    Matched  Condition
> -----  --------  ---------
> [3]           1  IsValidCheckpointPlatform
> [5]           1  WithinResourceLimits
> 
> slot1@xxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
>     1 (100.00 %) match both slot and job requirements.
>     1 match the requirements of this slot.
>     1 have job requirements that match this slot.
> 
> 
> [2]
>> NegotiatorLog
> ...
> 04/11/18 12:18:44 ---------- Started Negotiation Cycle ----------
> 04/11/18 12:18:44 Phase 1:  Obtaining ads from collector ...
> 04/11/18 12:18:44   Getting startd private ads ...
> 04/11/18 12:18:45   Getting Scheduler, Submitter and Machine ads ...
> 04/11/18 12:18:50   Sorting 11071 ads ...
> 04/11/18 12:18:50 Got ads: 11071 public and 11021 private
> 04/11/18 12:18:50 Public ads include 25 submitter, 11021 startd
> 04/11/18 12:18:51 Phase 2:  Performing accounting ...
> ...
> 04/11/18 12:18:53 group quotas: WARNING: dynamic quota for group
> group_OPS rescaled from 0.9 to 0.321429
> 04/11/18 12:18:53 group quotas: WARNING: dynamic quota for group
> group_OTHER rescaled from 0.1 to 0.0357143
> 04/11/18 12:18:53 group quotas: allocation round 1
> 04/11/18 12:18:53 group quotas: groups= 9  requesting= 5  served= 5
> unserved= 0  slots= 10911  requested= 25736  allocated= 25736  surplus=
> 3422  maxdelta= 6842
> 04/11/18 12:18:53 group quotas: entering RR iteration n= 6842
> ...
> 04/11/18 12:18:53 Group group_OPS - skipping, zero slots allocated
> 04/11/18 12:18:53 Group group_OTHER - BEGIN NEGOTIATION
> 04/11/18 12:18:53 Phase 3:  Sorting submitter ads by priority ...
> 04/11/18 12:18:53 Phase 4.1:  Negotiating with schedds ...
> 04/11/18 12:18:53   Negotiating with group_OTHER.other.grid@xxxxxxx at
> <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>
> 04/11/18 12:18:53 0 seconds so far for this submitter
> 04/11/18 12:18:53 0 seconds so far for this schedd
> 04/11/18 12:18:53     Got NO_MORE_JOBS;  schedd has no more requests
> 04/11/18 12:18:53     Request 00055.00000: autocluster 3 (request count
> 1 of 1)
> 04/11/18 12:18:53       Rejected 55.0 group_OTHER.other.grid@xxxxxxx
> <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>:
> no match found
> 04/11/18 12:18:53     Request 00056.00000: autocluster 8 (request count
> 1 of 11)
> 04/11/18 12:18:53       Rejected 56.0 group_OTHER.other.grid@xxxxxxx
> <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>:
> no match found
> 04/11/18 12:18:53     Request 00059.00000: autocluster 9 (request count
> 1 of 1)
> 04/11/18 12:18:53       Rejected 59.0 group_OTHER.other.grid@xxxxxxx
> <131.169.223.234:9620?addrs=131.169.223.234-9620+[2001-638-700-10df--1-ea]-9620&noUDP&sock=22340_f5d7_3>:
> no match found
> 04/11/18 12:18:53     Got NO_MORE_JOBS;  schedd has no more requests
> 04/11/18 12:18:53   Negotiating with group_OTHER.other.chbeyer@xxxxxxx
> at
> <131.169.56.33:9620?addrs=131.169.56.33-9620+[--1]-9620&noUDP&sock=2496730_b07b_6>
> 04/11/18 12:18:53 0 seconds so far for this submitter
> 04/11/18 12:18:53 0 seconds so far for this schedd
> 04/11/18 12:18:53     Got NO_MORE_JOBS;  schedd has no more requests
> 04/11/18 12:18:53     Request 00118.00000: autocluster 1 (request count
> 1 of 1)
> 04/11/18 12:18:53       Rejected 118.0 group_OTHER.other.chbeyer@xxxxxxx
> <131.169.56.33:9620?addrs=131.169.56.33-9620+[--1]-9620&noUDP&sock=2496730_b07b_6>:
> no match found
> 04/11/18 12:18:53     Got NO_MORE_JOBS;  schedd has no more requests
> 04/11/18 12:18:53  negotiateWithGroup resources used scheddAds length 2
> 
> 
> [3]
>> SchedLog
> 04/11/18 12:18:01 (pid:22384) Number of Active Workers 0
> 04/11/18 12:18:03 (pid:22384) Number of Active Workers 0
> 04/11/18 12:18:04 (pid:22384) Number of Active Workers 0
> 04/11/18 12:18:04 (pid:22384) Number of Active Workers 0
> 04/11/18 12:18:07 (pid:22384) Number of Active Workers 0
> 04/11/18 12:18:14 (pid:22384) Number of Active Workers 0
> 04/11/18 12:18:23 (pid:22384) Activity on stashed negotiator socket:
> <131.169.56.33:28841>
> 04/11/18 12:18:23 (pid:22384) Using negotiation protocol: NEGOTIATE
> 04/11/18 12:18:23 (pid:22384) Negotiating for owner:
> group_OTHER.other.grid@xxxxxxx
> 04/11/18 12:18:23 (pid:22384) Finished negotiating for
> group_OTHER.other.grid in local pool: 0 matched, 3 rejected
> 04/11/18 12:18:53 (pid:22384) Activity on stashed negotiator socket:
> <131.169.56.33:28841>
> 04/11/18 12:18:53 (pid:22384) Using negotiation protocol: NEGOTIATE
> 04/11/18 12:18:53 (pid:22384) Negotiating for owner:
> group_OTHER.other.grid@xxxxxxx
> 04/11/18 12:18:53 (pid:22384) Finished negotiating for
> group_OTHER.other.grid in local pool: 0 matched, 3 rejected
> 
> 
> 
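
The WithinResourceLimits expression shown in [1.b] can be re-evaluated by hand to confirm that the resource check itself is not what blocks the match. The following is a minimal Python transcription of that expression, assuming an undefined ClassAd attribute is represented as `None`; the function names are hypothetical, and the values are the ones printed by the reverse analysis.

```python
from typing import Optional


def within_limit(condor_req: Optional[int], req: Optional[int],
                 have: int, default: bool) -> bool:
    """One resource term of WithinResourceLimits.

    Mirrors ifThenElse(_condor_RequestX isnt undefined, ...,
            ifThenElse(RequestX isnt undefined, ..., default)).
    """
    if condor_req is not None:
        return have > 0 and condor_req <= have
    if req is not None:
        return have > 0 and req <= have
    return default


def within_resource_limits(job: dict, slot: dict) -> bool:
    # Cpus falls back to "1 <= MY.Cpus"; Memory and Disk fall back to false.
    cpus_ok = within_limit(job.get("_condor_RequestCpus"),
                           job.get("RequestCpus"),
                           slot["Cpus"], default=(1 <= slot["Cpus"]))
    memory_ok = within_limit(job.get("_condor_RequestMemory"),
                             job.get("RequestMemory"),
                             slot["Memory"], default=False)
    disk_ok = within_limit(job.get("_condor_RequestDisk"),
                           job.get("RequestDisk"),
                           slot["Disk"], default=False)
    return cpus_ok and memory_ok and disk_ok


# Attributes as printed for the slot and for job 55.0 in [1.b].
slot = {"Cpus": 16, "Memory": 48124, "Disk": 68089928}
job = {"RequestCpus": 1, "RequestMemory": 2500, "RequestDisk": 3}

print(within_resource_limits(job, slot))
```

With these values every term holds, which agrees with the analysis output reporting that the job matches the slot's requirements.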
