
[HTCondor-users] Jobs scheduling with flock_to

Hello Experts,

I am looking for some assistance in troubleshooting a schedd issue with a flocking configuration.

Issue Description:

Jobs remain idle in the queue for a long time even though ample slots are available to accommodate them in pools listed in the schedd's flocking configuration. However, jobs whose requirements match the primary cluster are scheduled without any issue.

The affected jobs have requirements that match only pools C and E.

FLOCK_TO = A, B, C, D, E, F


- -better-analyze never helps to identify the issue in this case.

condor_q <jobid> -better-analyze -p <poolname>

- The negotiator logs of pools C and E do not show the jobs being considered for matchmaking, which means the schedd is not presenting the jobs for matchmaking.

04/14/20 02:54:43   Negotiating with group_user1.testuser1@xxxxxxxxxxx at <xx.xx.xx.xx:9618?addrs=xx.xx.xx.xx-9618&noUDP&sock=6712_c42a_3>
04/14/20 02:54:43 0 seconds so far for this submitter
04/14/20 02:54:43 0 seconds so far for this schedd
04/14/20 02:54:43     Got NO_MORE_JOBS;  schedd has no more requests

- As I understand it, the schedd should not even consider the other pools for matchmaking, but when I check the schedd debug logs, it considers pools D and F more often than C and E. Why is that?

- To fix the issue I had to change the flocking order, since A and B were cloud pools; but again, they were not mentioned in the job requirements.

 FLOCK_TO = C, D, E, F, A, B

Job matchmaking became very quick after this change.

grep 'Finished negotiating for group_user1.testuser1@xxxxxxxxxxx  in pool' /var/log/condor/SchedLog | grep '04/14/20 03:' | awk '{print $(NF-4)}' | sort | uniq -c
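
For comparing the per-pool counts before and after a FLOCK_TO change, the pipeline above can also be sketched in Python. This is only a rough equivalent, not an HTCondor tool: the function name and its parameters are made up for illustration, and it assumes, exactly as the awk step does, that the pool name is the fifth whitespace-separated field from the end of each "Finished negotiating" line.

```python
from collections import Counter


def count_negotiations_per_pool(lines, submitter, time_prefix):
    """Count 'Finished negotiating' SchedLog lines per pool.

    Hypothetical helper mirroring the grep | grep | awk | sort | uniq -c
    pipeline: the pool is taken from the same field as awk's $(NF-4).
    """
    marker = "Finished negotiating for " + submitter
    counts = Counter()
    for line in lines:
        # Same filters as the two grep stages (plain substring matches).
        if marker not in line or "in pool" not in line:
            continue
        if time_prefix not in line:
            continue
        fields = line.split()
        if len(fields) >= 5:
            counts[fields[-5]] += 1  # fifth field from the end, like $(NF-4)
    return counts


# Possible usage (log path and values taken from the command above):
# with open("/var/log/condor/SchedLog", errors="replace") as log:
#     counts = count_negotiations_per_pool(
#         log, "group_user1.testuser1@xxxxxxxxxxx", "04/14/20 03:")
# for pool, n in counts.most_common():
#     print(n, pool)
```

Running it once for the hour before the change and once for the hour after should show whether pools C and E are negotiated with more often under the new order.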

How long does a job need to wait on one pool in the flock list before moving on to the next one? And if we pass requirements in the job, shouldn't the schedd only consider the pools mentioned in the requirements? Also, is there any timeout value after which jobs give up on matchmaking?

Thanks & Regards,
Vikrant Aggarwal