[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Jobs take ages to start



Hi Greg,

 

I ran some jobs while having two shells showing the outputs of watch -n 1 tail -n 100 /var/log/condor/NegotiatorLog and watch -n 1 tail -n 100 /var/log/condor/SchedLog on the CM and the Submit node, respectively.

 

What I can tell you for sure is that the problem is not with the starter. The job goes in running state and finishes within seconds after the two logs mentioned above report a successful negotiation.

 

The SchedLog showed the following:

  1. Shortly after submitting the job, it showed this:
    04/24/23 10:24:52 (pid:84) (cid:35) actOnJobs: didn't do any work, aborting
  2. A few seconds later, it looks like there was some trouble communicating with the negotiator:
    04/24/23 10:24:57 (pid:84) Can't find address for negotiator
    04/24/23 10:24:57 (pid:84) Failed to send RESCHEDULE to unknown daemon:
    I had just restarted my test pool, maybe it did not have time to fully boot back up yet.
  3. 15 seconds later, it could talk with the negotiator again:
    04/24/23 10:25:11 (pid:84) Using negotiation protocol: NEGOTIATE
    04/24/23 10:25:11 (pid:84) Negotiating for owner: default.mmepi@xxxxxxxxxxxxxxxxxx
    04/24/23 10:25:11 (pid:84) Rebuilt prioritized runnable job list in 0.000s.
    04/24/23 10:25:11 (pid:84) Finished sending rrls to negotiator
    04/24/23 10:25:11 (pid:84) Finished sending RRL for default.mmepi
  4. For the next 5 minutes, it only showed these two lines, again and again:
    04/24/23 10:30:10 (pid:84) Number of Active Workers 0
    04/24/23 10:30:11 (pid:84) Received a superuser command
  5. Then, exactly 5 minutes after step 3, this happened:
    04/24/23 10:30:12 (pid:84) Activity on stashed negotiator socket: <172.21.0.1:40539>
    04/24/23 10:30:12 (pid:84) Using negotiation protocol: NEGOTIATE
    04/24/23 10:30:12 (pid:84) Negotiating for owner: default.mmepi@xxxxxxxxxxxxxxxxxx
    04/24/23 10:30:12 (pid:84) AutoCluster:config() significant attributes changed to MachineLastMatchTime Offline RemoteOwner RequestCpus RequestDisk RequestMemory TotalJobRuntime ConcurrencyLimits FlockTo Rank Requirements
    04/24/23 10:30:12 (pid:84) Rebuilt prioritized runnable job list in 0.000s.
    04/24/23 10:30:12 (pid:84) Finished sending rrls to negotiator
    04/24/23 10:30:12 (pid:84) Finished sending RRL for default.mmepi
    04/24/23 10:30:12 (pid:84) Activity on stashed negotiator socket: <172.21.0.1:40539>
    04/24/23 10:30:12 (pid:84) Using negotiation protocol: NEGOTIATE
    04/24/23 10:30:12 (pid:84) Negotiating for owner: default.mmepi@xxxxxxxxxxxxxxxxxx
    04/24/23 10:30:12 (pid:84) Negotiation ended - 1 jobs matched
    04/24/23 10:30:12 (pid:84) Finished negotiating for default.mmepi in local pool: 1 matched, 0 rejected
    04/24/23 10:30:12 (pid:84) Starting add_shadow_birthdate(1.0)
    04/24/23 10:30:12 (pid:84) Started shadow for job 1.0 on slot1@xxxxxxxxxxxxxxxxxxxxx <172.21.0.4:9618?CCBID=172.21.0.2:9618%3faddrs%3d172.21.0.2-9618%26alias%3dcondor.test.cm%26noUDP%26sock%3dcollector#2&PrivNet=condor.test.execute.1&addrs=172.21.0.4-9618&alias=condor.test.execute.1&noUDP&sock=startd_28_c8bd> for default.mmepi, (shadow pid = 4196)
  6. After that, the job went from Idle to Running and completed shortly after

 

For the NegotiatorLog:

  1. At the exact time where the above step 3 happened, there is a corresponding entry in the Negotiator logs:
    04/24/23 10:25:11 ---------- Started Negotiation Cycle ----------
    04/24/23 10:25:11 Phase 1:  Obtaining ads from collector ...
    04/24/23 10:25:11   Getting startd private ads ...
    04/24/23 10:25:11   Getting Scheduler, Submitter and Machine ads ...
    04/24/23 10:25:11   Sorting 1 ads ...
    04/24/23 10:25:11 Got ads: 1 public and 0 private
    04/24/23 10:25:11 Public ads include 1 submitter, 0 startd
    04/24/23 10:25:11 Phase 2:  Performing accounting ...
    04/24/23 10:25:11 Phase 3:  Sorting submitter ads by priority ...
    04/24/23 10:25:11 Starting prefetch round; 1 potential prefetches to do.
    04/24/23 10:25:11 Starting prefetch negotiation for default.mmepi@xxxxxxxxxxxxxxxxxx.
    04/24/23 10:25:11     Got NO_MORE_JOBS;  schedd has no more requests
    04/24/23 10:25:11 Prefetch summary: 1 attempted, 1 successful.
    04/24/23 10:25:11 Phase 4.1:  Negotiating with schedds ...
    04/24/23 10:25:11   Negotiating with default.mmepi@xxxxxxxxxxxxxxxxxx at <172.21.0.3:9618?addrs=172.21.0.3-9618&alias=condor.test.submit&noUDP&sock=schedd_65_29be>
    04/24/23 10:25:11 0 seconds so far for this submitter
    04/24/23 10:25:11 0 seconds so far for this schedd
    04/24/23 10:25:11   Negotiation with default.mmepi@xxxxxxxxxxxxxxxxxx skipped because submitterCeiling remaining is 2147483647
    04/24/23 10:25:11  negotiateWithGroup resources used submitterAds length 0
    04/24/23 10:25:11 ---------- Finished Negotiation Cycle ----------

    Looks like this submitterCeiling thing is causing the delay ?
  2. Then, absolutely nothing for 5 minutes:
    04/24/23 10:29:52 ---------- Started Negotiation Cycle ----------
    04/24/23 10:29:52 Phase 1:  Obtaining ads from collector ...
    04/24/23 10:29:52   Getting startd private ads ...
    04/24/23 10:29:52   Getting Scheduler, Submitter and Machine ads ...
    04/24/23 10:29:52   Sorting 1 ads ...
    04/24/23 10:29:52 Got ads: 1 public and 1 private
    04/24/23 10:29:52 Public ads include 0 submitter, 1 startd
    04/24/23 10:29:52 Phase 2:  Performing accounting ...
    04/24/23 10:29:52 Phase 3:  Sorting submitter ads by priority ...
    04/24/23 10:29:52 Starting prefetch round; 0 potential prefetches to do.
    04/24/23 10:29:52 Prefetch summary: 0 attempted, 0 successful.
    04/24/23 10:29:52 Phase 4.1:  Negotiating with schedds ...
    04/24/23 10:29:52  negotiateWithGroup resources used submitterAds length 0
    04/24/23 10:29:52 ---------- Finished Negotiation Cycle ----------
  3. And again, after exactly five minutes:
    04/24/23 10:30:12 ---------- Started Negotiation Cycle ----------
    04/24/23 10:30:12 Phase 1:  Obtaining ads from collector ...
    04/24/23 10:30:12   Getting startd private ads ...
    04/24/23 10:30:12   Getting Scheduler, Submitter and Machine ads ...
    04/24/23 10:30:12   Sorting 2 ads ...
    04/24/23 10:30:12 Got ads: 2 public and 1 private
    04/24/23 10:30:12 Public ads include 1 submitter, 1 startd

04/24/23 10:30:12 Phase 2:  Performing accounting ...
04/24/23 10:30:12 Phase 3:  Sorting submitter ads by priority ...
04/24/23 10:30:12 Starting prefetch round; 1 potential prefetches to do.
04/24/23 10:30:12 Starting prefetch negotiation for default.mmepi@xxxxxxxxxxxxxxxxxx.
04/24/23 10:30:12     Got NO_MORE_JOBS;  schedd has no more requests
04/24/23 10:30:12 Prefetch summary: 1 attempted, 1 successful.
04/24/23 10:30:12 Phase 4.1:  Negotiating with schedds ...
04/24/23 10:30:12   Negotiating with default.mmepi@xxxxxxxxxxxxxxxxxx at <172.21.0.3:9618?addrs=172.21.0.3-9618&alias=condor.test.submit&noUDP&sock=schedd_65_29be>
04/24/23 10:30:12 0 seconds so far for this submitter
04/24/23 10:30:12 0 seconds so far for this schedd
04/24/23 10:30:12     Request 00001.00000: autocluster 1 (request count 1 of 1)
04/24/23 10:30:12       Matched 1.0 default.mmepi@xxxxxxxxxxxxxxxxxx <172.21.0.3:9618?addrs=172.21.0.3-9618&alias=condor.test.submit&noUDP&sock=schedd_65_29be> preempting none <172.21.0.4:9618?CCBID=172.21.0.2:9618%3faddrs%3d172.21.0.2-9618%26alias%3dcondor.test.cm%26noUDP%26sock%3dcollector#2&PrivNet=condor.test.execute.1&addrs=172.21.0.4-9618&alias=condor.test.execute.1&noUDP&sock=startd_28_c8bd> slot1@xxxxxxxxxxxxxxxxxxxxx
04/24/23 10:30:12       Successfully matched with slot1@xxxxxxxxxxxxxxxxxxxxx
04/24/23 10:30:12  negotiateWithGroup resources used submitterAds length 1
04/24/23 10:30:12 ---------- Finished Negotiation Cycle ----------

And again, from there the job goes in running state and completes.

 

 

So, it looks like this submitterCeiling thing is what causes my issue. But I donât get it, the user I am using to submit has no other job in the pool, there are no other jobs running anywhere.

Why does it consider it has reached that ceiling, why does it wait exactly 5 minutes and why is it fine after 5 minutes ?

 

Thanks,

 

GaÃtan

 


Gaetan Geffroy
Junior Software Engineer
Terma GmbH
T +49 6151 86005 43 (direct)
 


 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Greg Thain via HTCondor-users
Sent: Friday, April 21, 2023 20:01
To: htcondor-users@xxxxxxxxxxx
Cc: Greg Thain <gthain@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Jobs take ages to start

 

CAUTION: This email originated from outside of Terma. Do not click links or open attachments unless you recognize the sender and know the content is safe.

On 4/21/23 07:48, Gaetan Geffroy wrote:

Hi,

 

Some of my jobs stay in IDLE for a really long time compared to what I have configured.

On the submit node, I set NEGOTIATOR_INTERVAL to 5 seconds, expecting jobs to start extremely quickly after being submitted.

This is a local pool for test and training purposes, so I actually want the jobs to start and execute almost immediately.

 

When looking at the NegotiatorLog file, it looks like the negotiation cycles are happening as often as intended, yet most of my jobs stay in queue for 5 minutes.

Those are docker universe jobs. Condor_q -better-analyze tells me there are machines capable of running my jobs.

Even weirder, I tried to download one of my docker images as a TAR archive and load manually in Docker, and then the jobs using that image start immediately.

The same jobs using the same image but from a Docker repo take 5 minutes to start.

 

Is there another knob I could turn to speed up the process ?

 

 

This depends on where in the process the delay is.  Do know know if this is a delay in matchmaking?  in the schedd?  In the starter once the job has begun to start?

-greg