
Re: [HTCondor-users] HTCondor diagram of daemons



Because I will be talking to RSEs who might be skeptical that the extra process steps have tangible benefits, I'd like to be able to explain some of the robustness features enabled by this design.

 

  • If the Schedd goes down, what happens to the work on execute machines when it finishes? Would the Shadow still be running so the job output would be transferred?
  • Similarly, if the Startd stops, would work carry on as normal for the jobs currently running?

 

I will definitely test out these and other daemon questions on my local minicondor instance, but I figured I'd ask on these two first.

Cheers,
Matt

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Greg Thain
Sent: 04 April 2022 06:45 PM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] HTCondor diagram of daemons

 


 

On 4/4/22 12:21, West, Matthew wrote:

Thanks Greg for finding the diagrams (slides 23-32) from T.T. I was looking for. The notes on slides 28-30 of the PPTX version go into finer sequential detail about the steps in Claim acquisition. Some additional questions:

  1. What is Q on slide 27? I understand that J is the job classad and S is the classad for the execute machine.


An important part of the design is that we want the system as a whole to be able to support more jobs than any one scheduler process can hold (even though there are many sites with just one schedd).  So, when the schedd wants to tell the collector that it needs matches from the negotiator, it can't just upload all the jobs to the collector, as there might be way too many.  Instead, it sends "submitter records", which condense the requests into a single classad record per submitter, giving the number of requests that submitter has in that schedd. If you are curious, these are visible with the "condor_status -submitter" command.  I'm not sure how we got "Q" out of that.
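As a rough illustration of that condensing step, here is a toy sketch in plain Python. It is not HTCondor code: the job dictionaries and the field names "Owner", "Name" and "IdleJobs" are assumptions picked for the example, and a real submitter classad carries more attributes.

```python
from collections import Counter

def submitter_records(jobs):
    """Condense a list of queued jobs into one summary record per submitter."""
    # Count how many jobs each owner has in this (hypothetical) schedd queue.
    counts = Counter(job["Owner"] for job in jobs)
    # One small record per submitter, regardless of how many jobs they have.
    return [{"Name": owner, "IdleJobs": n} for owner, n in sorted(counts.items())]

jobs = [
    {"Owner": "alice", "Cmd": "sim.sh"},
    {"Owner": "bob", "Cmd": "fit.py"},
    {"Owner": "alice", "Cmd": "plot.py"},
]
records = submitter_records(jobs)
print(records)
```

The point is only that the collector sees two small records here, not three jobs, and that the record count grows with the number of submitters, not with the number of jobs.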

  2. Does the Shadow talk to the Startd and tell it to make a Starter?

Yes.  Once the schedd has been given a slot to use from the negotiator, the schedd "claim"s the slot, for exclusive (but time-limited) use by that schedd.  Assuming that succeeds, the shadow "activate"s the claim to run a single job, which causes the startd to create the starter.
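The claim-then-activate sequence can be pictured as a small state machine. The following is a toy model under my own assumptions, not the real protocol: the class name Slot and its methods are hypothetical, and time limits, security and networking are all left out.

```python
class Slot:
    """Toy model of a startd slot: unclaimed -> claimed -> running one job."""

    def __init__(self):
        self.claimed_by = None  # name of the schedd holding the claim, if any
        self.starter = None     # job handled by the current starter, if any

    def claim(self, schedd):
        # A slot is for the exclusive use of one schedd at a time.
        if self.claimed_by is not None:
            return False
        self.claimed_by = schedd
        return True

    def activate(self, schedd, job):
        # Activation needs a matching claim and no job already running.
        if self.claimed_by != schedd or self.starter is not None:
            return False
        self.starter = job  # stands in for the startd creating a starter
        return True

    def job_done(self):
        # The starter goes away, but the claim itself stays in place.
        self.starter = None

slot = Slot()
print(slot.claim("schedd-A"))             # first claim succeeds
print(slot.claim("schedd-B"))             # slot is already claimed
print(slot.activate("schedd-A", "job1"))  # activation starts one job
slot.job_done()
print(slot.claimed_by)                    # claim survives job completion
```

Keeping "claim" and "activate" as separate steps is what lets a claim outlive any single job.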

  3. Where does file transfer (inbound and outbound) fit into these steps?

File transfer is handled by the shadow and the starter.  Input xfer happens right after activation, and Output after the job completes, but the claim is still active during file xfer.

  4. Are there additional communications between processes once a single job is completed?

 

Once the first job on a claim completes, if the amount of time it took is less than CLAIM_WORKLIFE, and the schedd can find another job that fits in the slot, it is free to launch another starter to reuse the existing claim, but with a new activation for the new job.
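Reduced to its two conditions, the reuse decision described above looks roughly like this. It is a simplified sketch: the function and parameter names are hypothetical, the times are assumed to be in seconds, and the real schedd logic considers more than this.

```python
def can_reuse_claim(elapsed_seconds, claim_worklife, next_job_fits):
    """Return True if another job may be started on the existing claim."""
    # Both conditions from the text must hold: the time used so far is
    # below CLAIM_WORKLIFE, and a queued job fits in the slot.
    return elapsed_seconds < claim_worklife and next_job_fits

print(can_reuse_claim(300, 1200, True))    # True: time left and a job fits
print(can_reuse_claim(1500, 1200, True))   # False: worklife exceeded
print(can_reuse_claim(300, 1200, False))   # False: no job fits the slot
```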



These charts are really nice to show how one can build a robust system from a number of disparately connected parts.

 

Thanks, and good luck with your talk,

 

-greg