Because I will be talking to RSEs who might be skeptical that the extra process steps have tangible benefits, I'd like to be able to explain some of the robustness features enabled by this design.
I will definitely test out these and other daemon questions on my local minicondor instance, but I figured I'd ask on these two first.
On 4/4/22 12:21, West, Matthew wrote:
Thanks Greg for finding the diagrams (slides 23-32) from T.T. I was looking for. The notes on slides 28-30 of the PPTX version go into finer sequential detail about the steps in Claim acquisition. Some additional questions:
- What is Q on slide 27? I understand that J is the job classad and S is the classad for the execute machine.
An important part of the design is that we want the system as a whole to be able to support more jobs than any one scheduler process can hold (even though there are many sites with just one schedd). So, when the schedd wants to tell the collector that it needs matches from the negotiator, it can't just upload all the jobs to the collector, as there might be way too many. Instead, it sends "submitter records", which condense the requests into a single classad record per submitter, with the number of requests each submitter in that schedd has. If you are curious, these are visible with the "condor_status -submitter" command. I'm not sure how we got "Q" out of that.
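To make the "one record per submitter" idea concrete, here is a minimal Python sketch. It is only an illustration of the condensing step, not how the schedd is implemented: the function name and the (owner, status) job tuples are hypothetical, and the record keys are modeled loosely on what "condor_status -submitter" reports.

```python
from collections import Counter


def summarize_submitters(jobs):
    """Collapse (owner, status) job tuples into one summary record per owner.

    Hypothetical helper for illustration only; the real schedd builds
    submitter classads internally.
    """
    idle = Counter()
    running = Counter()
    for owner, status in jobs:
        if status == "idle":
            idle[owner] += 1
        elif status == "running":
            running[owner] += 1
    owners = sorted(set(idle) | set(running))
    # One small record per submitter, regardless of how many jobs they have.
    return [
        {"Name": o, "IdleJobs": idle[o], "RunningJobs": running[o]}
        for o in owners
    ]


# Six jobs from two submitters condense to two records.
jobs = [("alice", "idle")] * 3 + [("alice", "running")] + [("bob", "idle")] * 2
records = summarize_submitters(jobs)
```

The point is that what gets sent to the collector grows with the number of submitters, not with the number of jobs in the queue.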
- Does the Shadow talk to the Startd and tell it to make a Starter?
Yes. Once the schedd has been given a slot to use by the negotiator, the schedd "claim"s the slot, for exclusive (but time-limited) use by that schedd. Assuming that succeeds, the shadow "activate"s the claim to run a single job, which causes the startd to create the starter.
- Where does file transfer go (inbound and outbound) in these steps?
File transfer is handled by the shadow and the starter. Input transfer happens right after activation, and output transfer after the job completes; the claim is still active during both.
- Are there additional communications between processes once a single job is completed?
Once the first job on a claim completes, if the amount of time it took is less than CLAIM_WORKLIFE, and the schedd can find another job that fits in the slot, it is free to launch another starter to reuse the existing claim, but with a new activation for
the new job.
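As a rough sketch of that reuse rule, the Python below models one claim that accepts new activations only while its accumulated time is under a worklife limit. The Claim class, its fields, and drain_queue are hypothetical names for illustration; the model also assumes every queued job fits the slot and ignores everything else the real daemons check.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    worklife: float          # seconds; stands in for CLAIM_WORKLIFE
    elapsed: float = 0.0     # time already spent on this claim
    activations: list = field(default_factory=list)

    def can_activate(self):
        # A new job may start only while the claim is younger than worklife.
        return self.elapsed < self.worklife

    def run(self, job_name, runtime):
        """Activate the claim for one job; return False if the claim is too old."""
        if not self.can_activate():
            return False
        self.activations.append(job_name)
        self.elapsed += runtime  # a job already started is allowed to finish
        return True


def drain_queue(claim, queue):
    """Run (name, runtime) jobs on one claim; return names that did not start."""
    leftover = []
    for name, runtime in queue:
        if not claim.run(name, runtime):
            leftover.append(name)
    return leftover


claim = Claim(worklife=1200)
leftover = drain_queue(claim, [("a", 500), ("b", 800), ("c", 100)])
```

In this example jobs "a" and "b" each get their own activation on the same claim, while "c" would need a fresh match because the claim has passed its worklife by the time "b" finishes.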
These charts are really nice to show how one can build a robust system from a number of disparately connected parts.
Thanks, and good luck with your talk,