
Re: [HTCondor-users] HTCondor diagram of daemons

I finished the presentation but as I am not an expert in the material, I would appreciate the community here taking a look. Are there particular rules about email attachments for this list?




From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Greg Thain
Sent: 06 April 2022 05:52 PM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] HTCondor diagram of daemons




On 4/5/22 16:36, West, Matthew wrote:

Because I will be talking to RSEs who might be skeptical that the extra process steps have tangible benefits, I'd like to be able to explain some of the robustness features enabled by this design.


  1. If the Schedd goes down, what happens to the work on execute machines when it finishes? Would the Shadow still be running so the job output would be transferred?
  2. Similarly, if the Startd stops, would work carry on as normal for the jobs currently running?


I will definitely test out these and other daemon questions on my local minicondor instance, but I figured I'd ask on these two first.



I'm really enjoying these questions, as they get at the heart of what HTCondor does. One of the goals of HTCondor is to reliably run workflows of jobs to completion in the presence of networks, operating systems and machines that appear to be out to get us. At first, one might think that this means that HTCondor should take extreme measures to keep running any job that it has started, but it turns out that this isn't quite our prime directive. The more important property to maintain is that, in the presence of errors and crashes, we leave the machine in a state where we can continue to operate after a reboot or restart. So, we seek to "manage" everything we create, so that we can measure its usage and clean up after it when needed. We try never to have a job or process running without supervision.


To maintain these properties, the startd manages the starter, and the starter manages the job. If the startd goes away, the starter notices, kills the job, cleans up and exits, leaving a "clean" machine behind it.
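
For a presentation, the execute-side chain can be sketched as a toy model. This is only an illustration of the supervision idea described above, not HTCondor code: the class names, the `check_parent` method and the scratch file list are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    running: bool = True


@dataclass
class Startd:
    alive: bool = True


@dataclass
class Starter:
    """Toy starter: supervises one job and is itself supervised by a startd."""
    job: Job
    # Hypothetical scratch files the job has created on the execute machine.
    scratch_files: list = field(default_factory=lambda: ["scratch/out.tmp"])
    alive: bool = True

    def check_parent(self, startd: Startd) -> None:
        # If the supervising startd is gone, never leave the job unsupervised:
        # kill it, clean up, and exit.
        if self.alive and not startd.alive:
            self.job.running = False
            self.scratch_files.clear()
            self.alive = False


startd = Startd()
starter = Starter(job=Job())

starter.check_parent(startd)   # startd is up, so nothing changes
startd.alive = False           # the startd goes away
starter.check_parent(startd)   # the starter notices and tears everything down
```

After the second check the job is stopped, the scratch list is empty and the starter has exited, which is the "clean" machine in this model.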

In a similar way, on the submit side, the schedd manages shadows.  If the schedd dies, the shadows exit, but because the starter is managing the job, the job continues running.  Now, if the schedd never comes back, we don't want the job to run forever, so in this case, there is a lease, a timeout whereby the starter will only continue to run the job for some fixed amount of time before it gives up hope that the schedd is returning.  At that point, it kills the job.
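
The lease can be sketched in the same toy style. Again, this is a minimal model under stated assumptions rather than the real implementation: `LeasedStarter`, `renew` and `tick` are hypothetical names, the lease length used below is an arbitrary illustrative value, and the assumption that any contact from the submit side resets the clock is mine.

```python
from dataclasses import dataclass


@dataclass
class LeasedStarter:
    """Toy starter that keeps its job running only while the lease holds."""
    lease_seconds: float
    last_contact: float = 0.0
    job_running: bool = True

    def renew(self, now: float) -> None:
        # Assumption: hearing from the submit side resets the lease clock.
        if self.job_running:
            self.last_contact = now

    def tick(self, now: float) -> bool:
        # Once the lease has expired with no contact, give up and kill the job.
        if self.job_running and now - self.last_contact > self.lease_seconds:
            self.job_running = False
        return self.job_running


starter = LeasedStarter(lease_seconds=600)  # illustrative value only
starter.renew(now=100)       # last contact before the schedd dies
print(starter.tick(now=500))  # True: still within the lease
print(starter.tick(now=701))  # False: lease expired, job killed
```

If the schedd does come back and makes contact before the lease runs out, `renew` moves the deadline forward and the job carries on.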

I hope this helps with your questions and your quest,