Hi Greg & Todd,
I don't expect, and kind of don't want, regular submitters to care about the guts of the system. I am asking these sorts of questions because I am moving from an RSE to a SysAdmin role here at Exeter, and these are the bits of lore that aren't easy to fit in online docs. Also, if I want to convince my line manager to use HTCondor on our next system, I have to be able to answer technical questions from folks who aren't familiar with the tool suite.
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Greg Thain
On 4/5/22 16:36, West, Matthew wrote:
I'm really enjoying these questions, as they get at the heart of what HTCondor does. One of the goals of HTCondor is to reliably run workflows of jobs to completion in the presence of networks, operating systems and machines that appear to be out to get us. At first, one might think that this means that HTCondor should take extreme measures to keep running any job that it has started, but it turns out that this isn't quite our prime directive. The more important property to maintain is that, in the presence of errors and crashes, we leave the machine in a state where we can continue to operate after a reboot or restart. So, we seek to "manage" everything we create, so that we can measure its usage and clean up after it when needed. We try never to have a job or process running without supervision.
So, to maintain these properties, the startd manages the starter, and the starter manages the job. If the startd goes away, the starter notices, kills the job, cleans up and exits, leaving a "clean" machine behind it.
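As a rough illustration of that supervision chain, here is a toy Python model. It is only a sketch of the idea, not HTCondor's implementation: the `Job` and `Starter` classes, the `check_parent` method and the scratch file name are all hypothetical names made up for this example.

```python
class Job:
    """Toy stand-in for a running job and the scratch files it created."""

    def __init__(self):
        self.running = True
        self.scratch_files = ["scratch.tmp"]  # hypothetical file name


class Starter:
    """Toy stand-in for a starter that supervises exactly one job."""

    def __init__(self, job):
        self.job = job
        self.alive = True

    def check_parent(self, startd_alive):
        """Called periodically; tears everything down if the startd is gone."""
        if startd_alive:
            return
        self.job.running = False        # kill the job
        self.job.scratch_files.clear()  # clean up what the job left behind
        self.alive = False              # then exit, leaving a clean machine


job = Job()
starter = Starter(job)
starter.check_parent(startd_alive=True)   # startd is fine: nothing happens
starter.check_parent(startd_alive=False)  # startd is gone: job killed, cleanup, exit
```

The point of the sketch is the ordering: the supervisor never exits before the thing it supervises has been stopped and cleaned up.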
In a similar way, on the submit side, the schedd manages shadows. If the schedd dies, the shadows exit, but because the starter is managing the job, the job continues running. Now, if the schedd never comes back, we don't want the job to run forever, so in this case, there is a lease, a timeout whereby the starter will only continue to run the job for some fixed amount of time before it gives up hope that the schedd is returning. At that point, it kills the job.
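The lease can be pictured with another toy Python model. Again, this is a sketch under stated assumptions rather than HTCondor's code: the `StarterLease` class, its methods and the 2400-second duration are invented for illustration. In a real pool the duration is, as far as I know, set per job with the `job_lease_duration` submit command, so check the manual for your version.

```python
class StarterLease:
    """Toy model of a lease that the submit side must keep renewing."""

    def __init__(self, duration_s, now):
        self.duration_s = duration_s
        self.expires_at = now + duration_s
        self.job_running = True

    def renew(self, now):
        """The submit side checked in: push the expiry forward."""
        if self.job_running:
            self.expires_at = now + self.duration_s

    def tick(self, now):
        """Starter-side check: kill the job once the lease has run out."""
        if self.job_running and now >= self.expires_at:
            self.job_running = False
        return self.job_running


lease = StarterLease(duration_s=2400, now=0)  # illustrative duration only
lease.renew(now=1000)                 # schedd still alive, lease now ends at 3400
still_running = lease.tick(now=3000)  # schedd died after t=1000, lease not yet expired
killed = not lease.tick(now=3400)     # lease expired with no renewal: job is killed
```

If the schedd comes back and renews before the expiry, the job simply carries on; only a schedd that stays away for the whole lease causes the kill.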
I hope this helps with your questions and your quest,