I finished the presentation, but as I am not an expert in the material, I would appreciate the community here taking a look. Are there particular rules about email attachments for this list?
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Greg Thain
On 4/5/22 16:36, West, Matthew wrote:
I'm really enjoying these questions, as they get at the heart of what HTCondor does. One of the goals of HTCondor is to reliably run workflows of jobs to completion in the presence of networks, operating systems and machines that appear to be out to get us. At first, one might think that this means HTCondor should take extreme measures to keep running any job that it has started, but it turns out that this isn't quite our prime directive. The more important property to maintain is that, in the presence of errors and crashes, we leave the machine in a state where we can continue to operate after a reboot or restart. So, we seek to "manage" everything we create, so that we can measure its usage and clean up after it when needed. We try never to have a job or process running without supervision.
To maintain these properties, the startd manages the starter, and the starter manages the job. If the startd goes away, the starter notices, kills the job, cleans up and exits, leaving a "clean" machine behind it.
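As a rough illustration of that supervision chain, here is a toy Python model. It is only a sketch: the class names, the `check_parent` method and the `job_sandbox` entry are made up for this example and are not HTCondor code or APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    running: bool = True


@dataclass
class Starter:
    """Toy stand-in for a starter; hypothetical, not HTCondor code."""
    job: Job
    # Things the starter created and is responsible for cleaning up.
    scratch: list = field(default_factory=lambda: ["job_sandbox"])
    alive: bool = True

    def check_parent(self, startd_alive: bool) -> None:
        # If the supervising startd is gone, stop the job, remove
        # everything we created, then exit ourselves.
        if not startd_alive and self.alive:
            self.job.running = False
            self.scratch.clear()
            self.alive = False


starter = Starter(job=Job())
starter.check_parent(startd_alive=True)   # startd still there: nothing changes
starter.check_parent(startd_alive=False)  # startd gone: kill job, clean up, exit
```

The point of the sketch is the ordering: the job is stopped and the sandbox removed before the starter itself goes away, so nothing is left running unsupervised.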
In a similar way, on the submit side, the schedd manages shadows. If the schedd dies, the shadows exit, but because the starter is managing the job, the job continues running. Now, if the schedd never comes back, we don't want the job to run forever, so in this case, there is a lease, a timeout whereby the starter will only continue to run the job for some fixed amount of time before it gives up hope that the schedd is returning. At that point, it kills the job.
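The lease can be pictured with another small Python sketch. Again, this is an assumption-laden toy: `LeasedJob`, `renew`, `tick` and the 60-second lease are invented for illustration and say nothing about how HTCondor implements or configures its lease.

```python
from dataclasses import dataclass


@dataclass
class LeasedJob:
    """Toy model of a job kept alive by a lease; hypothetical names."""
    lease_seconds: float
    last_contact: float = 0.0
    running: bool = True

    def renew(self, now: float) -> None:
        # The submit side checked in: the lease clock starts over.
        if self.running:
            self.last_contact = now

    def tick(self, now: float) -> bool:
        # Execute side: keep running until the lease has expired
        # without any contact, then kill the job.
        if self.running and now - self.last_contact > self.lease_seconds:
            self.running = False
        return self.running


job = LeasedJob(lease_seconds=60)
job.renew(now=10)                    # last contact from the submit side at t=10
still_running = job.tick(now=50)     # 40s of silence: within the lease
after_expiry = job.tick(now=71)      # 61s of silence: lease expired, job killed
```

If the submit side comes back and renews before the lease runs out, the job simply carries on; only sustained silence ends it.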
I hope this helps with your questions and your quest,