Because I will be talking to RSEs who might be skeptical that the extra process steps have tangible benefits, I'd like to be able to explain some of the robustness features enabled by this design.
- If the Schedd goes down, what happens to the work on execute machines when it finishes? Would the Shadow still be running so the job output would be transferred?
- Similarly, if the Startd stops, would work carry on as normal for the jobs currently running?
I will definitely test out these and other daemon questions on my local minicondor instance, but I figured I'd ask on these two first.
I'm really enjoying these questions, as they get at the heart of what HTCondor does. One of the goals of HTCondor is to reliably run workflows of jobs to completion in the presence of networks, operating systems and machines that appear to be out to get us. At first, one might think that this means that HTCondor should take extreme measures to keep running any job that it has started, but it turns out that this isn't quite our prime directive. The more important property to maintain is that, in the presence of errors and crashes, we leave the machine in a state where we can continue to operate after a reboot or restart. So, we seek to "manage" everything we create, so that we can measure its usage and clean up after it when needed. We try to never have a job or process that is running without supervision.
To maintain these properties, the startd manages the starter,
and the starter manages the job. If the startd goes away, the
starter notices, kills the job, cleans up, and exits, leaving
a "clean" machine behind it.
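As a rough illustration of that supervision chain, here is a toy Python model. It is not HTCondor code: the class names, the `poll` method, and the boolean `startd_alive` flag are all made up for this sketch, and a real starter detects the loss of its startd through operating-system mechanisms rather than a flag passed in by the caller.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical stand-in for a running user job."""
    running: bool = True


@dataclass
class Starter:
    """Toy starter: supervises one job on behalf of a startd."""
    job: Job = field(default_factory=Job)
    scratch_cleaned: bool = False
    exited: bool = False

    def poll(self, startd_alive: bool) -> str:
        """One supervision step.

        While the startd is alive, keep supervising. Once it is gone,
        kill the job, clean up, and exit, so nothing is left running
        without a supervisor.
        """
        if self.exited:
            return "exited"
        if startd_alive:
            return "supervising"
        self.job.running = False      # kill the job
        self.scratch_cleaned = True   # remove what we created
        self.exited = True            # then exit ourselves
        return "exited"


starter = Starter()
first = starter.poll(startd_alive=True)    # startd is fine, job keeps running
second = starter.poll(startd_alive=False)  # startd vanished, tear everything down
```

The point of the sketch is only the ordering: the job is never left running once its chain of supervisors is broken.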
In a similar way, on the submit side, the schedd manages shadows. If the schedd dies, the shadows exit, but because the starter is managing the job, the job continues running. If the schedd never comes back, though, we don't want the job to run forever, so there is a lease: a timeout that limits how long the starter will keep running the job before it gives up hope that the schedd is returning. At that point, it kills the job.
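The lease idea can be sketched in a few lines of Python. Again, this is a toy model under stated assumptions, not the real implementation: the `Lease` class, `starter_action`, and the numbers used are hypothetical, and time is passed in explicitly as plain seconds so the behavior is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Hypothetical lease held by the starter on behalf of the submit side."""
    duration: float      # seconds the starter will wait without contact
    last_renewed: float  # time of the last contact from the submit side

    def renew(self, now: float) -> None:
        """Submit side is back in touch: restart the clock."""
        self.last_renewed = now

    def expired(self, now: float) -> bool:
        """True once more than `duration` seconds have passed without contact."""
        return now - self.last_renewed > self.duration


def starter_action(lease: Lease, now: float) -> str:
    """Decide what the toy starter does at time `now`."""
    return "kill_job" if lease.expired(now) else "keep_running"


lease = Lease(duration=100.0, last_renewed=0.0)
early = starter_action(lease, now=50.0)   # schedd gone, but lease still valid
late = starter_action(lease, now=150.0)   # lease ran out, give up on the schedd
```

If the schedd returns and renews the lease before it runs out, the job simply carries on; only a schedd that stays away longer than the lease causes the job to be killed.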
I hope this helps with your questions and your quest,