
Re: [HTCondor-users] Checkpoints in HTCondor



The vanilla-universe application-level checkpoint capability is present, but not documented, in recent 8.8 stable releases. It works as documented in the 8.9 manual under "Self-Checkpointing Applications," except that the "CheckpointExitCode" attribute should be "SuccessCheckpointExitCode."

I set MY.SuccessCheckpointExitCode to 85 (the ERESTART errno) and have my code write out a checkpoint and then exit with that code. I also specify output file transfer for the checkpoint file so that it is saved during a vacate. When the process exits with that code, the condor_starter immediately restarts the job in the same slot and scratch space with the same arguments. It's a very quick and seamless operation, and it's hard to even tell it has happened without looking carefully. The NumJobStarts attribute and other checkpoint-related attributes don't even change - but that may just be an incomplete implementation of an undocumented feature in my 8.8.7 installation.
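
For anyone who wants a starting point, here is a rough Python sketch of the application side of what I described above. This is not my production code. The checkpoint file name, the step counts, and the "work" in the loop are placeholders made up for illustration, and it assumes the submit file sets MY.SuccessCheckpointExitCode = 85 and lists the checkpoint file for output transfer. Because the starter restarts the job with the same arguments, the program has to detect and load the checkpoint on its own.

```python
import json
import os
import tempfile

# Must match MY.SuccessCheckpointExitCode in the submit file (85 in my setup).
CHECKPOINT_EXIT_CODE = 85
# Hypothetical file name, relative to the job's scratch directory. It needs to
# be listed in the job's output file transfer so it survives a vacate.
CHECKPOINT_FILE = "state.ckpt.json"
# Hypothetical sizes for the placeholder workload.
TOTAL_STEPS = 1000
STEPS_PER_CHECKPOINT = 100


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temp file and rename, so an interrupted write cannot
    leave a truncated checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def run(path=CHECKPOINT_FILE, total_steps=TOTAL_STEPS,
        steps_per_checkpoint=STEPS_PER_CHECKPOINT):
    """Resume from the checkpoint, do one slice of work, and return the
    exit code this invocation should finish with."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["total"] += state["step"]  # stand-in for the real computation
        state["step"] += 1
        at_boundary = state["step"] % steps_per_checkpoint == 0
        if at_boundary and state["step"] < total_steps:
            save_checkpoint(path, state)
            return CHECKPOINT_EXIT_CODE  # ask the starter to restart us
    save_checkpoint(path, state)
    return 0  # all work done: a normal exit ends the job


# In a real job script, finish with: sys.exit(run())
```

The key design point is that the checkpoint exit code is only returned after the checkpoint has been written completely, and a plain exit code of 0 is reserved for when the work is actually finished.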

I'm in the process of writing up materials to encourage my users to take advantage of it - I've had one too many 60-day jobs (namely, one) lost to electrical maintenance shutdowns because the user was writing checkpoint files to an ephemeral path inside a container instead of to scratch space, and didn't implement checkpoint-resume code in any case despite my recommendations.

I was excited to see the vanilla universe checkpoint capability when it was described at HTCondor Week a couple of years ago, and I'm pleased that it's available now. I encourage you to check it out.

Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company