
Re: [HTCondor-users] Checkpoints in HTCondor



> It works as documented in the 8.9 manual under "Self-Checkpointing Applications," except that the "CheckpointExitCode" attribute should be "SuccessCheckpointExitCode."

	This will, of course, be fixed in the next release. :)

> It's hard to even tell it has happened without looking at it carefully. The NumJobStarts attribute and other checkpoint-related attributes don't even change - but that may just be an incomplete implementation of an undocumented feature in my 8.8.7 installation.

I'm curious to know what people think of this decision. The exit-and-restart system arose from the need to make sure that (a) the application was done writing its checkpoint and (b) the application wouldn't try to update the checkpoint until HTCondor was done transferring it off the local machine (back to the schedd's SPOOL directory, presently). As such, I've been regarding it as an implementation detail, where incrementing NumJobStarts would actually be less useful and more confusing than retaining its current value.
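
For anyone following along, here is a rough sketch of what the application side of the exit-and-restart cycle might look like. It is only an illustration, not code from HTCondor or the manual: the exit code 85, the file name state.json, and the helper names are all made up for the example, and the exit code would need to match whatever SuccessCheckpointExitCode is set to for the job.

```python
import json
import os

# Assumed value for illustration; it must match the job's SuccessCheckpointExitCode.
CHECKPOINT_EXIT_CODE = 85
# Hypothetical checkpoint file name.
STATE_FILE = "state.json"


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_state(path, state):
    """Write the checkpoint to a temp file, then rename it into place,
    so a half-written checkpoint is never left under the real name."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def run(path=STATE_FILE, total_steps=10, checkpoint_every=4):
    """Resume from the last checkpoint, work until the next checkpoint or
    until done, and return the exit code the process should exit with."""
    state = load_state(path)
    while state["step"] < total_steps:
        state["total"] += state["step"]  # stand-in for real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0 and state["step"] < total_steps:
            # Finish writing the checkpoint, then stop touching it by exiting.
            save_state(path, state)
            return CHECKPOINT_EXIT_CODE
    save_state(path, state)
    return 0
```

The real executable would end with something like sys.exit(run()). Because the process exits right after writing the checkpoint, it cannot modify the file while it is being transferred, which is the point of (a) and (b) above.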

The other checkpoint-related attributes were inherited from the (removed) standard universe, and we intend to remove them as we clean up the code. Do let us know if any of those specific attributes are of interest for self-checkpointing vanilla-universe jobs, or what other attributes you find use cases for.

- ToddM