
Re: [HTCondor-users] transfer_in/output_files only if they exist



Follow-up question: is there a way to set something like

periodic_transfer_spool = 3600

so that the contents of the job's spool directory can be transferred back to the shadow's spool periodically? In combination with ON_EXIT_OR_EVICT that would give me periodic checkpointing if the job dies unexpectedly, in addition to when it is cleanly evicted.

We have some experimental (meaning, probably at least partially broken) features intended to support this kind of use case. They're both designed around the observation that HTCondor has no real way of knowing when it's safe to transfer the job sandbox if the job is still running, but that if you're creating checkpoints, your job is going to know how to restart from them.

If you want the job to checkpoint periodically, you can ask HTCondor to send it a signal every so often. When the job then exits successfully, HTCondor performs file transfer (as if the job had been evicted), but instead of putting the job back into the queue, HTCondor restarts it right where it was running.

If the job generates checkpoints on its own, you can instead configure HTCondor to recognize a particular exit code: for example, an exit with code 88 would mean that HTCondor should perform file transfer (as if the job had been evicted) and then restart the job right where it had been running.
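As a sketch of the self-checkpointing case, the job below does a fixed amount of work, saves its state, and exits with code 88 so that it can be restarted and pick up from the saved state. Only the value 88 comes from the message above; the file name state.txt, the interval of 100 steps, and the function names are made up for the example.

```python
import os

CHECKPOINT_EXIT_CODE = 88     # example exit code from the message above
STATE_FILE = "state.txt"      # hypothetical checkpoint file name
STEPS_PER_CHECKPOINT = 100    # hypothetical amount of work between checkpoints


def read_state(path=STATE_FILE):
    """Return the last completed step, or 0 on a fresh start."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return int(f.read().strip())


def write_state(step, path=STATE_FILE):
    """Write to a temporary file and rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(step))
    os.replace(tmp, path)


def run_once(total_steps, path=STATE_FILE, interval=STEPS_PER_CHECKPOINT):
    """Work from the last checkpoint; return the exit code for this run."""
    step = read_state(path)
    done_this_run = 0
    while step < total_steps:
        step += 1  # placeholder for one unit of real work
        done_this_run += 1
        if done_this_run == interval and step < total_steps:
            write_state(step, path)
            return CHECKPOINT_EXIT_CODE  # "checkpoint written, please restart me"
    return 0  # all work finished: a normal successful exit
```

A job script would finish with sys.exit(run_once(...)). With 250 total steps and the default interval, the first two runs would exit with 88 and the third would exit with 0.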

See the following page on our Wiki for details, and do please let me know if either feature works for you. Thanks.

https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=ExperimentalSupportForPeriodicCheckpointingInVanillaUniverse

- ToddM