
[HTCondor-users] Implementing checkpointing via job wrappers



Hello Condor Users,

We're currently looking into expanding our HTCondor setup to include desktop resources (it previously consisted only of glideins and dedicated worker nodes), so I'm investigating whether and how best to provide checkpointing capabilities. The problem is that our users' workflows depend heavily on shell scripts for flow control and organisational tasks. Is there a recommended procedure for handling such jobs under preemption?

Practically all jobs are run by our own job submission tool, so we can modify its wrapper layer (implemented as a shell script). I was thinking about issuing standalone checkpoints [1] and restoring from checkpoint files if any are present on startup. How must the HTCondor job be set up to fetch these manual checkpoints on eviction and transfer them back on restart?
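
To make the idea concrete, here is a rough sketch of the startup decision I have in mind for the wrapper. All names in it (CKPT_FILE, choose_action, run_payload) are placeholders of my own, not HTCondor or BLCR interfaces, and the actual start and restore commands are left as parameters because they depend on the checkpointing tool.

```shell
#!/bin/sh
# Sketch only: startup logic of a job wrapper that resumes from a
# checkpoint file when one is present. All names are hypothetical.

# Assumed default location of the checkpoint file in the job sandbox.
CKPT_FILE="${CKPT_FILE:-job.ckpt}"

# Print "restore" if a non-empty checkpoint file exists, else "start".
choose_action() {
    if [ -s "$1" ]; then
        echo restore
    else
        echo start
    fi
}

# Run the payload, resuming from the checkpoint when one is present.
# $1: checkpoint file, $2: command for a fresh start,
# $3: command for a restore (the checkpoint path is appended to it).
run_payload() {
    ckpt=$1
    start_cmd=$2
    restore_cmd=$3
    case "$(choose_action "$ckpt")" in
        restore) $restore_cmd "$ckpt" ;;
        start)   $start_cmd ;;
    esac
}
```

The wrapper would then end with something like `run_payload "$CKPT_FILE" ./payload.sh ./restore_tool`, where both commands are stand-ins. What this sketch does not cover is the part I am asking about: getting the checkpoint file out of the sandbox on eviction and back in on restart.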

Are there any guides, hints or tutorials for using external checkpointing such as BLCR [2]?

Cheers,
  Max

[1]
http://research.cs.wisc.edu/htcondor/manual/v7.8/4_2HTCondor_s_Checkpoint.html#sec:standalone-ckpt

[2]
https://ftg.lbl.gov/projects/CheckpointRestart/

--
Dipl.-Phys. Max Fischer
Karlsruhe Institute of Technology (KIT)
Steinbuch Centre for Computing (SCC)
Institute of Experimental Nuclear Physics (IEKP)
email:  max.fischer@xxxxxxx
phone:  +49 721 608 28328 (SCC)
        +49 721 608 43369 (IEKP)