
Re: [HTCondor-users] Unable to run a standard universe job.



On 6/18/2019 1:29 PM, Michael Murphy wrote:
> With the standard universe being retired(?), will checkpointing be
> available in other universes?
> 

Yep.

We are already working on decent support for jobs that write out their own checkpoint
information, as described here:
  https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToRunSelfCheckpointingJobs
This support is already available in HTCondor v8.8, and is being made easier to use
in HTCondor v8.9.3 as we get more feedback from beta users.
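
For illustration only, here is a minimal sketch of what the application side of a
self-checkpointing job might look like. It is not taken from the wiki page above. The
file name state.json, the exit code 85, and the step counts are all hypothetical, and
the sketch assumes the convention that the job exits with an agreed-upon code after
writing a checkpoint so that it can be restarted later. How HTCondor is told about
that exit code and the checkpoint file is version-specific and not shown here; the
wiki page is the place to look for that.

```python
import json
import os
import tempfile

# All names and values below are hypothetical, chosen for illustration.
CHECKPOINT_FILE = "state.json"   # where the job saves its own progress
CHECKPOINT_EXIT_CODE = 85        # "I wrote a checkpoint, please restart me"
TOTAL_STEPS = 1000               # total amount of work in the job
STEPS_PER_RUN = 250              # work done between checkpoints


def load_state(path=CHECKPOINT_FILE):
    """Resume from the checkpoint file if present, otherwise start fresh."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def save_state(state, path=CHECKPOINT_FILE):
    """Write the checkpoint atomically so a kill mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename over the old checkpoint


def run(path=CHECKPOINT_FILE, total_steps=TOTAL_STEPS, steps_per_run=STEPS_PER_RUN):
    """Do one slice of work, checkpoint, and return the exit code to use."""
    state = load_state(path)
    stop_at = min(state["step"] + steps_per_run, total_steps)
    while state["step"] < stop_at:
        state["total"] += state["step"]  # stand-in for the real computation
        state["step"] += 1
    save_state(state, path)
    # 0 means the job is finished; the checkpoint code means "more to do".
    return 0 if state["step"] >= total_steps else CHECKPOINT_EXIT_CODE
```

Used as a job script, this would end with something like sys.exit(run()), so that each
invocation picks up where the previous one stopped and only the final one exits with 0.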

We are also regularly evaluating the Linux CRIU (Checkpoint/Restore In Userspace) project,
and once we think it is ready for prime time we will investigate using it to transparently
checkpoint docker universe and/or vanilla universe jobs.  When we last looked at CRIU several
months ago, it worked well for a checkpoint/restart on the same machine, but it still
needed work before a job could be restarted on a different host.  However, it looks like a
new version of CRIU was just released to Fedora, so we will be checking it out again soon.

Hope the above helps,
Todd


-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685