
[HTCondor-users] VM Suspend/Resume



Hi Todd,

Here is the summary of how we configured HTCondor so that jobs could survive a VM being suspended for up to 24 hours. The context for this is the vLHC@home volunteer computing project where machines are used opportunistically and the VMs they run are suspended, either in memory or to disk, when the machine is in use by the volunteer.

The Startd is running in the VM that is to be suspended. Firstly, the value of MAX_TIME_SKIP needs to be increased from (60*20)s to 86400s (24h). This value is currently hard-coded, so we had to patch the library. Without this change, the daemons are restarted if the time skips by more than 20 minutes, and the jobs are lost.

https://github.com/htcondor/htcondor/blob/master/src/condor_daemon_core.V6/daemon_core.cpp#L52

Next, the NOT_RESPONDING_TIMEOUT value needs to be increased from the default of 1h to 86400s (24h), again to stop the daemons being restarted if there is a time skip. If this is not done, the master concludes after the skip that the child has hung.

The CCB_HEARTBEAT_INTERVAL needs to be set so that closed connections are repaired. We use 300s rather than the default of 20 minutes to speed up the repair.

The SEC_DEFAULT_SESSION_LEASE was set to 86400s so that the same ccbid will be used on reconnection.
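The three Startd-side settings above could be collected into a local configuration fragment along these lines. This is only a sketch: the knob names and values are the ones given above, but which file they go in is an assumption that depends on your installation. MAX_TIME_SKIP is not included because it is not a configuration knob and has to be patched in the source.

```text
# Local configuration on the VM that runs the Startd
# (file location is an assumption, e.g. a file under the local config directory)

# Do not let the master treat a daemon as hung after a long suspend (default 1h)
NOT_RESPONDING_TIMEOUT = 86400

# Check and repair the CCB connection every 5 minutes
CCB_HEARTBEAT_INTERVAL = 300

# Keep security sessions for 24h so the same ccbid is used on reconnection
SEC_DEFAULT_SESSION_LEASE = 86400
```

After changing these, the daemons in the VM need to pick up the new values (a restart is the safe option) before the first suspend.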

On the Collector machine we had to increase the CLASSAD_LIFETIME from the default of 15 minutes to 86400s (24h) so that the Collector would not forget about the VM. The CCB_SWEEP_INTERVAL was increased to 86400s (24h) so that connections which may have been closed are not cleaned up prematurely. The SEC_DEFAULT_SESSION_LEASE is also set here to 86400s (24h).
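As a sketch, the Collector-side changes described above would look roughly like this in the configuration of the Collector machine (which also acts as the CCB server in our setup). The values are the ones quoted above; the file they live in is an assumption.

```text
# Local configuration on the Collector / CCB machine (file location is an assumption)

# Keep the VM's ads for 24h instead of the default 15 minutes
CLASSAD_LIFETIME = 86400

# Do not sweep away CCB connections that may have been closed by the suspend
CCB_SWEEP_INTERVAL = 86400

# Match the session lease used on the Startd side
SEC_DEFAULT_SESSION_LEASE = 86400
```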

On the Schedd, and on the Schedd of the CE which we use, the SEC_DEFAULT_SESSION_LEASE is also set to 86400s (24h). The ALIVE_INTERVAL on the Schedd was increased from the default of 5 minutes to 86400s (24h).
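The corresponding fragment for the submit side might look like the following. Again this is only a sketch using the values above; the session lease applies to both the Schedd and the Schedd of the CE, while ALIVE_INTERVAL was changed on the Schedd.

```text
# Local configuration on the Schedd and on the Schedd of the CE
SEC_DEFAULT_SESSION_LEASE = 86400

# On the Schedd: raised from the default of 5 minutes
ALIVE_INTERVAL = 86400
```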

Finally, in the job route on the CE, the JobLeaseDuration was set to 86400s (24h) so that the shadow does not die prematurely. We also had to remove the TimerRemove attribute to work around this bug: https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=5470
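For illustration, a route with these two changes could be written in the ClassAd-style JOB_ROUTER_ENTRIES syntax roughly as below. The route name is hypothetical, TargetUniverse = 5 (vanilla) is an assumption about the route, and a real CE route will normally carry further attributes that are omitted here. The relevant parts are the set_ and delete_ lines.

```text
# Job Router configuration on the CE (partial sketch; route name is hypothetical)
JOB_ROUTER_ENTRIES = \
   [ \
     name = "suspendable_vm_route"; \
     TargetUniverse = 5; \
     set_JobLeaseDuration = 86400; \
     delete_TimerRemove = true; \
   ]
```

The set_JobLeaseDuration line overrides the lease on the routed copy of the job, and delete_TimerRemove drops the attribute from the routed copy only, leaving the original job in the CE queue unchanged.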

By doing all the above we managed to resume a job after suspending a VM (both in memory and to disk) for nearly 24 hours. There are two caveats that we have spotted so far. If for any reason the Startd loses contact with the Shadow, it will not be able to send the final exit code of the job and will then wait until the JobLease has expired, which in this case is a long time! The other is that if the network environment changes during a suspend, e.g. you take your laptop home, the CCB connection will not be re-established with the same id and the connection with the Shadow will be lost.

This is by no means a final, perfect solution, just one that we managed to get working, so comments and suggestions are welcome.

Regards,

Laurence