
Re: [HTCondor-users] VM Suspend/Resume



Miron,

As far as I am aware, no signal is available in the guest VM. The host either freezes the VM or freezes it and writes its state to disk. We just have to assume that Startds that are offline have been suspended and may be resurrected, up to a point in time when we will consider them to be expired.

Laurence

On 25/05/16 05:54, MIRON LIVNY wrote:
Laurence,

Thank you for the detailed report.

Would it be possible for the VM to notify the shadow that it is about to suspend? Namely, if we add such a feature for the starter to communicate back to the submitting machine, will the VM have time to do so before suspending?

Miron.

Sent from my iPhone

On May 25, 2016, at 00:14, Laurence Field <Laurence.Field@xxxxxxx> wrote:

Hi Todd,

Here is the summary of how we configured HTCondor so that jobs could survive a VM being suspended for up to 24 hours. The context for this is the vLHC@home volunteer computing project where machines are used opportunistically and the VMs they run are suspended, either in memory or to disk, when the machine is in use by the volunteer.

The Startd is running in the VM that is to be suspended. First, the default value for MAX_TIME_SKIP needs to be increased from (60*20)s (20 minutes) to 86400s (24h). This is currently hard-coded, so we had to patch the library. Without this, the daemons will be restarted if the time skips by more than 20 minutes, resulting in the jobs being lost.

https://github.com/htcondor/htcondor/blob/master/src/condor_daemon_core.V6/daemon_core.cpp#L52

Next, the NOT_RESPONDING_TIMEOUT value needs to be increased from the default of 1h to 86400s (24h), again to stop the daemons being restarted if there is a time skip. If this is not done, the master concludes that the child has hung.

The CCB_HEARTBEAT_INTERVAL needs to be set so that closed connections are repaired; we use 300s rather than the default of 20 minutes to speed up the repair.

The SEC_DEFAULT_SESSION_LEASE was set to 86400s so that the same ccbid will be used on reconnection.
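As a rough sketch, the VM-side settings described above could be written like this in the local HTCondor configuration of the VM. Where the lines go is an assumption; only the knob names and values come from the description above. MAX_TIME_SKIP does not appear because it is hard-coded and has to be patched in the source.

```
# Sketch only: settings for the VM that runs the Startd.
# Do not treat a long time skip as a hung child daemon (default is 1h).
NOT_RESPONDING_TIMEOUT = 86400

# Repair closed CCB connections faster than the 20 minute default.
CCB_HEARTBEAT_INTERVAL = 300

# Keep the security session long enough for the same ccbid to be reused.
SEC_DEFAULT_SESSION_LEASE = 86400
```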

On the Collector machine we had to increase the CLASSAD_LIFETIME from the default of 15 minutes to 86400s (24h) so that the Collector would not forget about the VM. The CCB_SWEEP_INTERVAL was increased to 86400s (24h) so that connections which may have been closed are not cleaned up prematurely. The SEC_DEFAULT_SESSION_LEASE is also set here to 86400s (24h).
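For illustration, the Collector-side changes might look like the following fragment. This is a sketch assuming the values go into the Collector host's local configuration; the knob names and values are the ones listed above.

```
# Sketch only: settings for the Collector machine.
# Keep the VM's ad for 24h instead of the 15 minute default.
CLASSAD_LIFETIME = 86400

# Do not clean up possibly closed CCB connections for 24h.
CCB_SWEEP_INTERVAL = 86400

# Match the session lease used in the VM.
SEC_DEFAULT_SESSION_LEASE = 86400
```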

On the Schedd and on the Schedd of the CE which we use, the SEC_DEFAULT_SESSION_LEASE is also set to 86400s (24h). The ALIVE_INTERVAL on the Schedd was increased from the default of 5 minutes to 86400s (24h).
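The equivalent fragment on the submit side could be as simple as the sketch below, again assuming the settings are placed in the local configuration of each machine involved.

```
# Sketch only: settings on the scheduler side, including the CE.
SEC_DEFAULT_SESSION_LEASE = 86400

# Raised from the 5 minute default to 24h.
ALIVE_INTERVAL = 86400
```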

Finally, in the job route on the CE, the JobLeaseDuration was set to 86400s (24h) so that the shadow does not die prematurely. We also had to remove the TimerRemove attribute to work around this bug: https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=5470.
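One way the route changes could be expressed is with the JobRouter's set_<attr> and delete_<attr> route entries, as in the sketch below. The route name is hypothetical, and whether the route was actually written this way is an assumption; a real route would also carry its usual attributes such as the target universe.

```
[
  name = "Suspendable_VM_Route";
  set_JobLeaseDuration = 86400;
  delete_TimerRemove = true;
]
```

The ClassAd above would be one entry in the CE's job route definition; the value given to delete_TimerRemove only marks the attribute for removal from the routed job.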

By doing all the above we managed to resume a job after suspending a VM (both in memory and to disk) for nearly 24 hours. There are two caveats that we have spotted so far. If for any reason the Startd loses contact with the Shadow, it will not be able to send the final exit code of the job and will then wait until the JobLease has expired, which in this case is a long time! The other is that if the network environment changes during a suspend, e.g. you take your laptop home, the CCB connection will not be established with the same id and the connection with the Shadow will be lost.

This is by no means a final, perfect solution, just one that we managed to get working, so comments and suggestions are welcome.

Regards,

Laurence











_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/