Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

[HTCondor-users] VM Suspend/Resume

Date: Tue, 24 May 2016 23:13:22 +0200
From: Laurence Field <Laurence.Field@xxxxxxx>
Subject: [HTCondor-users] VM Suspend/Resume

Hi Todd,

Here is the summary of how we configured HTCondor so that jobs could survive a VM being suspended for up to 24 hours. The context for this is the vLHC@home volunteer computing project, where machines are used opportunistically and the VMs they run are suspended, either in memory or to disk, when the machine is in use by the volunteer.

The Startd is running in the VM that is to be suspended. Firstly, the default value for MAX_TIME_SKIP needs to be increased from (60*20)s to 86400s (24h). This is currently hard-coded, so we had to patch the library. Without this, the daemons will be restarted if the time skips by more than 20 mins, resulting in the jobs being lost.


https://github.com/htcondor/htcondor/blob/master/src/condor_daemon_core.V6/daemon_core.cpp#L52

Next, the NOT_RESPONDING_TIMEOUT value needs to be increased from the default of 1h to 86400s (24h), again to stop the daemons being restarted if there is a time skip. If this is not done, the master concludes that the child has hung.

The CCB_HEARTBEAT_INTERVAL needs to be set to repair closed connections, and we use 300s rather than the default of 20 mins to speed up the repair.

The SEC_DEFAULT_SESSION_LEASE was set to 86400s so that the same ccbid will be used on reconnection.
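
For reference, the three Startd-side settings above can be collected into one configuration fragment. This is only a sketch using the values quoted in this mail; where the fragment lives (for example a local config file inside the VM image) is an assumption. MAX_TIME_SKIP does not appear because it is hard-coded and has to be patched in the source.

```
# Inside the VM that gets suspended (where the Startd runs).
# Values are the ones quoted above; file placement is an assumption.

# Stop the master treating a long time skip as a hung child (default 1h).
NOT_RESPONDING_TIMEOUT = 86400

# Repair closed CCB connections quickly (default 20 mins).
CCB_HEARTBEAT_INTERVAL = 300

# Keep the security session alive so the same ccbid is reused on reconnection.
SEC_DEFAULT_SESSION_LEASE = 86400
```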

On the Collector machine we had to increase the CLASSAD_LIFETIME from the default of 15 mins to 86400s (24h) so that the Collector would not forget about the VM. The CCB_SWEEP_INTERVAL was increased to 86400s (24h) so that connections which may have been closed are not cleaned up prematurely. The SEC_DEFAULT_SESSION_LEASE is also set here to 86400s (24h).
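
As a sketch, the Collector-side changes described above would look roughly like this, using the values quoted in this mail:

```
# On the Collector machine.

# Keep the VM's ad for a full day instead of the default 15 mins.
CLASSAD_LIFETIME = 86400

# Do not sweep away CCB connections that may only be temporarily closed.
CCB_SWEEP_INTERVAL = 86400

# Match the session lease used inside the VM.
SEC_DEFAULT_SESSION_LEASE = 86400
```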

On the Schedd and on the Schedd of the CE which we use, the SEC_DEFAULT_SESSION_LEASE is also set to 86400s (24h). The ALIVE_INTERVAL on the Schedd was increased from the default of 5 mins to 86400s (24h).
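
A sketch of the corresponding Schedd-side fragment, with the values quoted above. Whether ALIVE_INTERVAL is also needed on the CE's Schedd is not stated here, so treat that line as applying only to the pool Schedd.

```
# On the Schedd (and, for the session lease, on the CE's Schedd as well).

SEC_DEFAULT_SESSION_LEASE = 86400

# Raised from the default of 5 mins (300s).
ALIVE_INTERVAL = 86400
```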

Finally, in the job route on the CE the JobLeaseDuration was set to 86400s (24h) so that the shadow does not die prematurely. We also had to remove the TimerRemove attribute to work around this bug: https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=5470.
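
One way such a route could be written is sketched below. This is not our actual route: the route name is hypothetical, and expressing the TimerRemove removal with the job router's delete_<attr> syntax is an assumption about how it might be done.

```
# Sketch of a job route on the CE. The route name is hypothetical, and using
# delete_TimerRemove for the workaround is an assumption.
JOB_ROUTER_ENTRIES = \
   [ \
     name = "suspendable_vm_route"; \
     TargetUniverse = 5; \
     set_JobLeaseDuration = 86400; \
     delete_TimerRemove = true; \
   ]
```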

By doing all the above we managed to resume a job after suspending (both in memory and to disk) a VM for nearly 24 hours. There are two caveats that we have spotted so far. If for any reason the Startd loses contact with the Shadow, it will not be able to send the final exit code of the job and will then wait until the job lease has expired, which in this case is a long time! The other is that if the network environment changes during a suspend, e.g. you take your laptop home, the CCB connection will not be established with the same id and the connection with the Shadow will be lost.

This is by no means a final, perfect solution, just one that we managed to get working, so comments and suggestions are welcome.


Regards,

Laurence

Follow-Ups:
- Re: [HTCondor-users] VM Suspend/Resume
  - From: Todd Tannenbaum
- Re: [HTCondor-users] VM Suspend/Resume
  - From: MIRON LIVNY
