
Re: [HTCondor-users] VM Suspend/Resume



On 5/24/2016 4:13 PM, Laurence Field wrote:
Hi Todd,


Hi Laurence, thank you very much for this writeup. We are definitely thinking about how we can make improvements and address the shortcomings you identified. Meanwhile, some questions below...


The CCB_HEARTBEAT_INTERVAL needs to be set to repair closed connections
and we use 300s rather than the default of 20mins to speed up the repair.


^^^ The need for this setting is a little surprising... HTCondor should notice asynchronously, at any time, if the TCP connection to the CCB server is closed, so I am guessing that when your VM is resumed the kernel does not know the TCP socket is dead? I.e., are TCP sockets that were open at the time the VM was suspended still considered open upon VM resume?
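
One way to check this from inside the VM might be a small probe like the one below. It is only a sketch, assuming Linux and Python 3; the helper name kernel_thinks_open is made up for illustration and is not part of HTCondor. It reports only the local kernel's view of a socket, which is exactly the thing in question here.

```python
import socket


def kernel_thinks_open(sock):
    """Return True if the local kernel has not yet seen EOF or an error on sock.

    Hypothetical diagnostic helper: it peeks at the socket without blocking
    and without consuming any data.
    """
    try:
        data = sock.recv(1, socket.MSG_PEEK | socket.MSG_DONTWAIT)
    except BlockingIOError:
        # Nothing to read and no close/reset seen: kernel still considers it open.
        return True
    except OSError:
        # For example ECONNRESET: the kernel already knows the connection is dead.
        return False
    # An empty read means the peer closed the connection in an orderly way.
    return data != b""
```

If this keeps returning True for a long time after resume on a connection whose remote end has long since gone away, that would support the guess that the kernel simply has not noticed yet.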

The SEC_DEFAULT_SESSION_LEASE was set to 86400s so that the same ccbid
will be used on reconnection.


^^^ This one is also surprising. The only way I can think that changing SEC_DEFAULT_SESSION_LEASE should matter is if, upon resuming the VM, the system clock still has the incorrect (old) time for some number of seconds, and during that time HTCondor attempts to reconnect to CCB. In other words, if your VM is suspended for 12 hours, when resuming your VM, is the system clock immediately updated to the correct time before processes can run, or is it possible the VM runs for a bit with a clock showing 12 hours in the past? I don't know if the hypervisor typically takes care of this or if it is up to something in the VM (ntpd?) to eventually resync the clock.
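
A rough way to answer the clock question might be to log wall-clock time against the monotonic clock inside the VM across a suspend/resume cycle. The sketch below assumes Python 3; the function names and the 5 second threshold are made up for illustration. Whether the monotonic clock advances during a suspend depends on the hypervisor and guest, so the reported jump is a hint rather than an exact measure of the suspend time.

```python
import time


def wall_clock_jump(prev_wall, prev_mono, now_wall, now_mono):
    """Seconds the wall clock moved beyond what the monotonic clock saw."""
    return (now_wall - prev_wall) - (now_mono - prev_mono)


def watch(interval=1.0, threshold=5.0, samples=600):
    """Print one line per sample and flag wall-clock steps larger than threshold."""
    prev_wall, prev_mono = time.time(), time.monotonic()
    for _ in range(samples):
        time.sleep(interval)
        now_wall, now_mono = time.time(), time.monotonic()
        jump = wall_clock_jump(prev_wall, prev_mono, now_wall, now_mono)
        flag = "  <-- clock stepped" if abs(jump) > threshold else ""
        print(f"{time.ctime(now_wall)}  jump={jump:+.1f}s{flag}")
        prev_wall, prev_mono = now_wall, now_mono
```

Running watch() before suspending and reading the output after resume should show whether any lines were printed with the old time before the step was flagged, i.e. whether processes got to run with a stale clock.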

On the Sched and Sched of the CE which we use, the
SEC_DEFAULT_SESSION_LEASE is also set to 86400s (24h).

^^^ This seems like it is only required if SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION is not being used, which it is by default starting with v8.5.1 of HTCondor. What version of HTCondor were you using for your tests?

The
ALIVE_INTERVAL on the sched was increased from the default of 5 mins to
86400s (24h).


^^^ This one does not seem like it should matter at all if you are running HTCondor v8.4.0 or above... Did you observe problems without this setting or did you just guess it was needed?

We also had to
remove the TimerRemove attribute to work around this bug
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=5470.


^^^ The bit with TimerRemove should not be necessary if using HTCondor v8.4.4 or above, so hoping your tests were with an earlier version (else we may have another bug to fix).

Thanks again very much, your testing and investigative work here has been invaluable!

best,
Todd

--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685