
Re: [HTCondor-users] VM Suspend/Resume

Hi Todd,

On 26/05/16 00:05, Todd Tannenbaum wrote:
On 5/24/2016 4:13 PM, Laurence Field wrote:
Hi Todd,

Hi Laurence, thank you very much for this writeup. We are definitely thinking about how we can make improvements and address the shortcomings you identified. Meanwhile, some questions below...

The CCB_HEARTBEAT_INTERVAL needs to be set to repair closed connections,
and we use 300s rather than the default of 20 minutes to speed up the repair.

^^^ The need for this setting is a little surprising... HTCondor should notice asynchronously, at any time, if the TCP connection to the CCB server is closed, so I am guessing that when your VM is resumed the kernel does not know the TCP socket is dead? I.e., are TCP sockets that were open at the time the VM was suspended still considered open upon VM resume?
Yes. From what I could see, if the VM is paused, the TCP sockets remain open and everything is fine. However, if the VM is suspended to disk, the server receives a signal that the connection has been closed, but the VM thinks the connection is still open; hence the need for the heartbeat.

The SEC_DEFAULT_SESSION_LEASE was set to 86400s so that the same ccbid
will be used on reconnection.

^^^ This one is also surprising. The only way I can think that changing SEC_DEFAULT_SESSION_LEASE would matter is if, upon resuming the VM, the system clock still has the incorrect (old) time for some number of seconds, and during that time HTCondor attempts to reconnect to CCB. In other words, if your VM is suspended for 12 hours, when resuming your VM, is the system clock immediately updated to the correct time before processes can run, or is it possible the VM runs for a bit with a clock showing 12 hours in the past? I don't know if the hypervisor typically takes care of this or if it is up to something in the VM (ntpd?) to eventually resync the clock.
Yes, in my tests the clock was not updated automatically, so it was off by however long the VM had been suspended.
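
For reference, the VM-side settings discussed so far would look roughly like this in the HTCondor configuration inside the VM. This is only a sketch: the values are the ones quoted above, and which configuration file they live in is not stated in this thread.

```
# Inside the VM (worker side); values as quoted in this thread.

# Re-check the CCB connection every 300s so that a connection that was
# silently dropped while the VM was suspended to disk gets repaired.
CCB_HEARTBEAT_INTERVAL = 300

# Keep security sessions valid for 24h so that the same ccbid is used
# on reconnection after a resume.
SEC_DEFAULT_SESSION_LEASE = 86400
```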

On the schedd and the schedd of the CE which we use, the
SEC_DEFAULT_SESSION_LEASE is also set to 86400s (24h).

^^^ This seems like it is only required if SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION is not being used, which it is by default starting with v8.5.1 of HTCondor. What version of HTCondor were you using for your tests?
Authentication is being done with a user proxy. The version in the VM is v8.0.6 while the servers are v8.3.8.

ALIVE_INTERVAL on the schedd was increased from the default of 5 minutes to
86400s (24h).

^^^ This one does not seem like it should matter at all if you are running HTCondor v8.4.0 or above... Did you observe problems without this setting or did you just guess it was needed?
We are not running v8.4.0 or above.
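
For concreteness, the server-side values described above would look roughly like the following. Again, this is a sketch using only the values quoted in this thread; the file placement is an assumption, and Todd's comments suggest these may be unnecessary on newer HTCondor versions than the ones we run.

```
# On the schedd (and the schedd of the CE); values as quoted in this thread.

# Keep security sessions valid for 24h.
SEC_DEFAULT_SESSION_LEASE = 86400

# Raised from the default of 5 minutes to 24h.
ALIVE_INTERVAL = 86400
```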

We also had to
remove the TimerRemove attribute to work around this bug

^^^ The bit with TimerRemove should not be necessary if using HTCondor v8.4.4 or above, so hoping your tests were with an earlier version (else we may have another bug to fix).

Again, we are not running v8.4.0 or above.