Re: [HTCondor-users] VM Suspend/Resume
- Date: Wed, 25 May 2016 17:05:50 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] VM Suspend/Resume
On 5/24/2016 4:13 PM, Laurence Field wrote:
Hi Laurence, thank you very much for this writeup. We are definitely
thinking about how we can make improvements and address the shortcomings
you identified. Meanwhile, some questions below...
> The CCB_HEARTBEAT_INTERVAL needs to be set to repair closed connections
> and we use 300s rather than the default of 20mins to speed up the repair.
^^^ The need for this setting is a little surprising. HTCondor should
notice asynchronously, at any time, if the TCP connection to the CCB
server is closed, so I am guessing that when your VM is resumed the
kernel does not know the TCP socket is dead? I.e., are TCP sockets that
were open at the time the VM was suspended still considered open upon VM
resume?
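
One way to check this on a Linux guest is to look at the kernel's own
socket table right after resume. The following is only a minimal sketch,
not part of HTCondor: it parses /proc/net/tcp (IPv4 only) and lists the
connections the kernel still reports as ESTABLISHED to a given remote
port. The helper names are hypothetical, and the port 9618 in the usage
comment is an assumption about where your CCB server listens.

```python
import socket
import struct

TCP_ESTABLISHED = "01"  # state code the Linux kernel prints in /proc/net/tcp


def parse_addr(hex_addr):
    """Turn an 'IIIIIIII:PPPP' field from /proc/net/tcp into (ip, port)."""
    ip_hex, port_hex = hex_addr.split(":")
    # The IPv4 address is printed as a little-endian 32-bit hex value.
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def established_to_port(lines, remote_port):
    """Return (local, remote) pairs still ESTABLISHED to remote_port."""
    found = []
    for line in lines:
        fields = line.split()
        # Data rows start with a slot number such as '0:'; skip the header.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        if fields[3] != TCP_ESTABLISHED:
            continue
        local, remote = parse_addr(fields[1]), parse_addr(fields[2])
        if remote[1] == remote_port:
            found.append((local, remote))
    return found


# Example use on the resumed VM (9618 is an assumed CCB server port):
#   with open("/proc/net/tcp") as table:
#       for local, remote in established_to_port(table, 9618):
#           print(local, "->", remote)
```

If the connection to the CCB server still shows up here after a long
suspend, the kernel has not yet noticed that the peer is gone, which
would be consistent with needing the shorter heartbeat.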
> The SEC_DEFAULT_SESSION_LEASE was set to 86400s so that the same ccbid
> will be used on reconnection.
^^^ This one is also surprising. The only way I can think that changing
SEC_DEFAULT_SESSION_LEASE would matter is if, upon resuming the VM, the
system clock still has the incorrect (old) time for some number of
seconds, and during that time HTCondor attempted to reconnect to CCB.
In other words, if your VM is suspended for 12 hours, when resuming your
VM, is the system clock immediately updated to the correct time before
processes can run, or is it possible the VM runs for a bit with a clock
showing 12 hours in the past? I don't know if the hypervisor typically
takes care of this or it is up to something in the VM (ntpd?) to
eventually resync the clock.
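
To find out, something along these lines could be left running inside
the VM across a suspend/resume cycle. It is a rough sketch under stated
assumptions, not an HTCondor tool: it compares how far the wall clock
moved against how far the monotonic clock moved between samples and
reports any large difference, i.e. the moment the wall clock was stepped.
The function names, the 1 second sampling interval and the 5 second
threshold are all hypothetical choices, and how the monotonic clock
behaves across a suspend depends on the hypervisor and guest.

```python
import time


def clock_step(prev_wall, prev_mono, now_wall, now_mono):
    """Seconds the wall clock moved beyond what the monotonic clock saw."""
    return (now_wall - prev_wall) - (now_mono - prev_mono)


def watch(interval=1.0, threshold=5.0, iterations=None,
          wall=time.time, mono=time.monotonic,
          sleep=time.sleep, report=print):
    """Sample both clocks and report wall-clock steps above threshold.

    Runs forever when iterations is None; returns the list of steps seen.
    """
    prev_wall, prev_mono = wall(), mono()
    steps = []
    count = 0
    while iterations is None or count < iterations:
        sleep(interval)
        now_wall, now_mono = wall(), mono()
        step = clock_step(prev_wall, prev_mono, now_wall, now_mono)
        if abs(step) > threshold:
            steps.append(step)
            report("wall clock stepped by %s s at wall time %s" % (step, now_wall))
        prev_wall, prev_mono = now_wall, now_mono
        count += 1
    return steps


# Example use: start before suspending the VM, read the output after resume.
#   watch()
```

If a step is reported some seconds after the first sample following
resume, the VM did run for a while with the old time; if it is reported
on the very first sample, the clock was corrected before processes ran.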
> On the Sched and Sched of the CE which we use, the
> SEC_DEFAULT_SESSION_LEASE is also set to 86400s (24h).
^^^ This seems like it is only required if
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION is not being used, which it is
by default starting with v8.5.1 of HTCondor. What version of HTCondor
were you using for your tests?
> ALIVE_INTERVAL on the sched was increased from the default of 5 mins to
^^^ This one does not seem like it should matter at all if you are
running HTCondor v8.4.0 or above... Did you observe problems without
this setting or did you just guess it was needed?
> We also had to remove the TimerRemove attribute to work around this bug
^^^ The bit with TimerRemove should not be necessary if using HTCondor
v8.4.4 or above, so hoping your tests were with an earlier version (else
we may have another bug to fix).
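
To make the version questions above concrete, here is a small sketch
that maps a version string to the workarounds I would still expect to be
needed. The thresholds are the ones mentioned in this message; the
function names and the simple numeric comparison are illustrative
assumptions, and the SEC_DEFAULT_SESSION_LEASE entry further assumes
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION is left at its default.

```python
# First version in which each workaround should no longer be needed,
# as stated in this message (not an authoritative compatibility table).
NOT_NEEDED_FROM = {
    "SEC_DEFAULT_SESSION_LEASE": (8, 5, 1),
    "ALIVE_INTERVAL": (8, 4, 0),
    "TimerRemove": (8, 4, 4),
}


def parse_version(text):
    """Parse '8.4.2' or 'v8.4.2' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def workarounds_still_expected(version):
    """Names of workarounds expected to matter for this version, sorted.

    Plain numeric ordering is an assumption; it ignores the split between
    stable and development release series.
    """
    current = parse_version(version)
    return sorted(name for name, minimum in NOT_NEEDED_FROM.items()
                  if current < minimum)


# Example use:
#   print(workarounds_still_expected("8.4.2"))
```

A non-empty result for the version you tested would explain why a
setting was needed; an empty one would suggest something else is going on.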
Thanks again very much, your testing and investigative work here has
Todd Tannenbaum <tannenba@xxxxxxxxxxx>   University of Wisconsin-Madison
Center for High Throughput Computing     Department of Computer Sciences
HTCondor Technical Lead                  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                    Madison, WI 53706-1685