Re: [condor-users] condor_shadow timeout when losing contact with startd



On Tue, Jan 27, 2004 at 01:59:22AM -0600, Derek Wright wrote:
> 
> however, as you've noticed, if the machine is simply powered off or
> the kernel crashes, the socket won't necessarily be closed (at least
> the submit machine end of it won't see it).  in this case, the shadow
> won't notice that the connection has been closed until the TCP stack's
> internal keep alives expire, usually 2 hours.  we do open this socket
> with SO_KEEPALIVE enabled, so at least it times out eventually.  :)
> 
> the good news is that because of some other changes we've made for the
> 6.7.x development series, we're starting to reconsider this.  so, it
> might not be too long before there's a version of condor that will
> have keep alives in the other direction, and you'd be able to
> configure the timeout that the submit machine uses before it gives up
> on a given execute machine.  for now, you're out of luck. :( our
> apologies, and sorry for the potential confusion this thread might
> have caused...
> 

You're not entirely out of luck: you can lower the system-wide TCP
keepalive timer to something smaller than 2 hours. On Linux, it's
controlled by the value (in seconds, 7200 by default) in
/proc/sys/net/ipv4/tcp_keepalive_time
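
To see how long a dead execute machine could actually go unnoticed, here is a
rough Python sketch. It assumes the usual Linux behaviour: after
tcp_keepalive_time seconds of idle time the kernel sends up to
tcp_keepalive_probes probes, tcp_keepalive_intvl seconds apart, before it
drops the connection. The function names are made up for illustration and
have nothing to do with Condor itself.

```python
from pathlib import Path

# Directory holding the Linux TCP keepalive sysctl files.
PROC_IPV4 = Path("/proc/sys/net/ipv4")


def read_keepalive_settings(base=PROC_IPV4):
    """Return (idle, interval, probes) as integers read from the sysctl files."""
    names = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")
    return tuple(int((Path(base) / name).read_text().strip()) for name in names)


def dead_peer_detection_seconds(idle, interval, probes):
    """Approximate worst-case time before a silent peer is declared dead.

    idle:     seconds of inactivity before the first probe
    interval: seconds between unanswered probes
    probes:   unanswered probes sent before the connection is dropped
    """
    return idle + interval * probes


if __name__ == "__main__":
    idle, interval, probes = read_keepalive_settings()
    total = dead_peer_detection_seconds(idle, interval, probes)
    print(f"idle={idle}s interval={interval}s probes={probes}")
    print(f"a vanished peer is detected after roughly {total}s")
```

With the common defaults (7200, 75 and 9) this works out to a bit over two
hours. To change the timer, write a new number of seconds to the file as
root; the setting applies to every socket on the machine that has
SO_KEEPALIVE enabled, not just the shadow's.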

-Erik
