
Re: [condor-users] condor_shadow timeout when loosing contact with startd



On 26 Jan 2004 16:26:16 -0600  Geoff Lovett wrote:

> So I'd like to get the two hours condor takes to requeue a job onto a
> new box when there's a failure down to maybe 20 minutes.  To reproduce
> the 2 hour timeout behaviour, I'm simply running a job then turning off
> the execute box (to simulate a crash).
> 
> Indeed, the STARTER_UPDATE_INTERVAL hasn't decreased the timeout.

sorry i didn't chime in sooner.  zach's been misleading everyone. :) i
think he's confusing this with the keep alive messages that the schedd
sends to the startd.  in that case, if the startd misses a few keep
alive messages, it will consider the schedd dead, kill the job, and
advertise itself as available for another job.  in all public
versions of condor, the startd will give up after missing 2 keep alive
messages.  in 6.6.1, you'll be able to configure how many keep alives
the startd will miss before it gives up on the schedd and kills the
job.
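
to make the "missed keep alives" rule concrete, here's a minimal
sketch of the receiving side's bookkeeping.  this is not condor's
code: the ClaimWatchdog class, its method names, and the 300 second
interval are all made up for illustration, and the exact way the
startd counts a "missed" message may well differ.

```python
class ClaimWatchdog:
    """Hypothetical model of a receiver that expects periodic keep alives."""

    def __init__(self, interval, max_missed):
        self.interval = interval      # seconds between expected keep alives (assumed)
        self.max_missed = max_missed  # how many may be missed before giving up
        self.last_alive = 0.0         # time the last keep alive arrived

    def on_keep_alive(self, now):
        # A keep alive arrived: remember when, which resets the countdown.
        self.last_alive = now

    def peer_is_dead(self, now):
        # The sender is presumed dead once more than max_missed intervals
        # have passed without hearing anything from it.
        return (now - self.last_alive) > self.interval * self.max_missed


watchdog = ClaimWatchdog(interval=300, max_missed=2)
watchdog.on_keep_alive(now=0)
print(watchdog.peer_is_dead(now=600))  # still within 2 intervals
print(watchdog.peer_is_dead(now=601))  # past 2 intervals: give up
```

note that this only works in one direction: the sender never learns
whether its keep alives are arriving.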

unfortunately, there's no keep alive message sent in the other
direction, nor any acknowledgement of the keep alive (it's just a UDP
packet).  the reason for this is that the shadow has a TCP connection
open to the starter running on the execute machine.  the assumption is
that if anything goes wrong with the execute machine, this socket will
be closed, the shadow will notice, and it can exit right away.  this
is true if the starter crashes, if the starter is killed, the machine
is rebooted, etc.

however, as you've noticed, if the machine is simply powered off or
the kernel crashes, the socket won't necessarily be closed (at least,
the submit machine's end of it won't see the close).  in this case,
the shadow won't notice that the connection is gone until the TCP
stack's internal keep alives expire, usually 2 hours.  we do open this
socket with SO_KEEPALIVE enabled, so at least it times out eventually.  :)
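
for reference, the 2 hours comes from the operating system, not from
condor: on linux the default idle time before the first keep alive
probe is 7200 seconds.  the sketch below shows what per-socket tuning
looks like at the OS level, assuming a linux-style TCP stack.  it is
only an illustration of the socket options; condor does not expose
this for the shadow's socket, and the function name and the default
values here are made up (picked to add up to roughly 20 minutes).

```python
import socket


def enable_fast_keepalive(sock, idle=600, interval=60, probes=10):
    """Turn on TCP keep alives and, where supported, shorten the timers.

    Returns the approximate worst-case number of seconds before a dead
    peer is detected with these settings.
    """
    # Portable part: ask the TCP stack to probe an idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket timers are platform specific (available on linux);
    # skip them if this platform's socket module lacks the constants.
    names = ("TCP_KEEPIDLE", "TCP_KEEPINTVL", "TCP_KEEPCNT")
    if all(hasattr(socket, name) for name in names):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        # Seconds between successive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        # Unanswered probes allowed before the connection is dropped.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

    return idle + interval * probes


s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
timeout = enable_fast_keepalive(s)
print(timeout)  # 600 + 60 * 10 = 1200 seconds, i.e. 20 minutes
s.close()
```

the system-wide defaults can also be lowered on the submit machine,
but that affects every socket that uses keep alives, not just the
shadow's.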

the good news is that because of some other changes we've made for the
6.7.x development series, we're starting to reconsider this.  so, it
might not be too long before there's a version of condor that will
have keep alives in the other direction, and you'd be able to
configure the timeout that the submit machine uses before it gives up
on a given execute machine.  for now, you're out of luck. :( our
apologies, and sorry for the potential confusion this thread might
have caused...

-derek


