Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [condor-users] condor_shadow timeout when loosing contact with startd

Date: 27 Jan 2004 10:28:06 -0600
From: Geoff Lovett <geoff.lovett@xxxxxxxxxxxxxxxxxxx>
Subject: Re: [condor-users] condor_shadow timeout when loosing contact with startd

Thanks for the info...  I may try to decrease the system-wide TCP keepalive
on my test box, but I'm not sure I want to use a low value on my
production boxes.  A Condor-internal keepalive would be most useful!

Thanks,
Geoff
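
For a sense of what lowering the system-wide setting would buy, the sketch
below estimates how long an idle TCP connection with SO_KEEPALIVE takes to
notice a peer that has silently gone away. It assumes Linux-style keepalive
settings (idle time before the first probe, interval between probes, number
of unanswered probes) and Linux's stock defaults of 7200 s, 75 s, and 9
probes; the thread does not say which OS is involved, and the function name
and the "tuned" values are purely illustrative.

```python
def keepalive_detection_seconds(idle: int, interval: int, probes: int) -> int:
    """Rough time for TCP keepalive to give up on a silent, idle peer.

    idle     -- seconds of inactivity before the first probe is sent
    interval -- seconds between successive unanswered probes
    probes   -- unanswered probes tolerated before the connection is dropped
    """
    return idle + interval * probes


# Assumed Linux defaults (tcp_keepalive_time, _intvl, _probes).
default_timeout = keepalive_detection_seconds(idle=7200, interval=75, probes=9)

# Hypothetical tuning aimed at the 20-minute target mentioned in the thread.
tuned_timeout = keepalive_detection_seconds(idle=900, interval=30, probes=10)

print(default_timeout, tuned_timeout)  # 7875 1200
```

Under those assumptions the default works out to a little over two hours,
which matches the behaviour reported in this thread.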

On Tue, 2004-01-27 at 01:59, Derek Wright wrote:
> On 26 Jan 2004 16:26:16 -0600  Geoff Lovett wrote:
> 
> > So I'd like to get the two hours condor takes to requeue a job onto a
> > new box when there's a failure down to maybe 20 minutes.  To reproduce
> > the 2 hour timeout behaviour, I'm simply running a job then turning off
> > the execute box (to simulate a crash).
> > 
> > Indeed, the STARTER_UPDATE_INTERVAL hasn't decreased the timeout.
> 
> sorry i didn't chime in sooner.  zach's been misleading everyone. :) i
> think he's confusing this with the keep alive messages that the schedd
> sends to the startd.  in that case, if the startd hasn't heard a few
> keep alive messages, it will consider the schedd dead, kill the job,
> and advertise itself as available for another job.  in all public
> versions of condor, the startd will give up after missing 2 keep alive
> messages.  in 6.6.1, you'll be able to configure how many keep alives
> the startd will miss before it gives up on the schedd and kills the
> job. 
> 
> unfortunately, there's no keep alive message sent in the other
> direction, nor any acknowledgement of the keep alive (it's just a UDP
> packet).  the reason for this is that the shadow has a TCP connection
> open to the starter running on the execute machine.  the assumption is
> that if anything goes wrong with the execute machine, this socket will
> be closed, the shadow will notice, and it can exit right away.  this
> is true if the starter crashes, if the starter is killed, the machine
> is rebooted, etc.
> 
> however, as you've noticed, if the machine is simply powered off or
> the kernel crashes, the socket won't necessarily be closed (at least,
> the submit machine's end of it won't see the close).  in this case, the
> shadow won't notice that the connection is gone until the TCP stack's
> internal keep alives expire, usually after about 2 hours.  we do open
> this socket with SO_KEEPALIVE enabled, so at least it times out
> eventually.  :)
> 
> the good news is that because of some other changes we've made for the
> 6.7.x development series, we're starting to reconsider this.  so, it
> might not be too long before there's a version of condor that will
> have keep alives in the other direction, and you'd be able to
> configure the timeout that the submit machine uses before it gives up
> on a given execute machine.  for now, you're out of luck. :( our
> apologies, and sorry for the potential confusion this thread might
> have caused...
> 
> -derek
> 
> 
> 
> Condor Support Information:
> http://www.cs.wisc.edu/condor/condor-support/
> To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
> unsubscribe condor-users <your_email_address>
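
The SO_KEEPALIVE behaviour described in the quoted message can, on some
platforms, be tuned per socket instead of system-wide. The following is a
generic sockets sketch, not Condor code, and nothing in this thread suggests
Condor exposed such a setting at the time. It assumes a platform that
provides TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT (Linux does) and
skips any option that is missing; enable_keepalive and its default values
are hypothetical.

```python
import socket


def enable_keepalive(sock, idle=900, interval=30, count=10):
    """Turn on TCP keepalive for one socket and tune it where supported.

    Returns a dict of the per-socket options that were actually applied.
    """
    # Portable part: ask the TCP stack to probe an idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    applied = {}
    # Platform-specific part: override the system-wide timers for this socket.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied[name] = value
    return applied


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
applied = enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

With the values shown, a dead peer on an otherwise idle connection would be
detected after roughly 900 + 30 * 10 = 1200 seconds, without touching the
defaults used by every other connection on the machine.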

