Re: [Condor-users] condor_shadow failed to detect the quickly job which cannot update Shadow
- Date: Thu, 30 Oct 2008 20:24:47 +0530
- From: Johnson koil Raj <johnson.raj@xxxxxxxxx>
- Subject: Re: [Condor-users] condor_shadow failed to detect the quickly job which cannot update Shadow
Thank you for your reply to my query about condor_shadow failing to
detect a job that has lost contact with the shadow.
You said there is a way to detect this by configuring Condor to send its
own keep-alive packets at a specified time interval. Can you please
tell me which configuration setting you were referring to?
Please help me resolve this issue.
On Wed, 2008-10-15 at 10:55 -0500, Todd Tannenbaum wrote:
> Johnson koil Raj wrote:
> > Hi
> > I have submitted a job in my Condor Pool. It is executing in X system.
> > Now I have stopped the Network at system X. But the Shadow in the
> > Submitter is not detecting it and still showing the Job as running.
> > can we make shadow not to wait for job like that and change status back
> > to idle.
> By default, Condor will detect the network failure and change the job
> status back to idle. Unfortunately, it could take up to a maximum of
> two hours to do so. The reason for this is in the above circumstance
> Condor uses TCP/IP KEEP ALIVE packets, and the standard for TCP/IP keep
> alives says it is allowable for a max of two hours to pass in between
> keep alive pings.
> So if it is "good enough" for Condor to take up to two hours to detect
> the network failure and mark the job as idle, then you are done, since
> this is what will happen by default.
> If it is not good enough, you can configure Condor to send its own keep
> alive packets at a time interval that you specify. So you could
> configure Condor to detect the above network failure within 10 minutes,
> for instance, at the cost of additional network traffic (for the pings)
> and overhead on the submit machine. If you need to know how to do this,
> just ask...
> Note we are talking here only about the situation of the execute machine
> falling off the network; i.e. a power failure of the execute node, or
> network failure, etc. If the execute machine is shut down, or rebooted,
> or Condor processes are killed, etc, then the shadow notices right away
> and the job switches back to idle.
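For reference, the operating-system side of the keep-alive mechanism described in the quoted reply can be sketched as below. This is only an illustration of generic TCP keep-alive tuning on a socket, assuming a Linux-style platform. It is not Condor's own keep-alive setting, whose name is not given in this thread. The helper names `enable_keepalive` and `worst_case_detection_seconds` and the timer values are hypothetical, chosen so that a silent peer would be noticed in roughly 10 minutes.

```python
import socket


def worst_case_detection_seconds(idle, interval, probes):
    """Idle wait before the first probe, plus the time for every probe to go unanswered."""
    return idle + interval * probes


def enable_keepalive(sock, idle=300, interval=60, probes=5):
    """Turn on TCP keep-alives and tune the timers where the platform exposes them."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These three options are available on Linux; other platforms may not
    # define them, so each one is applied only if the socket module has it.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    # Nonzero means keep-alive probing is enabled on this socket.
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    # 300 s idle + 5 probes * 60 s = 600 s, i.e. 10 minutes.
    print(worst_case_detection_seconds(300, 60, 5))
    s.close()
```

Shorter timers detect a vanished execute machine sooner, but at the cost of more probe traffic and more work on the submit machine, which is the trade-off mentioned in the quoted reply.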