Re: [Condor-users] condor_shadow failed to detect the quickly job which cannot update Shadow
- Date: Wed, 15 Oct 2008 10:55:12 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [Condor-users] condor_shadow failed to detect the quickly job which cannot update Shadow
Johnson koil Raj wrote:
> I have submitted a job in my Condor Pool. It is executing in X system.
> Now I have stopped the Network at system X. But the Shadow in the
> Submitter is not detecting it and still showing the Job as running.
> can we make shadow not to wait for job like that and change status back
By default, Condor will detect the network failure and change the job
status back to idle. Unfortunately, it could take up to roughly two
hours to do so. The reason is that in the above circumstance Condor
relies on TCP keepalive packets, and the TCP standard (RFC 1122) sets
the default idle time before the first keepalive probe at no less than
two hours, a default that most operating systems adopt.
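
To make the timing concrete, here is a minimal Python sketch of how an
application can turn on TCP keepalive for one socket and shorten the
timers. This is not Condor code; the function names and the example
values (600 s idle, 60 s between probes, 5 probes) are assumptions
chosen for illustration, and the per-socket timer options are only
applied on platforms where Python exposes them.

```python
import socket


def enable_keepalive(sock, idle_s=600, interval_s=60, probes=5):
    """Enable TCP keepalive on sock and tune the timers where supported.

    Returns a dict of the per-socket timer options that were applied.
    The default values are illustrative assumptions, not Condor settings.
    """
    # SO_KEEPALIVE is the portable switch; without further tuning the OS
    # default idle time (commonly two hours) applies.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    applied = {}
    wanted = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in wanted:
        opt = getattr(socket, name, None)  # not defined on every platform
        if opt is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
            applied[name] = value
        except OSError:
            pass  # the OS refused the option; keep its default
    return applied


def worst_case_detection_s(idle_s, interval_s, probes):
    """Approximate seconds from last traffic until the dead peer is noticed."""
    return idle_s + interval_s * probes
```

With the example values, worst_case_detection_s(600, 60, 5) gives 900
seconds, i.e. a silent peer would be noticed after about 15 minutes
instead of about two hours.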
So if it is "good enough" for Condor to take up to two hours to detect
the network failure and mark the job as idle, then you are done, since
this is what will happen by default.
If it is not good enough, you can configure Condor to send its own keep
alive packets at a time interval that you specify. So you could
configure Condor to detect the above network failure within 10 minutes,
for instance, at the cost of additional network traffic (for the pings)
and overhead on the submit machine. If you need to know how to do this,
let me know.
Note that we are talking here only about the situation where the execute
machine falls off the network, e.g. a power failure of the execute node,
a network failure, etc. If the execute machine is shut down or rebooted,
or the Condor processes are killed, etc., then the shadow notices right
away and the job switches back to idle.
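
The idea of application-level keep alives at a chosen interval can be
sketched in Python as follows. This is a generic illustration, not how
the shadow is implemented: HeartbeatMonitor, its parameters, and
heartbeats_per_day are hypothetical names, and the 200 second interval
with 3 allowed misses is just one way to reach a 10 minute detection
window.

```python
import time


class HeartbeatMonitor:
    """Track heartbeats from a peer and flag it after too long a silence.

    Hypothetical helper for illustration; not part of Condor.
    """

    def __init__(self, interval_s, max_missed=3, clock=time.monotonic):
        self.interval_s = interval_s    # expected seconds between heartbeats
        self.max_missed = max_missed    # missed heartbeats tolerated
        self._clock = clock             # injectable so it can be tested
        self._last_seen = clock()

    def record_heartbeat(self):
        """Call whenever a keep alive (or any message) arrives from the peer."""
        self._last_seen = self._clock()

    def peer_presumed_dead(self):
        """True once the silence exceeds interval_s * max_missed seconds."""
        silence = self._clock() - self._last_seen
        return silence > self.interval_s * self.max_missed


def heartbeats_per_day(interval_s):
    """Extra messages per connection per day: the cost side of the trade-off."""
    return 86400 // interval_s


if __name__ == "__main__":
    # 200 s interval with 3 allowed misses: detection within 10 minutes.
    monitor = HeartbeatMonitor(interval_s=200, max_missed=3)
    monitor.record_heartbeat()
    print(monitor.peer_presumed_dead())
    print(heartbeats_per_day(200))
```

A shorter interval detects a vanished execute machine sooner but sends
more pings per running job, which is the traffic and submit-machine
overhead mentioned above. A clean shutdown needs none of this, because
the closed connection is visible to the other side immediately.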
Todd Tannenbaum University of Wisconsin-Madison
Condor Project Research Department of Computer Sciences
tannenba@xxxxxxxxxxx 1210 W. Dayton St. Rm #4257