Re: [Condor-users] condor_shadow failed to detect the quickly job which cannot update Shadow



Johnson koil Raj wrote:
Hi

  I have submitted a job in my Condor Pool. It is executing in X system.
Now I have stopped the Network at  system X. But the Shadow in the
Submitter is not detecting it and still showing the Job as running.

Can we make the shadow stop waiting for a job like that and change its status back
to idle?

By default, Condor will detect the network failure and change the job status back to idle. Unfortunately, it can take up to two hours to do so. The reason is that in this circumstance Condor relies on TCP/IP keep-alive packets, and the TCP/IP standard allows up to two hours to pass between keep-alive pings.
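To illustrate the keep-alive mechanism described above, here is a minimal Python sketch of enabling TCP keep-alives on a socket and estimating the worst-case detection time. This is not Condor's code; the function names and the default values for `idle_s`, `interval_s`, and `probes` are assumptions chosen for illustration, and the per-socket tuning options are only applied on platforms that expose them.

```python
import socket


def worst_case_detection_seconds(idle_s, interval_s, probes):
    """Rough upper bound: idle time before the first probe, plus
    one probe interval for each unanswered probe."""
    return idle_s + interval_s * probes


def enable_keepalive(sock, idle_s=600, interval_s=60, probes=5):
    """Turn on TCP keep-alives for sock (hypothetical helper).

    SO_KEEPALIVE is widely available. The tuning options below are
    platform-specific, so each one is set only if this Python build
    exposes it; otherwise the operating system defaults apply.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

With the common Linux defaults (7200 seconds idle, 75 seconds between probes, 9 probes), `worst_case_detection_seconds(7200, 75, 9)` comes to a bit over two hours, which matches the delay described above.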

So if it is "good enough" for Condor to take up to two hours to detect the network failure and mark the job as idle, then you are done, since this is what will happen by default.

If it is not good enough, you can configure Condor to send its own keep-alive packets at a time interval that you specify. So you could configure Condor to detect the above network failure within 10 minutes, for instance, at the cost of additional network traffic (for the pings) and overhead on the submit machine. If you need to know how to do this, just ask...
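The general idea of application-level keep-alives with a configurable timeout can be sketched as follows. This is a generic illustration, not Condor's implementation or its configuration; the class name `HeartbeatMonitor` and its methods are hypothetical.

```python
import time


class HeartbeatMonitor:
    """Tracks when a peer was last heard from (hypothetical sketch).

    The peer is considered dead once no heartbeat has arrived for
    longer than timeout_s. The clock is injectable so the logic can
    be exercised without actually waiting.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_seen = clock()

    def record_heartbeat(self):
        # Call this whenever a keep-alive message arrives from the peer.
        self._last_seen = self._clock()

    def peer_is_alive(self):
        # True while the silence has not exceeded the timeout.
        return self._clock() - self._last_seen <= self.timeout_s
```

A shorter timeout detects a vanished peer sooner, but the peer has to send heartbeats more often, which is the traffic and overhead trade-off mentioned above.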

Note that we are talking here only about the situation where the execute machine falls off the network, e.g. a power failure on the execute node, a network failure, etc. If the execute machine is shut down or rebooted, or the Condor processes are killed, then the shadow notices right away and the job switches back to idle.

regards,
Todd

--
Todd Tannenbaum                       University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
tannenba@xxxxxxxxxxx                  1210 W. Dayton St. Rm #4257