
Re: [HTCondor-users] Increasing shadow->schedd timeout



On 2/22/2016 12:19 PM, Vladimir Brik wrote:
Hello.

We are having issues with our network filesystem that cause
condor_schedd and condor_shadow to sometimes hang for long periods of
time (I suspect when they try to update job logs), which I think causes
unnecessary job restarts.

Is it possible to increase the timeout for shadow->schedd connections?

Yes. You will want to use the configuration knob SHADOW_NOT_RESPONDING_TIMEOUT. See the entries below, copied from section 3.3 of the HTCondor Manual.

best regards,
Todd


NOT_RESPONDING_TIMEOUT
When an HTCondor daemon's parent process is another HTCondor daemon, the child daemon will periodically send a short message to its parent stating that it is alive and well. If the parent does not hear from the child for a while, the parent assumes that the child is hung, kills the child, and restarts the child. This parameter controls how long the parent waits before killing the child. It is defined in terms of seconds and defaults to 3600 (1 hour). The child sends its alive and well messages at an interval of one third of this value.
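
To make the arithmetic in the description above concrete, here is a minimal Python sketch of the timing relationship. It is only a model of the documented behavior, not HTCondor code: the function names are hypothetical, and the exact comparison the parent uses (strictly greater than versus greater than or equal) is an assumption.

```python
# Model of the keepalive timing described in the manual entry.
# All names here are illustrative; none of them come from HTCondor itself.

DEFAULT_TIMEOUT = 3600  # seconds, the documented default (1 hour)


def keepalive_interval(timeout: int = DEFAULT_TIMEOUT) -> int:
    """Interval at which the child sends its alive message: one third of the timeout."""
    return timeout // 3


def parent_considers_hung(last_heard: float, now: float,
                          timeout: int = DEFAULT_TIMEOUT) -> bool:
    """True once the parent has heard nothing for longer than the timeout.

    Assumption: a strict '>' comparison; the real daemon may differ.
    """
    return (now - last_heard) > timeout


if __name__ == "__main__":
    print(keepalive_interval())        # interval with the default timeout
    print(keepalive_interval(7200))    # interval if the timeout is doubled
```

With the default of 3600 seconds the child reports every 1200 seconds, so under this model it would have to miss roughly three consecutive reports before the parent kills it.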

<SUBSYS>_NOT_RESPONDING_TIMEOUT
Identical to NOT_RESPONDING_TIMEOUT, but controls the timeout for a specific type of daemon. For example, SCHEDD_NOT_RESPONDING_TIMEOUT controls how long the condor_schedd's parent daemon will wait without receiving an alive and well message from the condor_schedd before killing it.
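
For the shadow case asked about above, a configuration fragment might look like the following sketch. The value 7200 is an arbitrary example, not a recommendation, and it is an assumption that the setting belongs in the configuration of the machine running the condor_schedd, since that is the shadow's parent daemon.

```
# Example value only: allow a condor_shadow to go up to 2 hours without
# sending an alive message before its parent treats it as hung.
# Per the manual entry, the shadow would then report every 7200 / 3 = 2400 seconds.
SHADOW_NOT_RESPONDING_TIMEOUT = 7200
```

The daemons need to reread their configuration before a change like this takes effect, for example with condor_reconfig; whether this particular knob also requires a restart is not stated here.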