
Re: [HTCondor-users] Socket between submit and execute hosts closed unexpectedly



On 5/23/2014 11:29 AM, Szabolcs Horvátth wrote:
Hi,

When the socket between the submit and execute hosts is terminated and
they can't reconnect, which attributes specify how many times the
connection is retried and how long the delay between attempts is?
Is it JobLeaseDuration and MAX_CLAIM_ALIVES_MISSED?


The condor_shadow process on the submit host will keep trying to reconnect to the condor_starter on the execute host for the number of seconds specified by job_lease_duration in the submit file; the default is 20 minutes. See http://goo.gl/XxqlN5

As for how long the delay between attempts is, that is controlled via the (undocumented!) condor_config knobs RECONNECT_BACKOFF_CEILING (default value of 300) and RECONNECT_BACKOFF_FACTOR (default value of 2.0). The shadow will try to reconnect immediately, and then will do an exponential backoff as specified by the backoff factor until it reaches the ceiling value (try again after a 4-second delay, then 8, 16, 32, ..., until it reaches 300 seconds, at which point it will try every 300 seconds until the job_lease_duration expires).
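
To make the schedule above concrete, here is a rough Python sketch that lists the delays between reconnect attempts. It is only an illustration of the description in this message, not the shadow's actual code: the function name reconnect_delays and the first_delay parameter are made up for the example, the 4-second starting delay is taken from the sequence quoted above, and the exact moment the shadow gives up may differ in practice.

```python
def reconnect_delays(lease_duration=1200, factor=2.0, ceiling=300, first_delay=4):
    """Return the waits (in seconds) between reconnect attempts.

    Hypothetical helper for illustration only. Defaults mirror the values
    mentioned in this thread: job_lease_duration of 20 minutes,
    RECONNECT_BACKOFF_FACTOR of 2.0 and RECONNECT_BACKOFF_CEILING of 300.
    The immediate first attempt (no wait) is not included in the list.
    """
    delays = []
    elapsed = 0.0
    delay = float(first_delay)
    # Keep scheduling attempts while the next one still falls inside the lease.
    while elapsed + delay <= lease_duration:
        delays.append(delay)
        elapsed += delay
        # Exponential backoff, capped at the ceiling.
        delay = min(delay * factor, float(ceiling))
    return delays


if __name__ == "__main__":
    schedule = reconnect_delays()
    print(schedule)
    print("total seconds spent waiting:", sum(schedule))
```

Under these assumptions, the waits grow from 4 seconds up to the 300-second ceiling and then repeat at 300 seconds until the next attempt would fall outside the lease.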

Hope the above helps,
Todd