
Re: [HTCondor-users] Socket between submit and execute hosts closed unexpectedly



Hi,

I started seeing these disconnections lately and I'm not sure where to search for the cause. I checked the shadow, starter and scheduler logs for suspicious messages but haven't found any. How could I find what's causing the connection issue? Or are there any typical sources of this problem?

Cheers,
Szabolcs


The condor_shadow process on the submit host will keep trying to reconnect to the condor_starter on the execute host for the number of seconds specified by job_lease_duration in the submit file; the default is 20 minutes.  See http://goo.gl/XxqlN5

As for the delay between attempts, that is controlled via the (undocumented!) condor_config knobs RECONNECT_BACKOFF_CEILING (default value of 300) and RECONNECT_BACKOFF_FACTOR (default value of 2.0).  The shadow will try to reconnect immediately, and then will do an exponential backoff as specified by the backoff factor until the delay reaches the ceiling value (try again after a 4 second delay, then 8, 16, 32, ..., until the delay reaches 300 seconds, at which point it will try every 300 seconds until the job_lease_duration expires).
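
To make the schedule above concrete, here is a rough Python sketch that lists when reconnect attempts would happen under the values mentioned in this message. It is only an illustration, not HTCondor code: the function name and parameters are made up for this example, the 4 second first delay is taken from the example sequence above, and the time spent on each connection attempt itself is ignored.

```python
def reconnect_schedule(lease_duration=1200.0, first_delay=4.0,
                       factor=2.0, ceiling=300.0):
    """Return the times (seconds after the disconnect) of each reconnect attempt.

    Hypothetical model of the behaviour described above:
      lease_duration -- job_lease_duration, 20 minutes as stated in this message
      first_delay    -- assumed from the "4, 8, 16, 32, ..." example
      factor         -- RECONNECT_BACKOFF_FACTOR
      ceiling        -- RECONNECT_BACKOFF_CEILING
    """
    attempts = [0.0]          # the shadow tries to reconnect immediately
    delay = first_delay
    elapsed = 0.0
    # Keep scheduling attempts while the next one still falls inside the lease.
    while elapsed + delay <= lease_duration:
        elapsed += delay
        attempts.append(elapsed)
        # Grow the delay exponentially, but never beyond the ceiling.
        delay = min(delay * factor, ceiling)
    return attempts


if __name__ == "__main__":
    for t in reconnect_schedule():
        print(f"attempt at t = {t:.0f} s")
```

Under these assumptions the delay is capped at 300 seconds after the 256 second step, so only the last few attempts before the lease expires are spaced by the ceiling value.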

Hope the above helps,
Todd