[HTCondor-users] Increasing shadow->schedd timeout



Hello.

We are having issues with our network filesystem that cause condor_schedd and condor_shadow to sometimes hang for long periods (I suspect when they try to update job logs). I think this leads to unnecessary job restarts.

Is it possible to increase the timeout for shadow->schedd connections? We are using 8.3.8.

ShadowLog contains entries like:
attempt to connect to <172.16.223.61:49753> failed: Connection timed out (connect errno = 110). Will keep trying for 300 total seconds (237 to go).
attempt to connect to <172.16.223.61:49753> failed: Connection timed out (connect errno = 110).
Can't connect to queue manager: CEDAR:6001:Failed to connect to <172.16.223.61:49753>

SchedLog contains entries like:
ERROR: Child pid 30961 appears hung! Killing it hard.
Shadow pid 30961 successfully killed because it was hung.
Shadow pid 30961 for job 101118639.0 exited with status 4
ERROR: Shadow exited with job exception code!




Thanks,

Vlad