
[HTCondor-users] network loss between schedds and worker nodes


We currently have JobLeaseDuration set to 7200 so that, if a machine running a schedd needs to be rebooted or is down for some other reason, we have some time before jobs are lost (the default wasn't long enough for us). This works fine: we clearly see the "Attempting to locate disconnected starter..." and "Reconnect SUCCESS: connection re-established" messages in ShadowLog.
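For reference, the lease is set per job in the submit description file (and, to my understanding, can also be defaulted cluster-wide via the JOB_DEFAULT_LEASE_DURATION configuration macro); a minimal sketch:

```
# Submit description file fragment (sketch): extend the job lease so a
# disconnected job survives up to two hours before being abandoned.
job_lease_duration = 7200
```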

However, today we had our first experience of a network break between two schedds and all of the worker nodes. It lasted about 30 minutes. Two hours after the break began, we saw lots of messages like this (*) in ShadowLog, and the running jobs therefore failed from the users' point of view. It seems that the shadows didn't notice that the network had returned, and they all exited once the JobLeaseDuration had expired.

I assume that if we had done a "service condor restart" on the schedd machines after networking was restored, everything would have been fine, i.e. new shadows would have been created which would then connect to the still-running starters. In general, though, is there any way to ensure that a shadow detects the loss and return of its connection to a starter and reconnects automatically, provided some time limit hasn't been exceeded, of course?


12/12/13 16:52:05 (281379.0) (9670): condor_read() failed: recv(fd=3) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from starter at <>.
12/12/13 16:52:05 (281379.0) (9670): IO: Failed to read packet header
12/12/13 16:52:05 (281379.0) (9670): Can no longer talk to condor_starter <>
12/12/13 16:52:05 (281379.0) (9670): JobLeaseDuration remaining: EXPIRED!
12/12/13 16:52:05 (281379.0) (9670): Reconnect FAILED: Job disconnected too long: JobLeaseDuration (7200 seconds) expired
12/12/13 16:52:05 (281379.0) (9670): **** condor_shadow (condor_SHADOW) pid 9670 EXITING WITH STATUS 107