
Re: [HTCondor-users] network loss between schedds and worker nodes



Hi Andrew,

I poked around and saw nothing obvious that could cause it.  condor_read's error message doesn't log the timeout used, but I don't see any way that the call could have blocked for multiple hours (hence preventing the lease renewal messages from going out the door).

Can you reveal more of the ShadowLog?  What other messages do you see?
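If it helps with collecting those messages, here is a minimal sketch that pulls every ShadowLog line for one job. It assumes each line has the layout seen in the excerpt quoted below (timestamp, then "(cluster.proc)", then "(pid):", then the message). Lines in any other layout are skipped, so a real ShadowLog may contain relevant entries this misses. The names shadow_lines and ShadowLine are made up for illustration and are not part of HTCondor.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Assumed line layout, inferred from the quoted excerpt only:
#   MM/DD/YY HH:MM:SS (cluster.proc) (pid): message
LINE_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): (?P<msg>.*)$"
)


class ShadowLine(NamedTuple):
    """One parsed log entry (hypothetical helper type)."""
    stamp: str
    job: str
    pid: int
    msg: str


def shadow_lines(lines: Iterable[str],
                 job: Optional[str] = None) -> Iterator[ShadowLine]:
    """Yield parsed entries, optionally restricted to one job ID.

    Lines that do not match the assumed layout are silently skipped.
    """
    for raw in lines:
        m = LINE_RE.match(raw.rstrip("\n"))
        if m is None:
            continue
        if job is not None and m.group("job") != job:
            continue
        yield ShadowLine(m.group("stamp"), m.group("job"),
                         int(m.group("pid")), m.group("msg"))
```

Opening the ShadowLog and iterating over shadow_lines(fh, job="281379.0") would then give the full history for that job, including whatever the shadow logged between the start of the break and the lease expiry.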

Can you replicate this under "laboratory conditions"?

Brian

On Dec 12, 2013, at 4:14 PM, andrew.lahiff@xxxxxxxxxx wrote:

> Hi,
> 
> We currently have JobLeaseDuration set to 7200 so that if a machine running a schedd needs to be rebooted or is down for other reasons we have some time before jobs would be lost (the default wasn't long enough). This works fine and we clearly see the "Attempting to locate disconnected starter..." and "Reconnect SUCCESS: connection re-established" messages in ShadowLog.
> 
> However, today we had our first experience of a network break between two schedds and all the worker nodes. It lasted about 30 minutes. Two hours after the beginning of the break, we had lots of messages like this (*) in ShadowLog, and the running jobs therefore failed from the users' point of view. It seems that the shadows didn't notice that the network had returned, and then they all exited after the JobLeaseDuration had expired.
> 
> I assume that if we had done a "service condor restart" on the schedd machines after networking was restored then everything would have been fine, i.e. new shadows would have been created which would then connect to the starters. In general though, is there any way to ensure that a shadow will detect the loss and return of the connection to a starter and automatically reconnect, provided some time limit hasn't been exceeded of course?
> 
> Thanks,
> Andrew.
> 
> (*)
> 12/12/13 16:52:05 (281379.0) (9670): condor_read() failed: recv(fd=3) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from starter at <130.246.219.76:54959>.
> 12/12/13 16:52:05 (281379.0) (9670): IO: Failed to read packet header
> 12/12/13 16:52:05 (281379.0) (9670): Can no longer talk to condor_starter <130.246.219.76:54959>
> 12/12/13 16:52:05 (281379.0) (9670): JobLeaseDuration remaining: EXPIRED!
> 12/12/13 16:52:05 (281379.0) (9670): Reconnect FAILED: Job disconnected too long: JobLeaseDuration (7200 seconds) expired
> 12/12/13 16:52:05 (281379.0) (9670): **** condor_shadow (condor_SHADOW) pid 9670 EXITING WITH STATUS 107