
[HTCondor-users] dealing with bad exec nodes



Hi all,

Today we had a node that developed hardware issues (we don't yet know what) while running jobs. From the shadow log:

ShadowLog.old:11/04/16 10:16:31 (8216.19) (2590542): Request to run on slot3@xxxxxxxxxxxxxxxxxxxxxxxx-research.com <10.40.241.131:9618?addrs=10.40.241.131-9618&noUDP&sock=6341_4a87_3> was ACCEPTED
ShadowLog.old:11/04/16 10:16:31 (8216.19) (2590542): File transfer completed successfully.

We shut down the exec node at 12:07, then:

ShadowLog.old:11/04/16 12:07:51 (8216.19) (2590542): condor_read() failed: recv(fd=3) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd slot3@xxxxxxxxxxxxxxxxxxxxxxxx-research.com.
ShadowLog.old:11/04/16 12:07:51 (8216.19) (2590542): IO: Failed to read packet header
ShadowLog.old:11/04/16 12:07:51 (8216.19) (2590542): Can no longer talk to condor_starter <10.40.241.131:9618>
ShadowLog.old:11/04/16 12:07:51 (8216.19) (2590542): Trying to reconnect to disconnected job
ShadowLog.old:11/04/16 12:07:51 (8216.19) (2590542): LastJobLeaseRenewal: 1478275311 Fri Nov 4 12:01:51 2016
ShadowLog.old:11/04/16 12:07:51 (8216.19) (2590542): JobLeaseDuration: 2400 seconds
ShadowLog.old:11/04/16 12:07:51 (8216.19) (2590542): JobLeaseDuration remaining: 2040
ShadowLog.old:11/04/16 12:07:51 (8216.19) (2590542): Attempting to locate disconnected starter
ShadowLog.old:11/04/16 12:07:51 (8216.19) (2590542): condor_read() failed: recv(fd=3) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from startd slot3@xxxxxxxxxxxxxxxxxxxxxxxx-research.com.
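
For reference, the lease numbers in the log above are self-consistent, and they suggest how long the shadow might keep trying to reconnect. The following is only a minimal sketch of that arithmetic in Python, using the timestamps copied from the log lines; the variable names are illustrative, and the assumption that the shadow gives up once the lease expires is an interpretation of the log, not something the log states.

```python
from datetime import datetime, timedelta

# Timestamps copied from the ShadowLog excerpt (local time on the submit host).
LOG_FORMAT = "%m/%d/%y %H:%M:%S"
last_renewal = datetime.strptime("11/04/16 12:01:51", LOG_FORMAT)  # LastJobLeaseRenewal
disconnect = datetime.strptime("11/04/16 12:07:51", LOG_FORMAT)    # connection reset seen

# JobLeaseDuration as reported by the shadow, in seconds.
lease_duration = 2400

# Seconds of the lease already used up when the connection dropped.
elapsed = int((disconnect - last_renewal).total_seconds())

# Should match the "JobLeaseDuration remaining" line in the log.
remaining = lease_duration - elapsed

# Assumption: reconnect attempts continue until the lease runs out.
lease_expiry = last_renewal + timedelta(seconds=lease_duration)

print(f"elapsed={elapsed}s remaining={remaining}s")
print("lease expires at", lease_expiry.strftime("%H:%M:%S"))
```

Under that assumption, the job would stay in the Running state for roughly another 34 minutes after the node was shut down.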

My question is: what can we configure on the hosts to deal with startds that (seemingly) go silent? I've found various old posts on the list suggesting that there was talk of automatically moving jobs from Running to Idle after exec nodes fail; did anything come of that?

Or is the standard way of dealing with this to use the system_periodic_* macros?

Thanks,
Jon