
[HTCondor-users] condor_startd restart on some nodes randomly



Hello Experts,

We are running the development version 8.5.8 in 3 pools. Since it is a very old, unsupported version, we are working on upgrading to 8.8.5. This version had been working fine for a very long time until recently, when we started seeing the condor_startd process restart randomly on some of the nodes. During troubleshooting, the only common factor we found is that most of the affected nodes were running jobs submitted from a schedd that was exhibiting a very high load average; however, jobs from that schedd were distributed everywhere, so we are not sure whether this is the only cause.
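In case it helps quantify the schedd load: a DaemonCore duty cycle close to 1.0 means the schedd is saturated. Something along these lines should report it (RecentDaemonCoreDutyCycle is my assumption for the attribute name; please correct me if 8.5.x publishes it differently):

# duty cycle near 1.0 = schedd saturated
condor_status -schedd -af Name RecentDaemonCoreDutyCycle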

These messages were reported both on nodes where condor_startd restarted and on nodes where it did not:

07/07/20 01:10:55 condor_write(): Socket closed when trying to write 13 bytes to , fd is 15
07/07/20 01:10:55 Buf::write(): condor_write() failed
07/07/20 01:10:55 SharedPortEndpoint: failed to send final status (success) for SHARED_PORT_PASS_SOCK
07/07/20 01:10:55 condor_write(): Socket closed when trying to write 286 bytes to <10.10.10.11:52390>, fd is 16
07/07/20 01:10:55 Buf::write(): condor_write() failed
07/07/20 01:10:55 SECMAN: Error sending response classad to <10.10.10.11:52390>!
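If I understand the shared port mechanism correctly, condor_shared_port hands each accepted connection to the startd's SharedPortEndpoint over a named socket, and the endpoint writes a status back; the failed writes above suggest the peer went away mid-handshake. As a sanity check we can at least confirm the named sockets exist (DAEMON_SOCKET_DIR is the knob I believe controls their location; treat that as an assumption):

# list the per-daemon named sockets used for passing connections
ls -l $(condor_config_val DAEMON_SOCKET_DIR)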

We have also seen the following kinds of messages in the SharedPortLog file:

07/07/20 01:08:03 SharedPortClient - server response deadline has passed for 12881_b8dd_3 as requested by TOOL on <10.10.10.11:32637>
07/07/20 01:08:03 SharedPortClient - server response deadline has passed for 12881_b8dd_3 as requested by TOOL on <10.10.10.11:7496>
07/07/20 01:08:04 SharedPortClient - server response deadline has passed for 12881_b8dd_3 as requested by SCHEDD <10.10.10.12:9618?addrs=10.10.10.12-9618&noUDP&sock=7179_16e5_3> on <10.10.10.12:47015>
07/07/20 01:08:35 SharedPortClient - server response deadline has passed for 12881_b8dd_3 as requested by SCHEDD <10.10.10.12:9618?addrs=10.10.10.12-9618&noUDP&sock=9654_bbcb_3> on <10.10.10.12:35369>
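To get more detail the next time this happens, we are planning to raise the shared port daemon's debug level via the usual per-subsystem knob (assuming SHARED_PORT_DEBUG follows the standard <SUBSYS>_DEBUG pattern):

# add to condor_config on the execute nodes, then condor_reconfig
SHARED_PORT_DEBUG = D_FULLDEBUG D_COMMAND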

These messages were reported on all nodes during the issue, regardless of whether condor_startd restarted there. I cannot figure out why condor_startd restarted only on a few of the nodes showing these messages and not on all of them. condor_who showed the following error at the time of the issue, and the problematic nodes were not reporting their status back to the HTCondor collector/negotiator (e.g. they were missing from condor_status -compact). I believe the restart happened only on the nodes that were not showing up in condor_status -compact. Any thoughts on this issue, what is causing it, and how HTCondor decides to restart the service would be very helpful.

# condor_who
Error: communication error
SECMAN:2007:Failed to end classad message.
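To identify the affected nodes, we have been comparing master ads against startd ads in the collector, roughly like this (assuming one machine per line from the autoformat output):

condor_status -master -af Machine | sort -u > masters.txt
condor_status -startd -af Machine | sort -u > startds.txt
# masters present but startd ads missing = candidates for the restart
comm -23 masters.txt startds.txt

My guess is that the condor_master is restarting a startd it considers hung (missed keepalives), in which case NOT_RESPONDING_TIMEOUT would be the relevant knob, but that is only a guess on my part.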

Thanks & Regards,
Vikrant Aggarwal