07/07/20 02:16:58 ERROR: Child pid 4137244 appears hung! Killing it hard.
07/07/20 02:16:58 DefaultReaper unexpectedly called on pid 4137244, status 9.
07/07/20 02:16:58 The STARTD (pid 4137244) was killed because it was no longer responding
07/07/20 02:30:33 Sending obituary for "/usr/sbin/condor_startd"
07/07/20 02:30:33 restarting /usr/sbin/condor_startd in 10 seconds
07/07/20 02:30:49 DaemonCore: Can't receive command request from <IP ADDRESS> (perhaps a timeout?)
07/07/20 02:30:49 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 145918
07/07/20 02:31:00 Setting ready state 'Ready' for STARTD
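For context, the MasterLog lines above are the condor_master's hung-child handling: the master kills a managed daemon that
has stopped answering its keepalives and then restarts it after a backoff. A minimal sketch of the knobs that govern this,
with the documented defaults as I understand them (illustrative values, not our actual pool config):

# Seconds the master waits for a child daemon's keepalive before the child
# is declared hung and killed ("appears hung! Killing it hard").
NOT_RESPONDING_TIMEOUT = 3600
# Optional per-daemon override for the startd.
STARTD_NOT_RESPONDING_TIMEOUT = 3600

# Restart backoff after a daemon dies (the "restarting ... in 10 seconds" line).
MASTER_BACKOFF_CONSTANT = 9
MASTER_BACKOFF_FACTOR   = 2.0
MASTER_BACKOFF_CEILING  = 3600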
Thanks & Regards,
Vikrant Aggarwal
On Tue, Jul 7, 2020 at 8:38 PM Zach Miller <zmiller@xxxxxxxxxxx> wrote:
Hello,
Is the condor_startd getting restarted by the condor_master on that machine? What does it say in the MasterLog? Or
is all of HTCondor getting restarted by systemd or something?
Also, you can get more debugging information from tools, like condor_who, by:
1) Set environment variable _CONDOR_TOOL_DEBUG=D_ALL:2
2) Use the "-debug" flag, like: "condor_who -debug"
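For example, combining both (the output file path is just an example):

# Run condor_who with full tool-level debugging and capture everything
# (normal output plus the debug messages) in one file to share on the list.
_CONDOR_TOOL_DEBUG=D_ALL:2 condor_who -debug > /tmp/condor_who_debug.txt 2>&1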
Hopefully something in the output from the above ideas will give us a clue. Thanks!
Cheers,
-zach
On 7/7/20, 5:49 AM, "HTCondor-users on behalf of ervikrant06@xxxxxxxxx"
<htcondor-users-bounces@xxxxxxxxxxx on behalf of ervikrant06@xxxxxxxxx> wrote:
  Hello Experts,
  We are running development version 8.5.8 in 3 pools. Since it is a very old, unsupported version, we are working on
upgrading to 8.8.5. This version had been working fine for a very long time, but recently we have seen the condor_startd
process getting restarted at random on some of the nodes. During troubleshooting, the only common factor we found is that
most of the affected nodes were running jobs submitted from a schedd that was exhibiting a very high load average; however,
jobs from that schedd were distributed across all nodes, so we are not sure whether that is the actual cause.
  The following messages are reported both on nodes where condor_startd restarted and on nodes where it did not:
  07/07/20 01:10:55 condor_write(): Socket closed when trying to write 13 bytes to , fd is 15
  07/07/20 01:10:55 Buf::write(): condor_write() failed
  07/07/20 01:10:55 SharedPortEndpoint: failed to send final status (success) for SHARED_PORT_PASS_SOCK
  07/07/20 01:10:55 condor_write(): Socket closed when trying to write 286 bytes to <10.10.10.11:52390>, fd is 16
  07/07/20 01:10:55 Buf::write(): condor_write() failed
  07/07/20 01:10:55 SECMAN: Error sending response classad to <10.10.10.11:52390>!
  We have also seen the following kind of messages in the SharedPortLog file:
  07/07/20 01:08:03 SharedPortClient - server response deadline has passed for 12881_b8dd_3 as requested by TOOL on <10.10.10.11:32637>
  07/07/20 01:08:03 SharedPortClient - server response deadline has passed for 12881_b8dd_3 as requested by TOOL on <10.10.10.11:7496>
  07/07/20 01:08:04 SharedPortClient - server response deadline has passed for 12881_b8dd_3 as requested by SCHEDD <10.10.10.12:9618?addrs=10.10.10.12-9618&noUDP&sock=7179_16e5_3> on <10.10.10.12:47015>
  07/07/20 01:08:35 SharedPortClient - server response deadline has passed for 12881_b8dd_3 as requested by SCHEDD <10.10.10.12:9618?addrs=10.10.10.12-9618&noUDP&sock=9654_bbcb_3> on <10.10.10.12:35369>
  These messages are reported on all nodes during the issue, regardless of whether condor_startd restarted there. I can't
figure out why condor_startd restarted on only a few of the nodes showing these messages and not on all of them.
condor_who showed the following error at the time of the issue, and the problematic nodes were not reporting their status
back to the HTCondor collector/negotiator (e.g. they were missing from condor_status -compact). I believe the restart
happened only on nodes that were not reporting in condor_status -compact. Any thoughts on what is causing this, and on how
HTCondor decided to restart the service, would be very helpful.
  # condor_who
  Error: communication error
  SECMAN:2007:Failed to end classad message.
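  A sketch of how one might cross-check which startds the collector has recently heard from (Machine, State and
LastHeardFrom are standard machine-ad attributes; the node name below is just a placeholder):

  # Show every startd the collector knows about and when it last reported.
  condor_status -startd -af:h Machine State LastHeardFrom

  # Check whether a specific (suspected) node is present at all.
  condor_status -compact -constraint 'Machine == "node01.example.com"'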
  Thanks & Regards,
  Vikrant Aggarwal
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/