
Re: [HTCondor-users] condor_startd restart on some nodes randomly



Many thanks for your reply, Todd.

Neither scenario seems to apply in this case. Do you think a submitter with a very high load average could cause any issue with the startd on the execute node? My understanding is that if the shadow connection is broken, the jobs should simply be removed, but the messages in the SharedPortLog seem to indicate that the schedd and startd were not able to communicate with each other. Is this message a consequence of the startd hang, or is it the culprit?

07/07/20 01:08:35 SharedPortClient - server response deadline has passed for 12881_b8dd_3 as requested by SCHEDD

Thanks & Regards,
Vikrant Aggarwal


On Tue, Jul 7, 2020 at 10:17 PM Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 7/7/2020 10:48 AM, ervikrant06@xxxxxxxxx wrote:
> Hello, Thanks for your response.
>
> Yes, it's showing that the startd was restarted by the condor_master because it was not responding, but this happens on a few nodes
> and not on all of the nodes that were running jobs. Any idea how long the master waits for the STARTD to respond, and what hypothesis we can
> draw from the current logs? Unfortunately this issue happens very rarely, so it will be difficult to capture debug output.
>

Hi,

By default, the condor_master will wait an hour for the startd to show some sign of life before it kills/restarts it.
This time is controlled by the knob NOT_RESPONDING_TIMEOUT, or specifically for the startd via
STARTD_NOT_RESPONDING_TIMEOUT. See details on these settings in this section of the Manual:

   https://htcondor.readthedocs.io/en/stable/admin-manual/configuration-macros.html#daemoncore-configuration-file-entries
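
For example, here is a minimal condor_config sketch for an execute node (the values are illustrative: 3600 seconds
matches the one-hour default mentioned above, and the STARTD-specific knob overrides the generic one; pick a value
that fits your site):

   # Seconds a DaemonCore daemon may go unresponsive before the
   # condor_master kills and restarts it (applies to all daemons)
   NOT_RESPONDING_TIMEOUT = 3600
   # Give just the startd extra slack, e.g. for slow scratch-space cleanup
   STARTD_NOT_RESPONDING_TIMEOUT = 7200

You can check what a given node is actually using with "condor_config_val STARTD_NOT_RESPONDING_TIMEOUT".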

The most common reason I've seen this happen is an overwhelmed filesystem, usually a shared filesystem such as an NFS
mount. For instance, on Linux the condor_startd relies on information in /proc/<pid> to monitor the resource
utilization of a job. If a Linux process is blocked on I/O (in 'D' state as reported by /bin/ps) for an extended
period of time, as can happen when a shared filesystem is overwhelmed, reading information from
/proc/<pid-in-IO-state> will also hang, which can end up blocking the startd depending on the version of HTCondor
being used. Another situation where I have seen the startd appear to hang is when a job puts a very large number (e.g.
hundreds of thousands) of files into its job scratch space; depending on the underlying filesystem, this may take the
startd a very long time to clean up. Between HTCondor v8.5.x and v8.8.x we have tried to improve how the startd
handles such scenarios.
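
If you want to check for these two scenarios the next time a node stops responding, something along these lines may
help (just a sketch using standard tools; condor_config_val EXECUTE prints the startd's job scratch directory):

   # list processes stuck in 'D' (uninterruptible I/O wait), often a sign of a hung NFS mount
   ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'
   # rough count of files left under the startd's execute (scratch) directory
   find "$(condor_config_val EXECUTE)" -xdev | wc -l

A node with long-lived 'D' state processes, or hundreds of thousands of files under EXECUTE, would point at one of
the scenarios above.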

Hope the above helps
Todd


> 07/07/20 02:16:58 ERROR: Child pid 4137244 appears hung! Killing it hard.
> 07/07/20 02:16:58 DefaultReaper unexpectedly called on pid 4137244, status 9.
> 07/07/20 02:16:58 The STARTD (pid 4137244) was killed because it was no longer responding
> 07/07/20 02:30:33 Sending obituary for "/usr/sbin/condor_startd"
> 07/07/20 02:30:33 restarting /usr/sbin/condor_startd in 10 seconds
> 07/07/20 02:30:49 DaemonCore: Can't receive command request from <IP ADDRESS> (perhaps a timeout?)
> 07/07/20 02:30:49 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 145918
> 07/07/20 02:31:00 Setting ready state 'Ready' for STARTD
>
>
> Thanks & Regards,
> Vikrant Aggarwal
>
>
> On Tue, Jul 7, 2020 at 8:38 PM Zach Miller <zmiller@xxxxxxxxxxx> wrote:
>
>     Hello,
>
>     Is the condor_startd getting restarted by the condor_master on that machine? What does it say in the MasterLog? Or
>     is all of HTCondor getting restarted by systemd or something?
>
>     Also, you can get more debugging information from tools, like condor_who, by:
>     1) Set environment variable _CONDOR_TOOL_DEBUG=D_ALL:2
>     2) Use the "-debug" flag, like: "condor_who -debug"
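>     (For example, both steps combined on one shell line: _CONDOR_TOOL_DEBUG=D_ALL:2 condor_who -debug)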
>
>     Hopefully something in the output from the above ideas will give us a clue. Thanks!
>
>
>     Cheers,
>     -zach
>
>
>     On 7/7/20, 5:49 AM, "HTCondor-users on behalf of ervikrant06@xxxxxxxxx"
>     <htcondor-users-bounces@xxxxxxxxxxx on behalf of ervikrant06@xxxxxxxxx> wrote:
>
>          Hello Experts,
>          We are running dev version 8.5.8 in 3 pools. Since it's a very old, unsupported version, we are working on
>     upgrading to 8.8.5. This version was working fine for a very long period of time until recently, when we started
>     seeing the condor_startd process getting randomly restarted on some of the nodes. During troubleshooting, the only
>     common thing we found is that most of the affected nodes were running jobs submitted from a schedd exhibiting a
>     very high load average, but jobs submitted from that schedd were distributed everywhere, so we are not sure whether
>     this is the only cause.
>
>          Messages reported both on nodes where condor_startd restarted and on nodes where it did not:
>
>          07/07/20 01:10:55 condor_write(): Socket closed when trying to write 13 bytes to , fd is 15
>          07/07/20 01:10:55 Buf::write(): condor_write() failed
>          07/07/20 01:10:55 SharedPortEndpoint: failed to send final status (success) for SHARED_PORT_PASS_SOCK
>          07/07/20 01:10:55 condor_write(): Socket closed when trying to write 286 bytes to <10.10.10.11:52390>, fd is 16
>          07/07/20 01:10:55 Buf::write(): condor_write() failed
>          07/07/20 01:10:55 SECMAN: Error sending response classad to <10.10.10.11:52390>!
>
>
>          We have also seen the following kind of messages in the SharedPortLog file:
>
>          07/07/20 01:08:03 SharedPortClient - server response deadline has passed for 12881_b8dd_3 as requested by TOOL on <10.10.10.11:32637>
>          07/07/20 01:08:03 SharedPortClient - server response deadline has passed for 12881_b8dd_3 as requested by TOOL on <10.10.10.11:7496>
>          07/07/20 01:08:04 SharedPortClient - server response deadline has passed for 12881_b8dd_3 as requested by SCHEDD <10.10.10.12:9618?addrs=10.10.10.12-9618&noUDP&sock=7179_16e5_3> on <10.10.10.12:47015>
>          07/07/20 01:08:35 SharedPortClient - server response deadline has passed for 12881_b8dd_3 as requested by SCHEDD <10.10.10.12:9618?addrs=10.10.10.12-9618&noUDP&sock=9654_bbcb_3> on <10.10.10.12:35369>
>
>
>          These messages are reported on all nodes during this issue, no matter whether condor_startd restarted or not. I
>     can't figure out why condor_startd restarted on only a few of the nodes where these messages were seen, and not on
>     all of them. condor_who was showing the following error during the time of the issue, and the problematic nodes were
>     not reporting their status back to the HTCondor collector/negotiator (e.g. they were missing from condor_status
>     -compact). I believe the restart happened only on nodes which were not reporting in condor_status -compact. Any
>     thoughts on this issue, what is causing it, and how condor decides to restart the service would be very helpful.
>
>          # condor_who
>          Error: communication error
>          SECMAN:2007:Failed to end classad message.
>
>          Thanks & Regards,
>          Vikrant Aggarwal
>
>
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
>


--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
HTCondor Technical Lead                 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685