
Re: [HTCondor-users] condor_startd restart on some nodes randomly



Many thanks for your reply, Todd.

Neither scenario seems to apply in this case. Do you think a submitter with a very high load average could cause any issue with the startd on the execute node? My understanding is that if the shadow connection is broken, the jobs should simply be removed, but the messages in the SharedPortLog seem to indicate that the schedd and startd were not able to communicate with each other. Is this message a consequence of the startd hang, or is it the culprit?

07/07/20 01:08:35 SharedPortClient - server response deadline has passed for 12881_b8dd_3 as requested by SCHEDD

Thanks & Regards,
Vikrant Aggarwal


On Tue, Jul 7, 2020 at 10:17 PM Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 7/7/2020 10:48 AM, ervikrant06@xxxxxxxxx wrote:
> Hello, Thanks for your response.
>
> Yes, it's showing that the startd was restarted by the condor_master because it was not responding, but this happens on a few nodes
> and not on all of the nodes that were running jobs. Any idea how long the master waits for the STARTD to respond, and what hypothesis we can
> draw from the current logs? Unfortunately this issue happens very rarely, so it will be difficult to capture debug output.
>

Hi,

By default, the condor_master will wait an hour for the startd to show some sign of life before it kills/restarts it.
This time is controlled by the knob NOT_RESPONDING_TIMEOUT, or specifically for the startd via
STARTD_NOT_RESPONDING_TIMEOUT. See details on these settings in this section of the Manual:

   https://htcondor.readthedocs.io/en/stable/admin-manual/configuration-macros.html#daemoncore-configuration-file-entries
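
For example, here is a minimal condor_config sketch for an execute node (the values are illustrative: 3600 seconds
matches the one-hour default mentioned above, and the STARTD-specific knob overrides the generic one; pick a value
that fits your site):

   # Seconds a DaemonCore daemon may go unresponsive before the
   # condor_master kills and restarts it (applies to all daemons)
   NOT_RESPONDING_TIMEOUT = 3600
   # Give just the startd extra slack, e.g. for slow scratch-space cleanup
   STARTD_NOT_RESPONDING_TIMEOUT = 7200

You can check what a given node is actually using with "condor_config_val STARTD_NOT_RESPONDING_TIMEOUT".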

The most common reason I've seen this happen is an overwhelmed filesystem, usually a shared filesystem such as an NFS
mount. For instance, on Linux the condor_startd relies on information in /proc/<pid> to monitor the resource
utilization of a job. If a Linux process is blocked on I/O (in 'D' state as reported by /bin/ps) for an extended
period of time, as can happen when a shared filesystem is overwhelmed, reading information from
/proc/<pid-in-IO-state> will also hang, which can end up blocking the startd depending on the version of HTCondor
being used. Another situation where I have seen the startd appear to hang is when a job puts a very large number (e.g.
hundreds of thousands) of files into its job scratch space; depending on the underlying filesystem, this may take the
startd a very long time to clean up. Between HTCondor v8.5.x and v8.8.x we have tried to improve how the startd
handles such scenarios.
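
If you want to check for these two scenarios the next time a node stops responding, something along these lines may
help (just a sketch using standard tools; condor_config_val EXECUTE prints the startd's job scratch directory):

   # list processes stuck in 'D' (uninterruptible I/O wait), often a sign of a hung NFS mount
   ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'
   # rough count of files left under the startd's execute (scratch) directory
   find "$(condor_config_val EXECUTE)" -xdev | wc -l

A node with long-lived 'D' state processes, or hundreds of thousands of files under EXECUTE, would point at one of
the scenarios above.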

Hope the above helps
Todd


> 07/07/20 02:16:58 ERROR: Child pid 4137244 appears hung! Killing it hard.
> 07/07/20 02:16:58 DefaultReaper unexpectedly called on pid 4137244, status 9.
> 07/07/20 02:16:58 The STARTD (pid 4137244) was killed because it was no longer responding
> 07/07/20 02:30:33 Sending obituary for "/usr/sbin/condor_startd"
> 07/07/20 02:30:33 restarting /usr/sbin/condor_startd in 10 seconds
> 07/07/20 02:30:49 DaemonCore: Can't receive command request from <IP ADDRESS> (perhaps a timeout?)
> 07/07/20 02:30:49 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 145918
> 07/07/20 02:31:00 Setting ready state 'Ready' for STARTD
>
>
> Thanks & Regards,
> Vikrant Aggarwal
>
>
> On Tue, Jul 7, 2020 at 8:38 PM Zach Miller <zmiller@xxxxxxxxxxx> wrote:
>
>     Hello,
>
>     Is the condor_startd getting restarted by the condor_master on that machine? What does it say in the MasterLog? Or
>     is all of HTCondor getting restarted by systemd or something?
>
>     Also, you can get more debugging information from tools, like condor_who, by:
>     1) Set environment variable _CONDOR_TOOL_DEBUG=D_ALL:2
>     2) Use the "-debug" flag, like: "condor_who -debug"
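>     (For example, both steps combined on one shell line: _CONDOR_TOOL_DEBUG=D_ALL:2 condor_who -debug)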
>
>     Hopefully something in the output from the above ideas will give us a clue. Thanks!
>
>
>     Cheers,
>     -zach
>
>
>     On 7/7/20, 5:49 AM, "HTCondor-users on behalf of ervikrant06@xxxxxxxxx"
>     <htcondor-users-bounces@xxxxxxxxxxx on behalf of ervikrant06@xxxxxxxxx> wrote:
>
>          Hello Experts,
>          We are running dev version 8.5.8 in 3 pools. Since it's a very old, unsupported version, we are working on
>     upgrading to 8.8.5. This version was working fine for a very long period of time until recently, when we started
>     seeing the condor_startd process getting randomly restarted on some of the nodes. During troubleshooting, the only
>     common thing we found is that most of the affected nodes were running jobs submitted from a schedd exhibiting a
>     very high load average, but jobs submitted from that schedd were distributed everywhere, so we are not sure whether
>     this is the only cause.
>
>          Messages reported both on nodes where condor_startd restarted and on nodes where it did not:
>
>          07/07/20 01:10:55 condor_write(): Socket closed when trying to write 13 bytes to , fd is 15
>          07/07/20 01:10:55 Buf::write(): condor_write() failed
>          07/07/20 01:10:55 SharedPortEndpoint: failed to send final status (success) for SHARED_PORT_PASS_SOCK
>          07/07/20 01:10:55 condor_write(): Socket closed when trying to write 286 bytes to <10.10.10.11:52390>, fd is 16
>          07/07/20 01:10:55 Buf::write(): condor_write() failed
>          07/07/20 01:10:55 SECMAN: Error sending response classad to <10.10.10.11:52390>!
>
>
>          We have also seen the following kind of messages in the SharedPortLog file:
>
>          07/07/20 01:08:03 SharedPortClient - server response deadline has passed for 12881_b8dd_3 as requested by TOOL on <10.10.10.11:32637>
>          07/07/20 01:08:03 SharedPortClient - server response deadline has passed for 12881_b8dd_3 as requested by TOOL on <10.10.10.11:7496>
>          07/07/20 01:08:04 SharedPortClient - server response deadline has passed for 12881_b8dd_3 as requested by SCHEDD <10.10.10.12:9618?addrs=10.10.10.12-9618&noUDP&sock=7179_16e5_3> on <10.10.10.12:47015>
>          07/07/20 01:08:35 SharedPortClient - server response deadline has passed for 12881_b8dd_3 as requested by SCHEDD <10.10.10.12:9618?addrs=10.10.10.12-9618&noUDP&sock=9654_bbcb_3> on <10.10.10.12:35369>
>
>
>          These messages are reported on all nodes during this issue, no matter whether condor_startd restarted or not. I
>     can't figure out why condor_startd restarted on only a few of the nodes where these messages were seen, and not on
>     all of them. condor_who was showing the following error during the time of the issue, and the problematic nodes were
>     not reporting their status back to the HTCondor collector/negotiator (e.g. they were missing from condor_status
>     -compact). I believe the restart happened only on nodes which were not reporting in condor_status -compact. Any
>     thoughts on this issue, what is causing it, and how condor decides to restart the service would be very helpful.
>
>          # condor_who
>          Error: communication error
>          SECMAN:2007:Failed to end classad message.
>
>          Thanks & Regards,
>          Vikrant Aggarwal
>
>
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
>


--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
HTCondor Technical Lead                 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685