
Re: [HTCondor-users] HTCondor 8.6.13 stack dump



Hello all,

Thank you for your responses!

Dimitri, the nodes are Dell and Supermicro twins with two E5-2680 v4 processors (there are 56 cores using HT per machine but we just offer 48 slots to HTCondor).

Greg, I will send you the log right now, thank you very much.

Brian, yes, you're right. Yesterday I discovered that there were several "condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 988" processes still running on the failing machines; unfortunately, the ps command ends up hanging.
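In case it helps anyone else hitting the same hang: since ps itself can block, the stale condor_procd processes can also be spotted by reading /proc directly. A minimal sketch (assuming Linux; the pattern string is just the daemon name, adjust as needed):

```python
import os

def find_procs(pattern):
    """Scan /proc directly for processes whose command line contains
    `pattern` -- avoids shelling out to ps, which can hang here."""
    matches = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-PID entries such as /proc/meminfo
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                # cmdline is NUL-separated; join arguments with spaces
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited between listdir() and open()
        if pattern in cmdline:
            matches.append((int(entry), cmdline.strip()))
    return matches

if __name__ == "__main__":
    for pid, cmd in find_procs("condor_procd"):
        print(pid, cmd)
```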

Considering that the crash, according to the core.XXX file, happened yesterday at 18:01, I can see several messages like this in the ProcLog:

11/14/18 18:04:42 : Procd has a watcher pid and will die if pid 47834 dies.
11/14/18 18:04:42 : Initializing cgroup library.
11/14/18 18:04:42 : taking a snapshot...
11/14/18 18:28:13 : ***********************************
11/14/18 18:28:13 : * condor_procd STARTING UP
11/14/18 18:28:13 : * PID = 50596
11/14/18 18:28:13 : * UID = 0
11/14/18 18:28:13 : * GID = 0
11/14/18 18:28:13 : ***********************************
[...]

Moreover, in the SharedPortLog at 18:01 exactly:

SharedPortCommandSinfuls = "<192.168.100.56:9618>,<[::1]:9618>"
11/14/18 18:01:11 condor_read() failed: recv(fd=6) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from .
11/14/18 18:01:11 IO: Failed to read packet header
11/14/18 18:01:11 SharedPortClient: failed to receive result for SHARED_PORT_PASS_FD to 45848_7e35 as requested by <192.168.100.56:38473>: Connection reset by peer
11/14/18 18:01:11 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <192.168.100.56:0> (try 1 of 3): CEDAR:6001:Failed to connect to <192.168.100.56:0?sock=45848_7e35>
11/14/18 18:01:11 ChildAliveMsg: giving up because deadline expired for sending DC_CHILDALIVE to parent.
11/14/18 18:01:11 About to update statistics in shared_port daemon ad file at /var/lock/condor/shared_port_ad :
[...]

I will investigate further in this direction.

Thank you again.

Cheers,

Carles



On Wed, 14 Nov 2018 at 21:11, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:


On Nov 14, 2018, at 2:50 AM, Carles Acosta <cacosta@xxxxxx> wrote:

Hello all,

We have recently updated our worker nodes to CentOS 7 and HTCondor 8.6.13. On our WNs with more slots (48), we are seeing the condor_master crash from time to time with this error:

Caught signal 6: si_code=0, si_pid=1, si_uid=0, si_addr=0x1

Signal 6 is SIGABRT -- often something that HTCondor does when some sort of assumption in the code fails. It's not actually segfaulting, which is probably a good sign!
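For reference, signal 6 on Linux is indeed SIGABRT, the signal abort() raises on a failed assertion. A quick sketch that reproduces the signal number in a throwaway child process (assumes a Unix system):

```python
import os
import signal

# Fork a child that calls abort(), then confirm the parent sees
# termination by signal 6 (SIGABRT), matching the "Caught signal 6" line.
pid = os.fork()
if pid == 0:
    os.abort()  # child: raises SIGABRT against itself
_, status = os.waitpid(pid, 0)
assert os.WIFSIGNALED(status)
print("child killed by signal", os.WTERMSIG(status))  # -> 6
print(os.WTERMSIG(status) == signal.SIGABRT)          # -> True
```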

From the stack trace, the problem is occurring when it's trying to talk to the procd; hence, there may also be some useful information in the ProcdLog.

Stack dump for process 23309 at timestamp 1542183984 (16 frames)
/usr/lib64/libcondor_utils_8_6_13.so(dprintf_dump_stack+0x24)[0x7ff576c8b2e4]
/usr/lib64/libcondor_utils_8_6_13.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x69)[0x7ff576e15db9]
/usr/lib64/libpthread.so.0(+0xf6d0)[0x7ff5753666d0]
/usr/lib64/libpthread.so.0(read+0x10)[0x7ff5753657e0]
/usr/lib64/libcondor_utils_8_6_13.so(_ZN10DaemonCore9Read_PipeEiPvi+0x6c)[0x7ff576de538c]
/usr/lib64/libcondor_utils_8_6_13.so(_ZN15ProcFamilyProxy11start_procdEv+0x728)[0x7ff576d12e78]
/usr/lib64/libcondor_utils_8_6_13.so(_ZN15ProcFamilyProxyC1EPKc+0x1f8)[0x7ff576d13748]
/usr/lib64/libcondor_utils_8_6_13.so(_ZN19ProcFamilyInterface6createEPKc+0x5e)[0x7ff576c7c4ee]
/usr/lib64/libcondor_utils_8_6_13.so(_ZN10DaemonCore14Create_ProcessEPKcRK7ArgList10priv_stateiiiPK3EnvS1_P10FamilyInfoPP6StreamPiSE_iP10__sigset_tiPmSE_S1_P8MyStringP15FilesystemRemapl+0x19e8)[0x7ff576dff188]
/usr/sbin/condor_master(_ZN6daemon9RealStartEv+0xca3)[0x410303]
/usr/sbin/condor_master(_ZN7Daemons15StartDaemonHereEP6daemon+0x26)[0x410e06]
/usr/sbin/condor_master(_ZN7Daemons15StartAllDaemonsEv+0x68)[0x410e78]
/usr/sbin/condor_master(_Z9main_initiPPc+0x6ff)[0x41583f]
/usr/lib64/libcondor_utils_8_6_13.so(_Z7dc_mainiPPc+0x138d)[0x7ff576e1942d]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff574fac445]
/usr/sbin/condor_master[0x40a6af]
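The top frames show the master blocked in read() on the pipe it uses to talk to the procd (DaemonCore::Read_Pipe under ProcFamilyProxy::start_procd), which fits a procd that never answers. As a hypothetical illustration only (not HTCondor's actual code), such a read can be guarded with a timeout via select so it fails fast instead of blocking forever:

```python
import os
import select

def read_pipe_with_timeout(fd, nbytes, timeout_s):
    """Wait up to timeout_s for data on fd, then read; returns None on
    timeout instead of blocking forever like a plain read()."""
    ready, _, _ = select.select([fd], [], [], timeout_s)
    if not ready:
        return None  # the peer (e.g. a wedged procd) never wrote
    return os.read(fd, nbytes)

# demo on an in-process pipe
r, w = os.pipe()
print(read_pipe_with_timeout(r, 16, 0.1))  # nothing written yet -> None
os.write(w, b"ok")
print(read_pipe_with_timeout(r, 16, 0.1))  # -> b'ok'
```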

The /var/log/condor directory has a lot of core.XXXX dump files.

This is not happening on the rest of our machines, which have fewer slots. The WNs can run jobs for several days to one week without any issue until the crash. Any ideas? Increasing the stack size of the condor user with ulimit does not solve the issue.

Cheers,

Carles



--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
Avís - Aviso - Legal Notice: http://www.ifae.es/legal.html
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://www.ifae.es/legal.html