
[HTCondor-users] HTCondor daemons dying on SL7 worker nodes



Hi,

I'm experiencing problems with HTCondor 8.5.5 on SL7 worker nodes that run only Docker universe jobs. Occasionally all of the HTCondor daemons on a worker node do a stack dump and die: the master, the startd and all of the starters.

An interesting side effect is that although HTCondor deletes the job sandboxes, the Docker containers actually keep running. HTCondor seems unaware of this and eventually starts another set of jobs, so I end up with twice as many jobs running on an affected worker node as there should be, half of them no longer under HTCondor's control.
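
For now I clean up the orphaned containers by hand with something along these lines, assuming the containers carry the usual HTCJob name prefix that the docker universe gives them (worth confirming with docker ps on an affected node before stopping anything):

  # list containers left behind by the dead starters
  docker ps --filter "name=HTCJob"

  # once sure HTCondor is no longer tracking them, stop and remove them
  docker ps -q --filter "name=HTCJob" | xargs -r docker stop
  docker ps -aq --filter "name=HTCJob" | xargs -r docker rm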

In /var/log/messages I see this (it sometimes repeats several times in a row):

2016-07-27T20:20:09.881844+01:00 lcg1879 systemd: condor.service watchdog timeout (limit 5s)!
2016-07-27T20:20:10.068139+01:00 lcg1879 systemd: condor.service: main process exited, code=killed, status=6/ABRT
2016-07-27T20:20:10.157858+01:00 lcg1879 systemd: Unit condor.service entered failed state.
2016-07-27T20:20:10.158120+01:00 lcg1879 systemd: condor.service failed.
2016-07-27T20:20:15.381427+01:00 lcg1879 systemd: condor.service holdoff time over, scheduling restart.
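
As a workaround I'm tempted to lengthen the watchdog interval (which, going by the message above, is only 5s) with a systemd drop-in along these lines, though I haven't tried it yet and it obviously only papers over whatever is making the daemons unresponsive:

  # /etc/systemd/system/condor.service.d/watchdog.conf
  [Service]
  WatchdogSec=60

  systemctl daemon-reload
  systemctl restart condor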

In /var/log/condor/StartLog I see this:

Stack dump for process 27243 at timestamp 1469647209 (9 frames)
/lib64/libcondor_utils_8_5_5.so(dprintf_dump_stack+0x72)[0x7fe4f2118722]
/lib64/libcondor_utils_8_5_5.so(_Z18linux_sig_coredumpi+0x24)[0x7fe4f2276224]
/lib64/libpthread.so.0(+0xf100)[0x7fe4f0ab9100]
/lib64/libc.so.6(__select+0x13)[0x7fe4f07d6993]
/lib64/libcondor_utils_8_5_5.so(_ZN8Selector7executeEv+0xa6)[0x7fe4f2183056]
/lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore6DriverEv+0x1052)[0x7fe4f2259ef2]
/lib64/libcondor_utils_8_5_5.so(_Z7dc_mainiPPc+0x13a4)[0x7fe4f2279824]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fe4f070ab15]
condor_startd[0x422a79]

while every StarterLog has something like this:

Stack dump for process 1211947 at timestamp 1469647209 (9 frames)
/lib64/libcondor_utils_8_5_5.so(dprintf_dump_stack+0x72)[0x7f94ab25d722]
/lib64/libcondor_utils_8_5_5.so(_Z18linux_sig_coredumpi+0x24)[0x7f94ab3bb224]
/lib64/libpthread.so.0(+0xf100)[0x7f94a9bfe100]
/lib64/libc.so.6(__select+0x13)[0x7f94a991b993]
/lib64/libcondor_utils_8_5_5.so(_ZN8Selector7executeEv+0xa6)[0x7f94ab2c8056]
/lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore6DriverEv+0x1052)[0x7f94ab39eef2]
/lib64/libcondor_utils_8_5_5.so(_Z7dc_mainiPPc+0x13a4)[0x7f94ab3be824]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f94a984fb15]
condor_starter[0x4220e9]

and finally the MasterLog:

Stack dump for process 27213 at timestamp 1469647209 (29 frames)
/lib64/libcondor_utils_8_5_5.so(dprintf_dump_stack+0x72)[0x7fb1916f0722]
/lib64/libcondor_utils_8_5_5.so(_Z18linux_sig_coredumpi+0x24)[0x7fb19184e224]
/lib64/libpthread.so.0(+0xf100)[0x7fb190091100]
/lib64/libc.so.6(__poll+0x10)[0x7fb18fdacc20]
/lib64/libresolv.so.2(+0xade7)[0x7fb191b5ade7]
/lib64/libresolv.so.2(__libc_res_nquery+0x18e)[0x7fb191b58cce]
/lib64/libresolv.so.2(__libc_res_nsearch+0x350)[0x7fb191b598b0]
/lib64/libnss_dns.so.2(_nss_dns_gethostbyname4_r+0x103)[0x7fb18f26fc53]
/lib64/libc.so.6(+0xdc1a8)[0x7fb18fd9d1a8]
/lib64/libc.so.6(getaddrinfo+0xfd)[0x7fb18fda086d]
/lib64/libcondor_utils_8_5_5.so(_Z16ipv6_getaddrinfoPKcS0_R17addrinfo_iteratorRK8addrinfo+0x49)[0x7fb191791659]
/lib64/libcondor_utils_8_5_5.so(_Z20resolve_hostname_rawRK8MyString+0x15a)[0x7fb1916c522a]
/lib64/libcondor_utils_8_5_5.so(_Z16resolve_hostnameRK8MyString+0x93)[0x7fb1916c54f3]
/lib64/libcondor_utils_8_5_5.so(_Z16resolve_hostnamePKc+0x1f)[0x7fb1916c596f]
/lib64/libcondor_utils_8_5_5.so(_ZN8IpVerify10fill_tableEPNS_13PermTypeEntryEPcb+0x70c)[0x7fb1917c058c]
/lib64/libcondor_utils_8_5_5.so(_ZN8IpVerify4InitEv+0x12d)[0x7fb1917c092d]
/lib64/libcondor_utils_8_5_5.so(_ZN8IpVerify6VerifyE12DCpermissionRK15condor_sockaddrPKcP8MyStringS7_+0x2b8)[0x7fb1917c2248]
/lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore6VerifyEPKc12DCpermissionRK15condor_sockaddrS1_+0x81)[0x7fb19181fc71]
/lib64/libcondor_utils_8_5_5.so(_ZN21DaemonCommandProtocol13VerifyCommandEv+0x45a)[0x7fb19184ac2a]
/lib64/libcondor_utils_8_5_5.so(_ZN21DaemonCommandProtocol10doProtocolEv+0xbd)[0x7fb19184c2fd]
/lib64/libcondor_utils_8_5_5.so(_ZN21DaemonCommandProtocol14SocketCallbackEP6Stream+0x7f)[0x7fb19184c48f]
/lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x694)[0x7fb19182d634]
/lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x1d)[0x7fb19182d76d]
/lib64/libcondor_utils_8_5_5.so(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x35)[0x7fb191731865]
/lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore17CallSocketHandlerERib+0x14a)[0x7fb191828eea]
/lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore6DriverEv+0x1d7b)[0x7fb191832c1b]
/lib64/libcondor_utils_8_5_5.so(_Z7dc_mainiPPc+0x13a4)[0x7fb191851824]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb18fce2b15]
/usr/sbin/condor_master[0x40a5c1]

Has anyone else seen this? It's not obvious to me from the timestamps in the logs whether it was systemd that killed all the HTCondor daemons because of the watchdog timeout (I guess it probably was?) or whether everything died first and systemd then noticed. It only seems to happen when a worker node is very busy; I've never seen it on an idle SL7 worker node.
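
To try to work out the ordering I've been lining up the journal against the daemon logs with something like the commands below, but the one-second resolution of the timestamps in the condor stack dumps doesn't make it conclusive:

  journalctl -u condor.service -o short-precise --since "2016-07-27 20:19" --until "2016-07-27 20:21"
  date -d @1469647209    # the timestamp from the stack dumps above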

Thanks,
Andrew.