
[HTCondor-users] HTCondor daemons dying on SL7 worker nodes



Hi,

I'm experiencing problems with HTCondor 8.5.5 on SL7 worker nodes that run only Docker universe jobs. Occasionally all of the HTCondor daemons on a worker node do a stack dump and die: the master, the startd and all of the starters.

An interesting side effect is that although HTCondor deletes the job sandboxes, the Docker containers actually keep running. HTCondor seems unaware of this and eventually starts another set of jobs, so I end up with twice as many jobs running on an affected worker node as there should be, half of them no longer under HTCondor's control.
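
For now I clean up the orphaned containers by hand with something along these lines, assuming the containers carry the usual HTCJob name prefix that the docker universe gives them (worth confirming with docker ps on an affected node before stopping anything):

  # list containers left behind by the dead starters
  docker ps --filter "name=HTCJob"

  # once sure HTCondor is no longer tracking them, stop and remove them
  docker ps -q --filter "name=HTCJob" | xargs -r docker stop
  docker ps -aq --filter "name=HTCJob" | xargs -r docker rm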

In /var/log/messages I see this (it sometimes repeats several times in a row):

2016-07-27T20:20:09.881844+01:00 lcg1879 systemd: condor.service watchdog timeout (limit 5s)!
2016-07-27T20:20:10.068139+01:00 lcg1879 systemd: condor.service: main process exited, code=killed, status=6/ABRT
2016-07-27T20:20:10.157858+01:00 lcg1879 systemd: Unit condor.service entered failed state.
2016-07-27T20:20:10.158120+01:00 lcg1879 systemd: condor.service failed.
2016-07-27T20:20:15.381427+01:00 lcg1879 systemd: condor.service holdoff time over, scheduling restart.
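
As a workaround I'm tempted to lengthen the watchdog interval (which, going by the message above, is only 5s) with a systemd drop-in along these lines, though I haven't tried it yet and it obviously only papers over whatever is making the daemons unresponsive:

  # /etc/systemd/system/condor.service.d/watchdog.conf
  [Service]
  WatchdogSec=60

  systemctl daemon-reload
  systemctl restart condor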

In /var/log/condor/StartLog I see this:

Stack dump for process 27243 at timestamp 1469647209 (9 frames)
/lib64/libcondor_utils_8_5_5.so(dprintf_dump_stack+0x72)[0x7fe4f2118722]
/lib64/libcondor_utils_8_5_5.so(_Z18linux_sig_coredumpi+0x24)[0x7fe4f2276224]
/lib64/libpthread.so.0(+0xf100)[0x7fe4f0ab9100]
/lib64/libc.so.6(__select+0x13)[0x7fe4f07d6993]
/lib64/libcondor_utils_8_5_5.so(_ZN8Selector7executeEv+0xa6)[0x7fe4f2183056]
/lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore6DriverEv+0x1052)[0x7fe4f2259ef2]
/lib64/libcondor_utils_8_5_5.so(_Z7dc_mainiPPc+0x13a4)[0x7fe4f2279824]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fe4f070ab15]
condor_startd[0x422a79]

while every StarterLog has something like this:

Stack dump for process 1211947 at timestamp 1469647209 (9 frames)
/lib64/libcondor_utils_8_5_5.so(dprintf_dump_stack+0x72)[0x7f94ab25d722]
/lib64/libcondor_utils_8_5_5.so(_Z18linux_sig_coredumpi+0x24)[0x7f94ab3bb224]
/lib64/libpthread.so.0(+0xf100)[0x7f94a9bfe100]
/lib64/libc.so.6(__select+0x13)[0x7f94a991b993]
/lib64/libcondor_utils_8_5_5.so(_ZN8Selector7executeEv+0xa6)[0x7f94ab2c8056]
/lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore6DriverEv+0x1052)[0x7f94ab39eef2]
/lib64/libcondor_utils_8_5_5.so(_Z7dc_mainiPPc+0x13a4)[0x7f94ab3be824]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f94a984fb15]
condor_starter[0x4220e9]

and finally the MasterLog:

Stack dump for process 27213 at timestamp 1469647209 (29 frames)
/lib64/libcondor_utils_8_5_5.so(dprintf_dump_stack+0x72)[0x7fb1916f0722]
/lib64/libcondor_utils_8_5_5.so(_Z18linux_sig_coredumpi+0x24)[0x7fb19184e224]
/lib64/libpthread.so.0(+0xf100)[0x7fb190091100]
/lib64/libc.so.6(__poll+0x10)[0x7fb18fdacc20]
/lib64/libresolv.so.2(+0xade7)[0x7fb191b5ade7]
/lib64/libresolv.so.2(__libc_res_nquery+0x18e)[0x7fb191b58cce]
/lib64/libresolv.so.2(__libc_res_nsearch+0x350)[0x7fb191b598b0]
/lib64/libnss_dns.so.2(_nss_dns_gethostbyname4_r+0x103)[0x7fb18f26fc53]
/lib64/libc.so.6(+0xdc1a8)[0x7fb18fd9d1a8]
/lib64/libc.so.6(getaddrinfo+0xfd)[0x7fb18fda086d]
/lib64/libcondor_utils_8_5_5.so(_Z16ipv6_getaddrinfoPKcS0_R17addrinfo_iteratorRK8addrinfo+0x49)[0x7fb191791659]
/lib64/libcondor_utils_8_5_5.so(_Z20resolve_hostname_rawRK8MyString+0x15a)[0x7fb1916c522a]
/lib64/libcondor_utils_8_5_5.so(_Z16resolve_hostnameRK8MyString+0x93)[0x7fb1916c54f3]
/lib64/libcondor_utils_8_5_5.so(_Z16resolve_hostnamePKc+0x1f)[0x7fb1916c596f]
/lib64/libcondor_utils_8_5_5.so(_ZN8IpVerify10fill_tableEPNS_13PermTypeEntryEPcb+0x70c)[0x7fb1917c058c]
/lib64/libcondor_utils_8_5_5.so(_ZN8IpVerify4InitEv+0x12d)[0x7fb1917c092d]
/lib64/libcondor_utils_8_5_5.so(_ZN8IpVerify6VerifyE12DCpermissionRK15condor_sockaddrPKcP8MyStringS7_+0x2b8)[0x7fb1917c2248]
/lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore6VerifyEPKc12DCpermissionRK15condor_sockaddrS1_+0x81)[0x7fb19181fc71]
/lib64/libcondor_utils_8_5_5.so(_ZN21DaemonCommandProtocol13VerifyCommandEv+0x45a)[0x7fb19184ac2a]
/lib64/libcondor_utils_8_5_5.so(_ZN21DaemonCommandProtocol10doProtocolEv+0xbd)[0x7fb19184c2fd]
/lib64/libcondor_utils_8_5_5.so(_ZN21DaemonCommandProtocol14SocketCallbackEP6Stream+0x7f)[0x7fb19184c48f]
/lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x694)[0x7fb19182d634]
/lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x1d)[0x7fb19182d76d]
/lib64/libcondor_utils_8_5_5.so(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x35)[0x7fb191731865]
/lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore17CallSocketHandlerERib+0x14a)[0x7fb191828eea]
/lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore6DriverEv+0x1d7b)[0x7fb191832c1b]
/lib64/libcondor_utils_8_5_5.so(_Z7dc_mainiPPc+0x13a4)[0x7fb191851824]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb18fce2b15]
/usr/sbin/condor_master[0x40a5c1]

Has anyone else seen this? It's not obvious to me from the timestamps in the logs whether it was systemd that killed all the HTCondor daemons because of the watchdog timeout (I guess it probably was?) or whether everything died first and systemd then noticed. It only seems to happen when a worker node is very busy; I've never seen it on an idle SL7 worker node.
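
To try to work out the ordering I've been lining up the journal against the daemon logs with something like the commands below, but the one-second resolution of the timestamps in the condor stack dumps doesn't make it conclusive:

  journalctl -u condor.service -o short-precise --since "2016-07-27 20:19" --until "2016-07-27 20:21"
  date -d @1469647209    # the timestamp from the stack dumps above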

Thanks,
Andrew.