
Re: [HTCondor-users] Condor_collector crashing in HTCondor-CE



Dear all,

I have news about our issue with the condor_collector. We updated HTCondor on our CE to the stable version 8.6.4. Unfortunately, the error persisted at first: the condor_collector crashed again after a restart with CAs 1.8.4, and we had to kill all running jobs before the condor_collector daemon would run again without crashing.

I've been looking at Yutaro Iiyama's issue, and it seems quite similar, although it affects the condor_schedd daemon instead (our error is attached). We are running a dual-stack pool with just one IPv6-only WN, so the general configuration of our condor pool is:

ENABLE_IPV4 = auto
ENABLE_IPV6 = auto
PREFER_IPV4 = true
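To illustrate what these settings mean for a pool that contains an IPv6-only WN, here is a simplified, hypothetical model (not HTCondor's actual code) of how a dual-stack daemon might pick one of a peer's advertised addresses. The names NetConfig and choose_endpoint are invented for illustration, and "auto" is assumed to have already been resolved to true/false for the local host.

```python
from dataclasses import dataclass


@dataclass
class NetConfig:
    # Hypothetical model: "auto" is assumed to be resolved to True/False already.
    enable_ipv4: bool
    enable_ipv6: bool
    prefer_ipv4: bool


def choose_endpoint(endpoints, cfg):
    """Pick one (ip_version, host, port) tuple from a peer's advertised list.

    Simplified model: drop endpoints whose family is disabled locally, then
    put the preferred family first. Returns None if nothing is usable.
    """
    usable = [
        e for e in endpoints
        if (e[0] == 4 and cfg.enable_ipv4) or (e[0] == 6 and cfg.enable_ipv6)
    ]
    if not usable:
        return None
    preferred = 4 if cfg.prefer_ipv4 else 6
    # False sorts before True, so endpoints of the preferred family come first.
    usable.sort(key=lambda e: e[0] != preferred)
    return usable[0]


cfg = NetConfig(enable_ipv4=True, enable_ipv6=True, prefer_ipv4=True)
dual_stack_peer = [(4, "192.0.2.10", 9618), (6, "2001:db8::10", 9618)]
ipv6_only_wn = [(6, "2001:db8::20", 9618)]
print(choose_endpoint(dual_stack_peer, cfg))  # IPv4 endpoint is preferred
print(choose_endpoint(ipv6_only_wn, cfg))     # only IPv6 is available
```

In this model, PREFER_IPV4 = true only reorders the candidates: a dual-stack peer is reached over IPv4, but the IPv6-only WN can still only be reached over IPv6, so the CE has to handle both families.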

I've added the same lines to the condor-ce configuration. I've tried a few condor-ce restarts and now it seems stable, but I'm seeing several messages like the ones that appeared before the crash:

Failed to send DC_INVALIDATE_KEY to daemon at <IPV4:3196>: SECMAN:2003:TCP connection to daemon at <IPV4:31986> failed.

DC_AUTHENTICATE: attempt to open invalid session ce13:1272309:1501056815:3, failing; this session was requested by <IPV4:27682> with return address <IPV4?addrs=IPV4-21917+[IPV6]-21917&alias=name>
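The return address in the second message is an HTCondor "sinful" string whose addrs parameter lists both an IPv4 and an IPv6 endpoint for the same port. The sketch below splits such a string into its endpoints; it assumes the usual layout (entries separated by "+", address and port separated by "-", IPv6 addresses in brackets), and the example uses made-up documentation addresses and a hypothetical alias, since the real ones are anonymized above. parse_sinful is an illustrative helper, not an HTCondor API.

```python
import ipaddress


def parse_sinful(sinful):
    """Split an HTCondor-style sinful string into (primary, params, endpoints).

    Assumed layout: <primary?key=value&flag&addrs=A-port+[B]-port>.
    The query is split by hand because urllib would decode '+' as a space.
    """
    body = sinful.strip()
    if body.startswith("<") and body.endswith(">"):
        body = body[1:-1]
    primary, _, query = body.partition("?")
    params = {}
    for item in query.split("&"):
        if item:
            key, _, value = item.partition("=")
            params[key] = value
    endpoints = []
    for entry in params.get("addrs", "").split("+"):
        if entry:
            host, _, port = entry.rpartition("-")  # IPv6 hosts contain ':' but no '-'
            endpoints.append((host.strip("[]"), int(port)))
    return primary, params, endpoints


def families(endpoints):
    """Return the sorted set of IP families present, e.g. ['IPv4', 'IPv6']."""
    return sorted({f"IPv{ipaddress.ip_address(h).version}" for h, _ in endpoints})


example = "<192.0.2.10:21917?addrs=192.0.2.10-21917+[2001:db8::10]-21917&alias=ce.example.org>"
primary, params, endpoints = parse_sinful(example)
print(primary, endpoints, families(endpoints))
```

For a return address like the one in the log, the peer advertises both families, so whether a connection (or a CCB reverse connection) arrives over IPv4 or IPv6 depends on which endpoint the other side chooses.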

I will report any further problems related to this issue.

Thank you very much.

Cheers,

Carles

-- 
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 22
Fax: +34 93 581 41 10
http://www.pic.es 
Avís - Aviso - Legal Notice: http://www.ifae.es/legal.html
07/26/17 09:45:16 (bt:ca2d:20) Failed to assert (sockProto == objectProto) at /slots/02/dir_3629365/userdir/.tmpXjdni0/BUILD/condor-8.6.4/src/condor_io/sock.cpp, line 539; aborting.
        Backtrace bt:ca2d:20 is
        /usr/lib64/libcondor_utils_8_6_4.so(_ZN4Sock12assignSocketEi+0x147) [0x7fb15e6d1d87]
        /usr/lib64/libcondor_utils_8_6_4.so(_ZN8ReliSock29exit_reverse_connecting_stateEPS_+0x2a) [0x7fb15e6e49fa]
        /usr/lib64/libcondor_utils_8_6_4.so(_ZN9CCBClient22ReverseConnectCallbackEP4Sock+0x68) [0x7fb15e6c05d8]
        /usr/lib64/libcondor_utils_8_6_4.so(_ZN9CCBClient28ReverseConnectCommandHandlerEP7ServiceiP6Stream+0x1e7) [0x7fb15e6c0b47]
        /usr/lib64/libcondor_utils_8_6_4.so(_ZN10DaemonCore18CallCommandHandlerEiP6Streambbff+0x2ce) [0x7fb15e74f6ae]
        /usr/lib64/libcondor_utils_8_6_4.so(_ZN21DaemonCommandProtocol11ExecCommandEv+0x1bc) [0x7fb15e7302bc]
        /usr/lib64/libcondor_utils_8_6_4.so(_ZN21DaemonCommandProtocol10doProtocolEv+0x138) [0x7fb15e730668]
        /usr/lib64/libcondor_utils_8_6_4.so(_ZN10DaemonCore9HandleReqEP6StreamS1_+0x74) [0x7fb15e747364]
        /usr/lib64/libcondor_utils_8_6_4.so(_ZN10DaemonCore14HandleReqAsyncEP6Stream+0xb) [0x7fb15e74755b]
        /usr/lib64/libcondor_utils_8_6_4.so(_ZN18SharedPortEndpoint13ReceiveSocketEP8ReliSockS1_+0x243) [0x7fb15e6da173]
        /usr/lib64/libcondor_utils_8_6_4.so(_ZN18SharedPortEndpoint16DoListenerAcceptEP8ReliSock+0x187) [0x7fb15e6da407]
        /usr/lib64/libcondor_utils_8_6_4.so(_ZN18SharedPortEndpoint20HandleListenerAcceptEP6Stream+0x4a) [0x7fb15e6da46a]
        /usr/lib64/libcondor_utils_8_6_4.so(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x5f1) [0x7fb15e74df41]
        /usr/lib64/libcondor_utils_8_6_4.so(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x1d) [0x7fb15e74e0cd]
        /usr/lib64/libcondor_utils_8_6_4.so(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x40) [0x7fb15e5b4270]
        /usr/lib64/libcondor_utils_8_6_4.so(_ZN10DaemonCore17CallSocketHandlerERib+0x147) [0x7fb15e747a87]
        /usr/lib64/libcondor_utils_8_6_4.so(_ZN10DaemonCore6DriverEv+0x36e0) [0x7fb15e74b6a0]
        /usr/lib64/libcondor_utils_8_6_4.so(_Z7dc_mainiPPc+0x1799) [0x7fb15e7626b9]
        /lib64/libc.so.6(__libc_start_main+0xfd) [0x3baa21ed5d]
        condor_collector() [0x40ee79]
Stack dump for process 1068373 at timestamp 1501055116 (25 frames)
/usr/lib64/libcondor_utils_8_6_4.so(dprintf_dump_stack+0x12d)[0x7fb15e639e4d]
/usr/lib64/libcondor_utils_8_6_4.so(_Z18linux_sig_coredumpi+0x40)[0x7fb15e760520]
/lib64/libpthread.so.0[0x3baa60f7e0]
/lib64/libc.so.6(gsignal+0x35)[0x3baa232625]
/lib64/libc.so.6(abort+0x175)[0x3baa233e05]
/usr/lib64/libcondor_utils_8_6_4.so(_ZN4Sock12assignSocketEi+0x155)[0x7fb15e6d1d95]
/usr/lib64/libcondor_utils_8_6_4.so(_ZN8ReliSock29exit_reverse_connecting_stateEPS_+0x2a)[0x7fb15e6e49fa]
/usr/lib64/libcondor_utils_8_6_4.so(_ZN9CCBClient22ReverseConnectCallbackEP4Sock+0x68)[0x7fb15e6c05d8]
/usr/lib64/libcondor_utils_8_6_4.so(_ZN9CCBClient28ReverseConnectCommandHandlerEP7ServiceiP6Stream+0x1e7)[0x7fb15e6c0b47]
/usr/lib64/libcondor_utils_8_6_4.so(_ZN10DaemonCore18CallCommandHandlerEiP6Streambbff+0x2ce)[0x7fb15e74f6ae]
/usr/lib64/libcondor_utils_8_6_4.so(_ZN21DaemonCommandProtocol11ExecCommandEv+0x1bc)[0x7fb15e7302bc]
/usr/lib64/libcondor_utils_8_6_4.so(_ZN21DaemonCommandProtocol10doProtocolEv+0x138)[0x7fb15e730668]
/usr/lib64/libcondor_utils_8_6_4.so(_ZN10DaemonCore9HandleReqEP6StreamS1_+0x74)[0x7fb15e747364]
/usr/lib64/libcondor_utils_8_6_4.so(_ZN10DaemonCore14HandleReqAsyncEP6Stream+0xb)[0x7fb15e74755b]
/usr/lib64/libcondor_utils_8_6_4.so(_ZN18SharedPortEndpoint13ReceiveSocketEP8ReliSockS1_+0x243)[0x7fb15e6da173]
/usr/lib64/libcondor_utils_8_6_4.so(_ZN18SharedPortEndpoint16DoListenerAcceptEP8ReliSock+0x187)[0x7fb15e6da407]
/usr/lib64/libcondor_utils_8_6_4.so(_ZN18SharedPortEndpoint20HandleListenerAcceptEP6Stream+0x4a)[0x7fb15e6da46a]
/usr/lib64/libcondor_utils_8_6_4.so(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x5f1)[0x7fb15e74df41]
/usr/lib64/libcondor_utils_8_6_4.so(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x1d)[0x7fb15e74e0cd]
/usr/lib64/libcondor_utils_8_6_4.so(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x40)[0x7fb15e5b4270]
/usr/lib64/libcondor_utils_8_6_4.so(_ZN10DaemonCore17CallSocketHandlerERib+0x147)[0x7fb15e747a87]
/usr/lib64/libcondor_utils_8_6_4.so(_ZN10DaemonCore6DriverEv+0x36e0)[0x7fb15e74b6a0]
/usr/lib64/libcondor_utils_8_6_4.so(_Z7dc_mainiPPc+0x1799)[0x7fb15e7626b9]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3baa21ed5d]
condor_collector[0x40ee79]