[HTCondor-users] Schedd dies with an exception when communicating with IPv6 startd



Hi,

Our schedd (version 8.6.1, upgraded to 8.6.3 during debugging) recently started crashing with an exception:
---------------------------------------------
07/06/17 22:46:03 (pid:3651692) (bt:a840:20) Failed to assert (sockProto == objectProto) at /slots/02/dir_3420274/userdir/.tmpTjNgI4/BUILD/condor-8.6.3/src/condor_io/sock.cpp, line 539; aborting.
       Backtrace bt:a840:20 is
       /usr/lib64/libcondor_utils_8_6_3.so(_ZN4Sock12assignSocketEi+0x147) [0x7fe979b7c0a7]
       /usr/lib64/libcondor_utils_8_6_3.so(_ZN8ReliSock29exit_reverse_connecting_stateEPS_+0x2a) [0x7fe979b8ed1a]
       /usr/lib64/libcondor_utils_8_6_3.so(_ZN9CCBClient22ReverseConnectCallbackEP4Sock+0x68) [0x7fe979b6a8f8]
       /usr/lib64/libcondor_utils_8_6_3.so(_ZN9CCBClient28ReverseConnectCommandHandlerEP7ServiceiP6Stream+0x1e7) [0x7fe979b6ae67]
       /usr/lib64/libcondor_utils_8_6_3.so(_ZN10DaemonCore18CallCommandHandlerEiP6Streambbff+0x2ce) [0x7fe979bf9a0e]
       /usr/lib64/libcondor_utils_8_6_3.so(_ZN21DaemonCommandProtocol11ExecCommandEv+0x1bc) [0x7fe979bda62c]
       /usr/lib64/libcondor_utils_8_6_3.so(_ZN21DaemonCommandProtocol10doProtocolEv+0x138) [0x7fe979bda9d8]
       /usr/lib64/libcondor_utils_8_6_3.so(_ZN10DaemonCore9HandleReqEP6StreamS1_+0x74) [0x7fe979bf16c4]
       /usr/lib64/libcondor_utils_8_6_3.so(_ZN10DaemonCore14HandleReqAsyncEP6Stream+0xb) [0x7fe979bf18bb]
       /usr/lib64/libcondor_utils_8_6_3.so(_ZN18SharedPortEndpoint13ReceiveSocketEP8ReliSockS1_+0x243) [0x7fe979b84493]
       /usr/lib64/libcondor_utils_8_6_3.so(_ZN18SharedPortEndpoint16DoListenerAcceptEP8ReliSock+0x187) [0x7fe979b84727]
       /usr/lib64/libcondor_utils_8_6_3.so(_ZN18SharedPortEndpoint20HandleListenerAcceptEP6Stream+0x4a) [0x7fe979b8478a]
       /usr/lib64/libcondor_utils_8_6_3.so(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x5f1) [0x7fe979bf82a1]
       /usr/lib64/libcondor_utils_8_6_3.so(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x1d) [0x7fe979bf842d]
       /usr/lib64/libcondor_utils_8_6_3.so(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x40) [0x7fe979a5e3d0]
       /usr/lib64/libcondor_utils_8_6_3.so(_ZN10DaemonCore17CallSocketHandlerERib+0x147) [0x7fe979bf1de7]
       /usr/lib64/libcondor_utils_8_6_3.so(_ZN10DaemonCore6DriverEv+0x36e0) [0x7fe979bf5a00]
       /usr/lib64/libcondor_utils_8_6_3.so(_Z7dc_mainiPPc+0x1799) [0x7fe979c0ca19]
       /lib64/libc.so.6(__libc_start_main+0xfd) [0x33eb61ed1d]
       condor_schedd() [0x422359]
---------------------------------------------

The schedd submits jobs to a large grid glidein pool, and the error occurs only when a job is matched to a node at one specific site. That site has IPv6-only compute nodes, while our schedd machine does not support IPv6. We are not 100% sure that the IP version is the cause, but it is consistent with the failed assertion (socket protocol != object protocol).
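
To make the suspected mismatch concrete, here is a minimal Python sketch of the kind of check we think the assertion is performing. This is not HTCondor code: the function names and the example addresses (taken from the documentation ranges) are made up for illustration, and it only assumes that the check compares the address family of a socket with the family of the peer it is meant to talk to.

```python
import ipaddress
import socket


def address_family(address: str) -> int:
    """Map a literal IP address string to AF_INET or AF_INET6."""
    version = ipaddress.ip_address(address).version
    return socket.AF_INET if version == 4 else socket.AF_INET6


def families_match(sock: socket.socket, peer_address: str) -> bool:
    """Return True if the socket's family equals the peer address's family."""
    return sock.family == address_family(peer_address)


def demo() -> dict:
    # Hypothetical scenario: an IPv4-only socket compared against an
    # IPv4 peer and an IPv6 peer (both addresses are documentation examples).
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as ipv4_sock:
        return {
            "ipv4_peer": families_match(ipv4_sock, "192.0.2.10"),
            "ipv6_peer": families_match(ipv4_sock, "2001:db8::1"),
        }


if __name__ == "__main__":
    print(demo())
```

In this sketch the IPv4 socket matches the IPv4 peer but not the IPv6 one, which is the situation we suspect arises when the IPv6-only node connects back to our schedd.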

Is this exception expected in such a case? And should the schedd crash?

Cheers,
Yutaro