
Re: [HTCondor-users] Condor_collector crashing in HTCondor-CE



Hi Brian,

After killing all jobs and restarting condor-ce again, condor_collector seems to be stable (48 hours without crashing). In any case, we will update to 8.6, as you recommend, as soon as we can.

Thank you very much.

Cheers,

Carles

On Jul 6, 2017, at 22:39, "Brian Bockelman" <bbockelm@xxxxxxxxxxx> wrote:
Hi Carles,

This crash doesn't ring a bell. It appears to be in the core HTCondor code, not within the HTCondor-CE pieces.

However, I know there were various problems with HTCondor-CE and HTCondor 8.5.x (unrelated to the problem below). Would an upgrade to the 8.6.x series be possible?

Brian

On Jul 4, 2017, at 9:49 AM, Carles Acosta <cacosta@xxxxxx> wrote:

Dear all,

We are running HTCondor-CE 2.1.2 with HTCondor 8.5.8. Everything was running fine, but after a service condor-ce restart, condor_collector crashes within a few seconds:

#########
07/04/17 14:17:02 Failed to send DC_INVALIDATE_KEY to daemon at <188.184.82.78:319
6>: SECMAN:2003:TCP connection to daemon at <188.184.82.78:31986> failed.
07/04/17 14:17:37 Got QUERY_SCHEDD_ADS
07/04/17 14:17:37 (Sending 0 ads in response to query)
07/04/17 14:17:37 Query info: matched=0; skipped=0; query_time=0.000015; send_time
0.000019; type=Scheduler; requirements={( ( Machine =?= "ce13.pic.es" ) )}; peer=<
93.109.175.111:25254>; projection={DaemonStartTime}
07/04/17 14:17:38 DC_AUTHENTICATE: attempt to open invalid session ce13:41414:1499
19515:10, failing; this session was requested by <192.168.100.130:4393> with retur
2-9693#1676832%20188.184.83.197:9693%3faddrs%3d188.184.83.197-9693#2809680&PrivNet
td046.pic.es&addrs=192.168.100.130-10019+[2001-67c-1148-301--46]-10019&noUDP>
07/04/17 14:17:38 (bt:9337:20) Failed to assert (sockProto == objectProto) at /slo
s/06/dir_3214211/userdir/.tmpE5TmSx/BUILD/condor-8.5.8/src/condor_io/sock.cpp, line
539; aborting.
    Backtrace bt:9337:20 is
    /usr/lib64/libcondor_utils_8_5_8.so(_ZN4Sock12assignSocketEi+0x147) [0x37d
0b19b7]
    /usr/lib64/libcondor_utils_8_5_8.so(_ZN8ReliSock29exit_reverse_connecting_
tateEPS_+0x2a) [0x37d3076fda]
    /usr/lib64/libcondor_utils_8_5_8.so(_ZN9CCBClient22ReverseConnectCallbackE
4Sock+0x68) [0x37d306e258]
    /usr/lib64/libcondor_utils_8_5_8.so(_ZN9CCBClient28ReverseConnectCommandHa
dlerEP7ServiceiP6Stream+0x1e7) [0x37d306e7c7]
    /usr/lib64/libcondor_utils_8_5_8.so(_ZN10DaemonCore18CallCommandHandlerEiP
Streambbff+0x2ce) [0x37d30f4a2e]
    /usr/lib64/libcondor_utils_8_5_8.so(_ZN21DaemonCommandProtocol11ExecComman
Ev+0x1bc) [0x37d30dc7fc]
    /usr/lib64/libcondor_utils_8_5_8.so(_ZN21DaemonCommandProtocol10doProtocol
v+0x138) [0x37d30dcba8]
    /usr/lib64/libcondor_utils_8_5_8.so(_ZN10DaemonCore9HandleReqEP6StreamS1_+
x74) [0x37d30ec6f4]
    /usr/lib64/libcondor_utils_8_5_8.so(_ZN10DaemonCore14HandleReqAsyncEP6Stre
m+0xb) [0x37d30ec8eb]
    /usr/lib64/libcondor_utils_8_5_8.so(_ZN18SharedPortEndpoint13ReceiveSocket
P8ReliSockS1_+0x243) [0x37d30a8aa3]
    /usr/lib64/libcondor_utils_8_5_8.so(_ZN18SharedPortEndpoint16DoListenerAcc
ptEP8ReliSock+0x187) [0x37d30a8d37]
    /usr/lib64/libcondor_utils_8_5_8.so(_ZN18SharedPortEndpoint20HandleListene
AcceptEP6Stream+0x4a) [0x37d30a8d9a]
    /usr/lib64/libcondor_utils_8_5_8.so(_ZN10DaemonCore24CallSocketHandler_wor
erEibP6Stream+0x5f1) [0x37d30f32c1]
    /usr/lib64/libcondor_utils_8_5_8.so(_ZN10DaemonCore35CallSocketHandler_wor
er_demarshallEPv+0x1d) [0x37d30f344d]
    /usr/lib64/libcondor_utils_8_5_8.so(_ZN13CondorThreads8pool_addEPFvPvES0_P
PKc+0x40) [0x37d2fd7c00]
    /usr/lib64/libcondor_utils_8_5_8.so(_ZN10DaemonCore17CallSocketHandlerERib
0x147) [0x37d30ece17]
    /usr/lib64/libcondor_utils_8_5_8.so(_ZN10DaemonCore6DriverEv+0x36d0) [0x37
30f0a20]
    /usr/lib64/libcondor_utils_8_5_8.so(_Z7dc_mainiPPc+0x1799) [0x37d3111069]
    /lib64/libc.so.6(__libc_start_main+0xfd) [0x3baa21ed5d]
    condor_collector() [0x40ee09]
Stack dump for process 9690 at timestamp 1499170658 (25 frames)
/usr/lib64/libcondor_utils_8_5_8.so(dprintf_dump_stack+0x12d)[0x37d2f91dbd]
/usr/lib64/libcondor_utils_8_5_8.so(_Z18linux_sig_coredumpi+0x40)[0x37d310eed0]
/lib64/libpthread.so.0[0x3baa60f7e0]
/lib64/libc.so.6(gsignal+0x35)[0x3baa232625]
/lib64/libc.so.6(abort+0x175)[0x3baa233e05]
/usr/lib64/libcondor_utils_8_5_8.so(_ZN4Sock12assignSocketEi+0x155)[0x37d30b19c5]
/usr/lib64/libcondor_utils_8_5_8.so(_ZN8ReliSock29exit_reverse_connecting_stateEPS
+0x2a)[0x37d3076fda]
/usr/lib64/libcondor_utils_8_5_8.so(_ZN9CCBClient22ReverseConnectCallbackEP4Sock+0
68)[0x37d306e258]
/usr/lib64/libcondor_utils_8_5_8.so(_ZN9CCBClient28ReverseConnectCommandHandlerEP7
erviceiP6Stream+0x1e7)[0x37d306e7c7]
/usr/lib64/libcondor_utils_8_5_8.so(_ZN10DaemonCore18CallCommandHandlerEiP6Streamb
ff+0x2ce)[0x37d30f4a2e]
/usr/lib64/libcondor_utils_8_5_8.so(_ZN21DaemonCommandProtocol11ExecCommandEv+0x1b
)[0x37d30dc7fc]
/usr/lib64/libcondor_utils_8_5_8.so(_ZN21DaemonCommandProtocol10doProtocolEv+0x138
[0x37d30dcba8]
/usr/lib64/libcondor_utils_8_5_8.so(_ZN10DaemonCore9HandleReqEP6StreamS1_+0x74)[0x
7d30ec6f4]
/usr/lib64/libcondor_utils_8_5_8.so(_ZN10DaemonCore14HandleReqAsyncEP6Stream+0xb)[
x37d30ec8eb]
/usr/lib64/libcondor_utils_8_5_8.so(_ZN18SharedPortEndpoint13ReceiveSocketEP8ReliS
ckS1_+0x243)[0x37d30a8aa3]
/usr/lib64/libcondor_utils_8_5_8.so(_ZN18SharedPortEndpoint16DoListenerAcceptEP8Re
iSock+0x187)[0x37d30a8d37]
/usr/lib64/libcondor_utils_8_5_8.so(_ZN18SharedPortEndpoint20HandleListenerAcceptE
6Stream+0x4a)[0x37d30a8d9a]
/usr/lib64/libcondor_utils_8_5_8.so(_ZN10DaemonCore24CallSocketHandler_workerEibP6
tream+0x5f1)[0x37d30f32c1]
/usr/lib64/libcondor_utils_8_5_8.so(_ZN10DaemonCore35CallSocketHandler_worker_dema
shallEPv+0x1d)[0x37d30f344d]
/usr/lib64/libcondor_utils_8_5_8.so(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x4
)[0x37d2fd7c00]
/usr/lib64/libcondor_utils_8_5_8.so(_ZN10DaemonCore17CallSocketHandlerERib+0x147)[
x37d30ece17]
/usr/lib64/libcondor_utils_8_5_8.so(_ZN10DaemonCore6DriverEv+0x36d0)[0x37d30f0a20]
/usr/lib64/libcondor_utils_8_5_8.so(_Z7dc_mainiPPc+0x1799)[0x37d3111069]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3baa21ed5d]
condor_collector[0x40ee09]
#########

We see many "DC_AUTHENTICATE: attempt to open invalid session" messages that we have never seen before. condor_master keeps trying to restart the collector, but it crashes again each time with the "Failed to assert" message.
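
To put a number on how often these two messages occur, the collector log can be tallied with a short script. The following is only a minimal sketch: it assumes the log is plain text with the "MM/DD/YY HH:MM:SS" timestamp prefix seen in the excerpt above, the function name count_messages and the labels in PATTERNS are hypothetical, and the sample lines are shortened copies of lines from the excerpt.

```python
import re
from collections import Counter

# Timestamp prefix as it appears in the log excerpt: "MM/DD/YY HH:MM:SS ".
TIMESTAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")

# Substrings copied from the log excerpt; the labels are illustrative only.
PATTERNS = {
    "invalid_session": "DC_AUTHENTICATE: attempt to open invalid session",
    "assert_failure": "Failed to assert",
}


def count_messages(lines):
    """Count timestamped log lines that contain each known pattern."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # wrapped continuation or backtrace line, no timestamp
        message = match.group(2)
        for label, needle in PATTERNS.items():
            if needle in message:
                counts[label] += 1
    return counts


# Shortened sample lines taken from the excerpt above.
sample = [
    "07/04/17 14:17:37 Got QUERY_SCHEDD_ADS",
    "07/04/17 14:17:38 DC_AUTHENTICATE: attempt to open invalid session ce13:41414:1499",
    "07/04/17 14:17:38 (bt:9337:20) Failed to assert (sockProto == objectProto) at /slo",
    "    Backtrace bt:9337:20 is",
]

print(dict(count_messages(sample)))
```

On a real system the same function can be given an open file object instead of the sample list; the location of the collector log depends on the installation and is not assumed here.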

The only change we detected on the CE was an update to new CAs.

So, we don't really know what's happening here. Any ideas?

Thank you in advance.

Cheers,

Carles

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

