
[HTCondor-users] (Dedicated) Schedd dies after a few hours



A couple of parallel jobs, submitted on a DedicatedScheduler machine,
keep flipping from Running to Idle (and sometimes back).
In the MasterLog on that machine, I'm seeing the following:

# grep SCHEDD MasterLog  | tail -n13
14-09-25_19:01:59 The SCHEDD (pid 17198) exited with status 4
14-09-26_15:31:06 The SCHEDD (pid 5615) exited with status 4
14-09-26_22:54:39 The SCHEDD (pid 18058) exited with status 4
14-09-26_23:21:01 The SCHEDD (pid 345) exited with status 1
14-09-27_04:36:20 The SCHEDD (pid 2251) exited with status 4
14-09-27_11:20:45 The SCHEDD (pid 4587) exited with status 4
14-09-27_13:00:33 The SCHEDD (pid 19952) exited with status 4
14-09-27_21:37:42 The SCHEDD (pid 1705) exited with status 4
14-09-28_04:23:20 The SCHEDD (pid 23185) exited with status 4
14-09-28_09:16:06 The SCHEDD (pid 32647) exited with status 4
14-09-28_12:24:38 The SCHEDD (pid 4514) exited with status 4
14-09-28_21:49:22 The SCHEDD (pid 29199) exited with status 4
14-09-28_23:06:42 The SCHEDD (pid 19109) exited with status 4
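
For what it's worth, this is roughly how I'm keeping an eye on the
flipping (the constraint just selects the parallel universe):

# condor_q -constraint 'JobUniverse == 11' \
      -format '%d.' ClusterId -format '%d ' ProcId \
      -format 'status=%d ' JobStatus -format 'since=%d\n' EnteredCurrentStatus

Before the next crash I suppose I should also turn up the schedd's
verbosity and allow core files, i.e. something like this in the local
config (if I got the knob names right), followed by a condor_reconfig:

SCHEDD_DEBUG = D_FULLDEBUG
CREATE_CORE_FILES = True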

For the last event,
# grep 19109 ScheddLog | tail -n25
14-09-28_23:06:42 (pid:19109) Calling Handler <DCMessenger::receiveMsgCallback NEGOTIATE> (6)
14-09-28_23:06:42 (pid:19109) Return from Handler <DCMessenger::receiveMsgCallback NEGOTIATE> 0.0048s
14-09-28_23:06:42 (pid:19109) Calling Handler <DCMessenger::receiveMsgCallback NEGOTIATE> (6)
14-09-28_23:06:42 (pid:19109) Return from Handler <DCMessenger::receiveMsgCallback NEGOTIATE> 0.0002s
14-09-28_23:06:42 (pid:19109) Calling Handler <SecManStartCommand::WaitForSocketCallback REQUEST_CLAIM> (8)
14-09-28_23:06:42 (pid:19109) Return from Handler <SecManStartCommand::WaitForSocketCallback REQUEST_CLAIM> 0.0003s
14-09-28_23:06:42 (pid:19109) Calling Handler <DCMessenger::receiveMsgCallback REQUEST_CLAIM> (8)
14-09-28_23:06:42 (pid:19109) Return from Handler <DCMessenger::receiveMsgCallback REQUEST_CLAIM> 0.0001s
14-09-28_23:06:42 (pid:19109) Delaying scheduling of parallel jobs because startd query time is long (287) seconds
14-09-28_23:06:42 (pid:19109) Calling Handler <DCMessenger::receiveMsgCallback NEGOTIATE> (6)
14-09-28_23:06:42 (pid:19109) Return from Handler <DCMessenger::receiveMsgCallback NEGOTIATE> 0.2058s
14-09-28_23:06:42 (pid:19109) Calling HandleReq <receive_startd_alive> (0) for command 441 (ALIVE) from unauthenticated@unmapped <10.150.88.21:43398>
14-09-28_23:06:42 (pid:19109) Return from HandleReq <receive_startd_alive> (handler: 0.000s, sec: 0.000s, payload: 0.000s)
14-09-28_23:06:42 (pid:19109) Calling Handler <DCMessenger::receiveMsgCallback NEGOTIATE> (6)
14-09-28_23:06:42 (pid:19109) Return from Handler <DCMessenger::receiveMsgCallback NEGOTIATE> 0.0002s
14-09-28_23:06:42 (pid:19109) Calling Handler <SecManStartCommand::WaitForSocketCallback REQUEST_CLAIM> (8)
14-09-28_23:06:42 (pid:19109) Return from Handler <SecManStartCommand::WaitForSocketCallback REQUEST_CLAIM> 0.0003s
14-09-28_23:06:42 (pid:19109) Calling Handler <DCMessenger::receiveMsgCallback REQUEST_CLAIM> (8)
14-09-28_23:06:42 (pid:19109) Return from Handler <DCMessenger::receiveMsgCallback REQUEST_CLAIM> 0.0001s
14-09-28_23:06:42 (pid:19109) Delaying scheduling of parallel jobs because startd query time is long (287) seconds
14-09-28_23:06:42 (pid:19109) Calling Handler <DCMessenger::receiveMsgCallback NEGOTIATE> (6)
14-09-28_23:06:42 (pid:19109) DedicatedScheduler: negotiator sent match for slot2@xxxxxxxxxxxxxxxxxx, but we've already got it, deleting old one
14-09-28_23:06:42 (pid:19109) ERROR deleting match record for cluster 772
14-09-28_23:06:42 (pid:19109) Allocation node for this cluster doesn't include the match rec
14-09-28_23:06:42 (pid:19109) ERROR "Can't delete m_rec from allocation node for cluster 772" at line 3129 in file /slots/02/dir_5236/userdir/src/condor_schedd.V6/dedicated_scheduler.cpp
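
In case it helps, this is roughly how I'd look at cluster 772 and at the
slots that advertise themselves as dedicated (attribute names from
memory, so take them with a grain of salt):

# condor_q -l 772 | egrep 'JobStatus|RemoteHost|NumJobMatches|LastMatchTime'
# condor_status -constraint 'DedicatedScheduler =!= undefined' \
      -format '%s ' Name -format '%s ' State -format '%s\n' Activity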

A quick look shows identical error conditions for the other events as
well, with one exception: pid 345, which exited with a different status
code, only got as far as a few "deleting old one" messages and then left
a stack dump instead:

Stack dump for process 345 at timestamp 1411766461 (25 frames)
/usr/lib/condor/libcondor_utils_8_2_2.so(dprintf_dump_stack+0x6d)[0x7f1bddf8774d]
/usr/lib/condor/libcondor_utils_8_2_2.so(_Z18linux_sig_coredumpi+0x2a)[0x7f1bde05192a]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf030)[0x7f1bd99e0030]
/lib/x86_64-linux-gnu/libc.so.6(+0x121080)[0x7f1bd9767080]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSs4_Rep8_M_cloneERKSaIcEm+0x72)[0x7f1bda145752]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSsC1ERKSs+0x3c)[0x7f1bda145e5c]
/usr/lib/libclassad.so.7(_ZN7classad5Value8CopyFromERKS0_+0x149)[0x7f1bddb9e869]
/usr/lib/libclassad.so.7(_ZN7classad7Literal8CopyFromERKS0_+0x26)[0x7f1bddb8d0e6]
/usr/lib/libclassad.so.7(_ZNK7classad7Literal4CopyEv+0x36)[0x7f1bddb8d136]
/usr/lib/libclassad.so.7(_ZN7classad7ClassAd8CopyFromERKS0_+0x7a)[0x7f1bddb66a1a]
/usr/lib/libclassad.so.7(_ZN7classad7ClassAdaSERKS0_+0xe)[0x7f1bddb66aee]
condor_schedd(_ZN24DedicatedScheddNegotiate18scheduler_getJobAdE7PROC_IDRN14compat_classad7ClassAdE+0x54)[0x4dc784]
condor_schedd(_ZN15ScheddNegotiate7nextJobEv+0x8e)[0x4e966e]
condor_schedd(_ZN15ScheddNegotiate15messageReceivedEP11DCMessengerP4Sock+0x245)[0x4ea165]
/usr/lib/condor/libcondor_utils_8_2_2.so(_ZN5DCMsg19callMessageReceivedEP11DCMessengerP4Sock+0x35)[0x7f1bde019b45]
/usr/lib/condor/libcondor_utils_8_2_2.so(_ZN11DCMessenger7readMsgE18classy_counted_ptrI5DCMsgEP4Sock+0x117)[0x7f1bde019d27]
/usr/lib/condor/libcondor_utils_8_2_2.so(_ZN11DCMessenger18receiveMsgCallbackEP6Stream+0xf3)[0x7f1bde01a893]
/usr/lib/condor/libcondor_utils_8_2_2.so(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x8b2)[0x7f1bde03f692]
/usr/lib/condor/libcondor_utils_8_2_2.so(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x1d)[0x7f1bde03f7fd]
/usr/lib/condor/libcondor_utils_8_2_2.so(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x35)[0x7f1bddf32f95]
/usr/lib/condor/libcondor_utils_8_2_2.so(_ZN10DaemonCore17CallSocketHandlerERib+0x142)[0x7f1bde03aca2]
/usr/lib/condor/libcondor_utils_8_2_2.so(_ZN10DaemonCore6DriverEv+0x294d)[0x7f1bde04835d]
/usr/lib/condor/libcondor_utils_8_2_2.so(_Z7dc_mainiPPc+0x12fd)[0x7f1bde05582d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7f1bd9664eed]
condor_schedd[0x451b31]
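
If a proper backtrace would help more than the dprintf stack dump, I can
try to catch a core from the next crash: with CREATE_CORE_FILES enabled
the core should land in the schedd's LOG directory (the exact file name
depends on the kernel's core_pattern), and something like

# gdb /usr/sbin/condor_schedd $(condor_config_val LOG)/core \
      -batch -ex 'thread apply all bt full'

should print a symbolized trace, provided the condor debug symbols are
installed (the condor_schedd path above is just a guess for a Debian
install).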

To my untrained eye, the frequent events look like accidental mis(re)matching,
but I'm unable to dig any further without assistance. Any suggestions?
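
If it matters, the dedicated-scheduler wiring on both sides could be
cross-checked with something like

# condor_config_val -dump | grep -i dedicated

on this (schedd) machine and

# condor_config_val DedicatedScheduler

on one of the execute nodes, in case someone spots an obvious
misconfiguration there.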

- S

--
Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}