
[HTCondor-users] schedd 100% CPU utilization



Hello Experts,

After updating our setup to 9.0.17, we have noticed that the schedd sometimes uses 100% CPU. We believe this drives up RecentDaemonCoreDutyCycle, which eventually prevents any job matchmaking from the host.

When this happens, the following message appears repeatedly in the schedd log:

01/09/24 10:01:25 (pid:4135823) Number of Active Workers 5
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 6
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 7
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 8
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 9
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 4
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 5
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 6
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 7
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 8
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 9
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 10
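
To put a number on how fast these messages arrive, something like the following could tally them per second and report the highest worker count seen. This is only a rough sketch that assumes the log line format shown in the excerpt above; the function name `summarize_worker_messages` and the command-line usage are hypothetical helpers, not part of HTCondor.

```python
import re
import sys
from collections import Counter

# Assumed line format, taken from the excerpt above:
#   MM/DD/YY HH:MM:SS (pid:N) Number of Active Workers K
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) "
    r"Number of Active Workers (\d+)$"
)


def summarize_worker_messages(lines):
    """Return ({timestamp: message count}, highest worker count seen)."""
    per_second = Counter()
    peak = 0
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip unrelated log lines
        timestamp, _pid, workers = match.groups()
        per_second[timestamp] += 1
        peak = max(peak, int(workers))
    return dict(per_second), peak


if __name__ == "__main__":
    # Hypothetical usage: pass the path of the schedd log as the first argument.
    with open(sys.argv[1]) as log:
        counts, peak = summarize_worker_messages(log)
    for timestamp, count in sorted(counts.items()):
        print(timestamp, count)
    print("peak active workers:", peak)
```

Many messages within a single second, as in the excerpt, would suggest that query workers are being forked in rapid succession.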

Running pstack shows:

# pstack 4135823
#0  0x00007f669c646b12 in fork () from /usr/lib64/libc.so.6
#1  0x00007f669e7724f9 in ForkWorker::Fork() () from /usr/lib64/libcondor_utils_9_0_17.so
#2  0x00007f669e77286a in ForkWork::NewJob() () from /usr/lib64/libcondor_utils_9_0_17.so
#3  0x000055bcc1ab6350 in Scheduler::command_query_job_ads(int, Stream*) ()
#4  0x00007f669e8d51ba in DaemonCore::CallCommandHandler(int, Stream*, bool, bool, float, float) () from /usr/lib64/libcondor_utils_9_0_17.so
#5  0x00007f669e8c2f9a in DaemonCommandProtocol::ExecCommand() () from /usr/lib64/libcondor_utils_9_0_17.so
#6  0x00007f669e8c6495 in DaemonCommandProtocol::doProtocol() () from /usr/lib64/libcondor_utils_9_0_17.so
#7  0x00007f669e8d0e7b in DaemonCore::HandleReq(Stream*, Stream*) () from /usr/lib64/libcondor_utils_9_0_17.so
#8  0x00007f669e8d100b in DaemonCore::HandleReqAsync(Stream*) () from /usr/lib64/libcondor_utils_9_0_17.so
#9  0x00007f669e880c67 in SharedPortEndpoint::ReceiveSocket(ReliSock*, ReliSock*) () from /usr/lib64/libcondor_utils_9_0_17.so
#10 0x00007f669e880e8b in SharedPortEndpoint::DoListenerAccept(ReliSock*) () from /usr/lib64/libcondor_utils_9_0_17.so
#11 0x00007f669e880f02 in SharedPortEndpoint::HandleListenerAccept(Stream*) () from /usr/lib64/libcondor_utils_9_0_17.so
#12 0x00007f669e8d42f0 in DaemonCore::CallSocketHandler_worker(int, bool, Stream*) () from /usr/lib64/libcondor_utils_9_0_17.so
#13 0x00007f669e8d438d in DaemonCore::CallSocketHandler_worker_demarshall(void*) () from /usr/lib64/libcondor_utils_9_0_17.so
#14 0x00007f669e708d05 in CondorThreads::pool_add(void (*)(void*), void*, int*, char const*) () from /usr/lib64/libcondor_utils_9_0_17.so
#15 0x00007f669e8d1727 in DaemonCore::CallSocketHandler(int&, bool) () from /usr/lib64/libcondor_utils_9_0_17.so
#16 0x00007f669e8da4ae in DaemonCore::Driver() () from /usr/lib64/libcondor_utils_9_0_17.so
#17 0x00007f669e8eef22 in dc_main(int, char**) () from /usr/lib64/libcondor_utils_9_0_17.so
#18 0x00007f669c5a3555 in __libc_start_main () from /usr/lib64/libc.so.6
#19 0x000055bcc1a3f9ed in _start ()

After a few minutes:

# pstack 4135823
#0  0x00007f669c646b12 in fork () from /usr/lib64/libc.so.6
#1  0x00007f669e7724f9 in ForkWorker::Fork() () from /usr/lib64/libcondor_utils_9_0_17.so
#2  0x00007f669e77286a in ForkWork::NewJob() () from /usr/lib64/libcondor_utils_9_0_17.so
#3  0x000055bcc1ab6350 in Scheduler::command_query_job_ads(int, Stream*) ()
#4  0x00007f669e8d51ba in DaemonCore::CallCommandHandler(int, Stream*, bool, bool, float, float) () from /usr/lib64/libcondor_utils_9_0_17.so
#5  0x00007f669e8d5a5c in DaemonCore::HandleReqPayloadReady(Stream*) () from /usr/lib64/libcondor_utils_9_0_17.so
#6  0x00007f669e8d42f0 in DaemonCore::CallSocketHandler_worker(int, bool, Stream*) () from /usr/lib64/libcondor_utils_9_0_17.so
#7  0x00007f669e8d438d in DaemonCore::CallSocketHandler_worker_demarshall(void*) () from /usr/lib64/libcondor_utils_9_0_17.so
#8  0x00007f669e708d05 in CondorThreads::pool_add(void (*)(void*), void*, int*, char const*) () from /usr/lib64/libcondor_utils_9_0_17.so
#9  0x00007f669e8d1727 in DaemonCore::CallSocketHandler(int&, bool) () from /usr/lib64/libcondor_utils_9_0_17.so
#10 0x00007f669e8da4ae in DaemonCore::Driver() () from /usr/lib64/libcondor_utils_9_0_17.so
#11 0x00007f669e8eef22 in dc_main(int, char**) () from /usr/lib64/libcondor_utils_9_0_17.so
#12 0x00007f669c5a3555 in __libc_start_main () from /usr/lib64/libc.so.6
#13 0x000055bcc1a3f9ed in _start ()

We captured a core dump of the condor_schedd process while it was showing 99% CPU utilization.

(gdb) bt
#0  0x00007f669c646b12 in ?? () from /usr/lib64/libc.so.6
#1  0x00007f669e7724f9 in ForkWorker::Fork (this=this@entry=0x55bcd438aac0) at /usr/src/debug/condor-9.0.17/src/condor_utils/forkwork.cpp:53
#2  0x00007f669e77286a in ForkWork::NewJob (this=0x55bcc1d3f660 <schedd_forker>) at /usr/src/debug/condor-9.0.17/src/condor_utils/forkwork.cpp:198
#3  0x000055bcc1ab6350 in Scheduler::command_query_job_ads (this=0x55bcc1d87fe0 <scheduler>, cmd=<optimized out>, stream=0x55bcdab33d40)
  at /usr/src/debug/condor-9.0.17/src/condor_schedd.V6/schedd.cpp:2537
#4  0x00007f669e8d51ba in DaemonCore::CallCommandHandler (this=this@entry=0x55bcc293aee0, req=req@entry=516, stream=stream@entry=0x55bcdab33d40,
  delete_stream=delete_stream@entry=false, check_payload=check_payload@entry=false, time_spent_on_sec=time_spent_on_sec@entry=0.000203999996,
  time_spent_waiting_for_payload=time_spent_waiting_for_payload@entry=0.0577820018) at /usr/src/debug/condor-9.0.17/src/condor_daemon_core.V6/daemon_core.cpp:4468
#5  0x00007f669e8d5a5c in DaemonCore::HandleReqPayloadReady (this=this@entry=0x55bcc293aee0, stream=0x55bcdab33d40)
  at /usr/src/debug/condor-9.0.17/src/condor_daemon_core.V6/daemon_core.cpp:4355
#6  0x00007f669e8d42f0 in DaemonCore::CallSocketHandler_worker (this=0x55bcc293aee0, i=12, default_to_HandleCommand=<optimized out>, asock=<optimized out>)
  at /usr/src/debug/condor-9.0.17/src/condor_daemon_core.V6/daemon_core.cpp:4235
#7  0x00007f669e8d438d in DaemonCore::CallSocketHandler_worker_demarshall (arg=0x55bcc6ba7260)
  at /usr/src/debug/condor-9.0.17/src/condor_daemon_core.V6/daemon_core.cpp:4194
#8  0x00007f669e708d05 in CondorThreads::pool_add (routine=routine@entry=0x7f669e8d4370 <DaemonCore::CallSocketHandler_worker_demarshall(void*)>,
  arg=arg@entry=0x55bcc6ba7260, tid=<optimized out>, descrip=<optimized out>) at /usr/src/debug/condor-9.0.17/src/condor_utils/condor_threads.cpp:1109
#9  0x00007f669e8d1727 in DaemonCore::CallSocketHandler (this=this@entry=0x55bcc293aee0, i=@0x7fffcf1273a0: 12,
  default_to_HandleCommand=default_to_HandleCommand@entry=true) at /usr/src/debug/condor-9.0.17/src/condor_daemon_core.V6/daemon_core.cpp:4182
#10 0x00007f669e8da4ae in DaemonCore::Driver (this=0x55bcc293aee0) at /usr/src/debug/condor-9.0.17/src/condor_daemon_core.V6/daemon_core.cpp:4019
#11 0x00007f669e8eef22 in dc_main (argc=1, argv=0x7fffcf127cb8) at /usr/src/debug/condor-9.0.17/src/condor_daemon_core.V6/daemon_core_main.cpp:4406
#12 0x00007f669c5a3555 in PrioRecArray () from /usr/lib64/libc.so.6
#13 0x000055bcc1a3f9ed in _start ()


How we fix it:

- Restart the condor service.
- Wait for the schedd to calm down. Once CPU utilization comes down and the RecentDaemonCoreDutyCycle value improves, it starts matchmaking jobs again.

Queries:

- Is this a known issue? I couldn't find anything in the release notes.
- What else can we capture to find the root cause of the issue?


Thanks & Regards,
Vikrant Aggarwal