Re: [HTCondor-users] schedd 100% CPU utilization

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

On Tue, Jan 9, 2024 at 3:33âPM Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:

ha,

found the part in the docs:

SCHEDD_QUERY_WORKERS
This specifies the maximum number of concurrent sub-processes that the condor_schedd will spawn to handle queries. The setting is ignored in Windows. In Unix, the default is 8. If the limit is reached, the next query will be handled in the condor_schedd âs main process.

You can set it higher for a trial but it is a bit like a memory leak - it is difficult to fix it with more memory ;)

Best
christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

Von: "Christoph Beyer" <christoph.beyer@xxxxxxx>
An: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Gesendet: Dienstag, 9. Januar 2024 21:29:14
Betreff: Re: [HTCondor-users] schedd 100% CPU utilization

Hi,

from my experience this is in 98% of the cases a high load due to 'condor_q' requests.

Some people have the very bad habit to run 'watch -n 2 condor_q' and also some submit frameworks are heavy in these aspects.

You can adjust the number of workers to be spawned somewhere on the sched but it is worthwhile to check your clients and send out some nasty e-mails ;)

Best
christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

Von: "Vikrant Aggarwal" <ervikrant06@xxxxxxxxx>
An: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Gesendet: Dienstag, 9. Januar 2024 17:56:12
Betreff: [HTCondor-users] Âschedd 100% CPU utilization

HelloÂExperts,
After updating the setup to 9.0.17 we have noticed that sometimes schedd is using 100% cpu which we believe is causing high value ofÂRecentDaemonCoreDutyCycle eventually not allowing any job match making from the host.Â

When this happens, the following message appears in sched logs repeatedly.

01/09/24 10:01:25 (pid:4135823) Number of Active Workers 5
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 6
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 7
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 8
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 9
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 4
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 5
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 6
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 7
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 8
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 9
01/09/24 10:01:25 (pid:4135823) Number of Active Workers 10

Running pstack showing

# pstack 4135823
#0 Â0x00007f669c646b12 in fork () from /usr/lib64/libc.so.6
#1 Â0x00007f669e7724f9 in ForkWorker::Fork() () from /usr/lib64/libcondor_utils_9_0_17.so
#2 Â0x00007f669e77286a in ForkWork::NewJob() () from /usr/lib64/libcondor_utils_9_0_17.so
#3 Â0x000055bcc1ab6350 in Scheduler::command_query_job_ads(int, Stream*) ()
#4 Â0x00007f669e8d51ba in DaemonCore::CallCommandHandler(int, Stream*, bool, bool, float, float) () from /usr/lib64/libcondor_utils_9_0_17.so
#5 Â0x00007f669e8c2f9a in DaemonCommandProtocol::ExecCommand() () from /usr/lib64/libcondor_utils_9_0_17.so
#6 Â0x00007f669e8c6495 in DaemonCommandProtocol::doProtocol() () from /usr/lib64/libcondor_utils_9_0_17.so
#7 Â0x00007f669e8d0e7b in DaemonCore::HandleReq(Stream*, Stream*) () from /usr/lib64/libcondor_utils_9_0_17.so
#8 Â0x00007f669e8d100b in DaemonCore::HandleReqAsync(Stream*) () from /usr/lib64/libcondor_utils_9_0_17.so
#9 Â0x00007f669e880c67 in SharedPortEndpoint::ReceiveSocket(ReliSock*, ReliSock*) () from /usr/lib64/libcondor_utils_9_0_17.so
#10 0x00007f669e880e8b in SharedPortEndpoint::DoListenerAccept(ReliSock*) () from /usr/lib64/libcondor_utils_9_0_17.so
#11 0x00007f669e880f02 in SharedPortEndpoint::HandleListenerAccept(Stream*) () from /usr/lib64/libcondor_utils_9_0_17.so
#12 0x00007f669e8d42f0 in DaemonCore::CallSocketHandler_worker(int, bool, Stream*) () from /usr/lib64/libcondor_utils_9_0_17.so
#13 0x00007f669e8d438d in DaemonCore::CallSocketHandler_worker_demarshall(void*) () from /usr/lib64/libcondor_utils_9_0_17.so
#14 0x00007f669e708d05 in CondorThreads::pool_add(void (*)(void*), void*, int*, char const*) () from /usr/lib64/libcondor_utils_9_0_17.so
#15 0x00007f669e8d1727 in DaemonCore::CallSocketHandler(int&, bool) () from /usr/lib64/libcondor_utils_9_0_17.so
#16 0x00007f669e8da4ae in DaemonCore::Driver() () from /usr/lib64/libcondor_utils_9_0_17.so
#17 0x00007f669e8eef22 in dc_main(int, char**) () from /usr/lib64/libcondor_utils_9_0_17.so
#18 0x00007f669c5a3555 in __libc_start_main () from /usr/lib64/libc.so.6
#19 0x000055bcc1a3f9ed in _start ()

After few mins

# pstack 4135823
#0 Â0x00007f669c646b12 in fork () from /usr/lib64/libc.so.6
#1 Â0x00007f669e7724f9 in ForkWorker::Fork() () from /usr/lib64/libcondor_utils_9_0_17.so
#2 Â0x00007f669e77286a in ForkWork::NewJob() () from /usr/lib64/libcondor_utils_9_0_17.so
#3 Â0x000055bcc1ab6350 in Scheduler::command_query_job_ads(int, Stream*) ()
#4 Â0x00007f669e8d51ba in DaemonCore::CallCommandHandler(int, Stream*, bool, bool, float, float) () from /usr/lib64/libcondor_utils_9_0_17.so
#5 Â0x00007f669e8d5a5c in DaemonCore::HandleReqPayloadReady(Stream*) () from /usr/lib64/libcondor_utils_9_0_17.so
#6 Â0x00007f669e8d42f0 in DaemonCore::CallSocketHandler_worker(int, bool, Stream*) () from /usr/lib64/libcondor_utils_9_0_17.so
#7 Â0x00007f669e8d438d in DaemonCore::CallSocketHandler_worker_demarshall(void*) () from /usr/lib64/libcondor_utils_9_0_17.so
#8 Â0x00007f669e708d05 in CondorThreads::pool_add(void (*)(void*), void*, int*, char const*) () from /usr/lib64/libcondor_utils_9_0_17.so
#9 Â0x00007f669e8d1727 in DaemonCore::CallSocketHandler(int&, bool) () from /usr/lib64/libcondor_utils_9_0_17.so
#10 0x00007f669e8da4ae in DaemonCore::Driver() () from /usr/lib64/libcondor_utils_9_0_17.so
#11 0x00007f669e8eef22 in dc_main(int, char**) () from /usr/lib64/libcondor_utils_9_0_17.so
#12 0x00007f669c5a3555 in __libc_start_main () from /usr/lib64/libc.so.6
#13 0x000055bcc1a3f9ed in _start ()

We captured the core dump of condor_schedd process when it was showing 99% cpu utilization.Â

(gdb) bt
#0 Â0x00007f669c646b12 in ?? () from /usr/lib64/libc.so.6
#1 Â0x00007f669e7724f9 in ForkWorker::Fork (this=this@entry=0x55bcd438aac0) at /usr/src/debug/condor-9.0.17/src/condor_utils/forkwork.cpp:53
#2 Â0x00007f669e77286a in ForkWork::NewJob (this=0x55bcc1d3f660 <schedd_forker>) at /usr/src/debug/condor-9.0.17/src/condor_utils/forkwork.cpp:198
#3 Â0x000055bcc1ab6350 in Scheduler::command_query_job_ads (this=0x55bcc1d87fe0 <scheduler>, cmd=<optimized out>, stream=0x55bcdab33d40)
Â Â at /usr/src/debug/condor-9.0.17/src/condor_schedd.V6/schedd.cpp:2537
#4 Â0x00007f669e8d51ba in DaemonCore::CallCommandHandler (this=this@entry=0x55bcc293aee0, req=req@entry=516, stream=stream@entry=0x55bcdab33d40,
Â Â delete_stream=delete_stream@entry=false, check_payload=check_payload@entry=false, time_spent_on_sec=time_spent_on_sec@entry=0.000203999996,
Â Â time_spent_waiting_for_payload=time_spent_waiting_for_payload@entry=0.0577820018) at /usr/src/debug/condor-9.0.17/src/condor_daemon_core.V6/daemon_core.cpp:4468
#5 Â0x00007f669e8d5a5c in DaemonCore::HandleReqPayloadReady (this=this@entry=0x55bcc293aee0, stream=0x55bcdab33d40)
Â Â at /usr/src/debug/condor-9.0.17/src/condor_daemon_core.V6/daemon_core.cpp:4355
#6 Â0x00007f669e8d42f0 in DaemonCore::CallSocketHandler_worker (this=0x55bcc293aee0, i=12, default_to_HandleCommand=<optimized out>, asock=<optimized out>)
Â Â at /usr/src/debug/condor-9.0.17/src/condor_daemon_core.V6/daemon_core.cpp:4235
#7 Â0x00007f669e8d438d in DaemonCore::CallSocketHandler_worker_demarshall (arg=0x55bcc6ba7260)
Â Â at /usr/src/debug/condor-9.0.17/src/condor_daemon_core.V6/daemon_core.cpp:4194
#8 Â0x00007f669e708d05 in CondorThreads::pool_add (routine=routine@entry=0x7f669e8d4370 <DaemonCore::CallSocketHandler_worker_demarshall(void*)>,
Â Â arg=arg@entry=0x55bcc6ba7260, tid=<optimized out>, descrip=<optimized out>) at /usr/src/debug/condor-9.0.17/src/condor_utils/condor_threads.cpp:1109
#9 Â0x00007f669e8d1727 in DaemonCore::CallSocketHandler (this=this@entry=0x55bcc293aee0, i=@0x7fffcf1273a0: 12,
Â Â default_to_HandleCommand=default_to_HandleCommand@entry=true) at /usr/src/debug/condor-9.0.17/src/condor_daemon_core.V6/daemon_core.cpp:4182
#10 0x00007f669e8da4ae in DaemonCore::Driver (this=0x55bcc293aee0) at /usr/src/debug/condor-9.0.17/src/condor_daemon_core.V6/daemon_core.cpp:4019
#11 0x00007f669e8eef22 in dc_main (argc=1, argv=0x7fffcf127cb8) at /usr/src/debug/condor-9.0.17/src/condor_daemon_core.V6/daemon_core_main.cpp:4406
#12 0x00007f669c5a3555 in PrioRecArray () from /usr/lib64/libc.so.6
#13 0x000055bcc1a3f9ed in _start ()

How We fix it:

- Restart the condor service.
- Wait for the sched to calm down, when cpu utilization comes down, daemoncoredutycycle value improves, it starts matchmaking the jobs again.Â

Queries:

- Is this a known issue? I couldn't find anything from release notes.Â
- What else can we capture to find the root cause of the issue?Â

Thanks & Regards,

Vikrant Aggarwal

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

Mailing List Archives

Public Access

Re: [HTCondor-users] schedd 100% CPU utilization