
Re: [HTCondor-users] Access point scale



Thanks Greg. 
If I recall correctly, it's version 9.x
I will update with my findings. 

Thank you very much.
David 





From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Friday, January 26, 2024 6:31:19 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Greg Thain <gthain@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Access point scale

On 1/26/24 04:36, Dudu Handelman wrote:


Hi David:

At some point, we'll just need to profile the schedd with
bpftrace/strace to know for certain what is going on.  Without that,
though, here are a couple of issues to check, which you probably already
know about.  The first indication that the schedd is overloaded is that
its RecentDaemonCoreDutyCycle is approaching 1.0.  I assume your schedd
is in this neighborhood?
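
For anyone who wants to check the duty cycle from a script, here is a
minimal sketch, not an official tool.  It assumes condor_status is on
the PATH and uses its autoformat option to print each schedd's Name and
RecentDaemonCoreDutyCycle.  The function names and the 0.95 threshold
are illustrative choices, not HTCondor defaults.

```python
import subprocess


def parse_duty_cycles(text):
    """Parse lines of '<schedd name> <duty cycle>' into a dict."""
    cycles = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, value = parts
        try:
            cycles[name] = float(value)
        except ValueError:
            continue  # attribute missing, e.g. printed as "undefined"
    return cycles


def overloaded(cycles, threshold=0.95):
    """Return schedd names at or above the threshold (assumed cutoff)."""
    return sorted(name for name, value in cycles.items() if value >= threshold)


def query_duty_cycles():
    """Ask the collector for every schedd's recent duty cycle."""
    result = subprocess.run(
        ["condor_status", "-schedd", "-af", "Name", "RecentDaemonCoreDutyCycle"],
        check=True, capture_output=True, text=True,
    )
    return parse_duty_cycles(result.stdout)
```

Calling overloaded(query_duty_cycles()) on a machine that can reach the
collector would list the schedds that look saturated.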

o) As you mentioned, the most important file to put on ssd/nvme is the
job_queue.log, but the schedd also writes the user job event logs to
disk, so you might want to double check that those event logs are not
on a slow disk.

o) Make sure the schedd and shadow do not have D_FULLDEBUG or other very
verbose flags in their DEBUG levels.

o) What version of HTCondor are you running?  23.2 has an improvement in
the speed of the schedd when running with a large fd limit:
https://github.com/htcondor/htcondor/pull/1907

o) When there are a lot of jobs in the queue, condor_q can eat a lot of
time out of the schedd.  condor_watch_q can show much of the same
information as condor_q, but without bothering the schedd.
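
The debug-level and version items above can be checked with a small
script.  This is only a sketch under stated assumptions: feed it the
values printed by "condor_config_val SCHEDD_DEBUG" and
"condor_config_val SHADOW_DEBUG", and the output of condor_version.
The helper names are hypothetical, the set of flags treated as "very
verbose" (D_FULLDEBUG, D_ALL) is my own choice, and the handling of
":N" suffixes and of a leading "-" as "flag turned off" is an
assumption about the flag syntax.

```python
import re


def verbose_flags(debug_value, verbose=("D_FULLDEBUG", "D_ALL")):
    """Return the very verbose flags present in a *_DEBUG setting."""
    found = []
    for token in re.split(r"[,\s]+", debug_value.strip()):
        if token.startswith("-"):
            continue  # assumed to mean the flag is switched off
        flag = token.split(":")[0]  # drop an optional ":N" level suffix
        if flag in verbose and flag not in found:
            found.append(flag)
    return found


def parse_condor_version(text):
    """Extract (major, minor, patch) from condor_version output, or None."""
    match = re.search(r"CondorVersion:\s*(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def has_fd_speedup(version):
    """True if the version is 23.2 or later, per the note above."""
    return version is not None and version[:2] >= (23, 2)
```

An empty list from verbose_flags() for both daemons, and True from
has_fd_speedup(parse_condor_version(...)), would mean those two items
are already covered.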

-greg
