
Re: [HTCondor-users] Access point scale



On 1/26/24 04:36, Dudu Handelman wrote:


Hi David:

At some point, we'll just need to profile the schedd with bpftrace/strace to know for certain what is going on. Without that, though, here are a couple of things to check; you probably know about them already. The first indication that the schedd is overloaded is that its RecentDaemonCoreDutyCycle is approaching 1.0. I assume your schedd is in this neighborhood?
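In case it is useful, here is a rough sketch of how one might watch the duty cycle from a script. It shells out to condor_status with -af (autoformat) and parses the result. The 0.95 threshold and the helper names are my own assumptions for illustration, not anything HTCondor defines.

```python
import subprocess


def parse_duty_cycles(text):
    """Parse 'Name value' lines as printed by condor_status -af."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, value = parts
        try:
            result[name] = float(value)
        except ValueError:
            # The attribute can print as "undefined"; skip those schedds.
            continue
    return result


def overloaded(duty_cycles, threshold=0.95):
    """Return schedd names at or above the (assumed) threshold."""
    return sorted(n for n, dc in duty_cycles.items() if dc >= threshold)


def query_schedds():
    """Ask the collector for every schedd's recent duty cycle."""
    out = subprocess.run(
        ["condor_status", "-schedd", "-af", "Name", "RecentDaemonCoreDutyCycle"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_duty_cycles(out)

# Usage on a machine with the HTCondor tools installed:
#   print(overloaded(query_schedds()))
```

Anything this reports is only a hint that the schedd is busy; it does not say why, which is where the profiling comes in.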

o) As you mentioned, the most important file to put on SSD/NVMe is the job_queue.log, but the schedd also writes the user event log to disk, so you might want to double-check that the job event logs are not on a slow disk.

o) Make sure the schedd and shadow do not have D_FULLDEBUG or other very verbose flags in their DEBUG levels.

o) What version of HTCondor are you running? 23.2 has an improvement in the speed of the schedd when running with a large fd limit: https://github.com/htcondor/htcondor/pull/1907

o) When there are a lot of jobs in the queue, condor_q can eat a lot of the schedd's time. condor_watch_q shows much of the same information as condor_q, but it reads the job event logs instead, so it does not bother the schedd.
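A couple of the checks above can be scripted. This is only a sketch: it reads knobs with condor_config_val and flags D_FULLDEBUG. Which other flags count as "very verbose" is left to you (the verbose tuple is my assumption), and the helper names are made up for illustration.

```python
import re
import subprocess


def verbose_flags(debug_value, verbose=("D_FULLDEBUG",)):
    """Return the flags in a *_DEBUG value that appear in `verbose`."""
    # Flags may be separated by whitespace or commas, e.g. "D_CAT,D_ALWAYS:2".
    tokens = re.split(r"[,\s]+", debug_value.strip())
    names = [t.split(":", 1)[0].upper() for t in tokens if t]
    return [n for n in names if n in verbose]


def config_val(knob):
    """Read one knob; raises CalledProcessError if the command fails."""
    return subprocess.run(
        ["condor_config_val", knob],
        check=True, capture_output=True, text=True,
    ).stdout.strip()

# Usage on the access point:
#   for knob in ("SCHEDD_DEBUG", "SHADOW_DEBUG"):
#       print(knob, verbose_flags(config_val(knob)))
#   print(config_val("JOB_QUEUE_LOG"))  # where job_queue.log lives
```

Knowing the JOB_QUEUE_LOG path lets you confirm with your usual tools that it sits on the fast disk. The job event logs are wherever each submit file points them, so those need checking per workflow.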

-greg