Re: [HTCondor-users] Access point scale


On 1/26/24 04:36, Dudu Handelman wrote:

Hi David:

At some point, we'll just need to profile the schedd with
bpftrace/strace to know for certain what is going on. Short of that,
though, here are a couple of issues to check, which you probably already
know about. The first indication that the schedd is overloaded is that
its RecentDaemonCoreDutyCycle is approaching 1.0. I assume your schedd
is in this neighborhood?
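
For reference, a minimal sketch of how the duty cycle could be checked from a script. It assumes condor_status is on the PATH and that the autoformat output is one "Name value" pair per line; the helper names and the 0.95 threshold are illustrative choices, not HTCondor defaults.

```python
import subprocess


def parse_duty_cycles(text):
    """Parse 'name value' lines into {name: float}, skipping unparsable rows."""
    cycles = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, value = parts
        try:
            cycles[name] = float(value)
        except ValueError:
            continue  # e.g. "undefined" when the attribute is missing
    return cycles


def busy_schedds(cycles, threshold=0.95):
    """Return schedd names at or above the threshold (0.95 is an assumed cutoff)."""
    return sorted(name for name, duty in cycles.items() if duty >= threshold)


def query_duty_cycles():
    """Ask the collector for each schedd's RecentDaemonCoreDutyCycle."""
    out = subprocess.run(
        ["condor_status", "-schedd", "-af", "Name", "RecentDaemonCoreDutyCycle"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_duty_cycles(out)
```

On an access point, busy_schedds(query_duty_cycles()) would list the schedds that look saturated by this measure.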

o) As you mentioned, the most important file to put on SSD/NVMe is the
job_queue.log, but the schedd also writes the user event logs to disk, so
you might want to double-check that the job event logs are not on a slow
disk.

o) Make sure the schedd and shadow do not have D_FULLDEBUG or other very
verbose flags in their DEBUG levels.

o) What version of HTCondor are you running? 23.2 has an improvement in
the speed of the schedd when running with a large fd limit:
https://github.com/htcondor/htcondor/pull/1907

o) When there are a lot of jobs in the queue, condor_q queries can take
up a lot of the schedd's time. condor_watch_q shows much of the same
information as condor_q, but without bothering the schedd, since it
reads the job event logs instead.
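
As a rough illustration of the D_FULLDEBUG item above, here is a small sketch that reads SCHEDD_DEBUG and SHADOW_DEBUG with condor_config_val and reports any very verbose flags. The set of flags treated as "very verbose" is an assumption for this example, and the helper names are hypothetical.

```python
import re
import subprocess

# Flags treated as "very verbose" here; this short list is an assumption.
VERBOSE_FLAGS = {"D_FULLDEBUG", "D_ALL"}


def verbose_flags(debug_value):
    """Return the verbose flags present in a *_DEBUG configuration value."""
    tokens = re.split(r"[\s,]+", debug_value.strip())
    return sorted({t.upper() for t in tokens if t.upper() in VERBOSE_FLAGS})


def read_knob(name):
    """Return the configured value of a knob, or '' if it is not defined."""
    result = subprocess.run(
        ["condor_config_val", name], capture_output=True, text=True
    )
    return result.stdout.strip() if result.returncode == 0 else ""


def check_debug_levels(knobs=("SCHEDD_DEBUG", "SHADOW_DEBUG")):
    """Map each knob to the verbose flags found in its current value."""
    return {knob: verbose_flags(read_knob(knob)) for knob in knobs}
```

Running check_debug_levels() on the access point should return empty lists for both knobs if neither daemon is logging at a very verbose level.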

-greg

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
