Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] Access point scale

Date: Fri, 26 Jan 2024 10:30:23 -0600
From: Greg Thain <gthain@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Access point scale

On 1/26/24 04:36, Dudu Handelman wrote:


Hi David:

At some point, we'll just need to profile the schedd with bpftrace/strace to know for certain what is going on. Without that, though, here are a couple of issues to check; you probably know about some of them. The first indication that the schedd is overloaded is that RecentDaemonCoreDutyCycle is approaching 1.0. I assume your schedd is in this neighborhood?
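As a rough illustration of the duty-cycle check, here is a minimal Python sketch that parses the output of `condor_status -schedd -af Name RecentDaemonCoreDutyCycle` and lists the schedds at or above a threshold. The 0.95 cutoff and the helper names (`parse_duty_cycles`, `overloaded`, `query_schedds`) are assumptions made for this example, not anything from HTCondor itself.

```python
import subprocess

# Assumed cutoff for "approaching 1.0"; the message above gives no exact number.
BUSY_THRESHOLD = 0.95


def parse_duty_cycles(text):
    """Turn 'name value' lines into {name: float}, skipping non-numeric values."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, value = parts
        try:
            result[name] = float(value)
        except ValueError:
            continue  # e.g. the attribute printed as "undefined"
    return result


def overloaded(duty_cycles, threshold=BUSY_THRESHOLD):
    """Return the sorted names of schedds whose duty cycle is at or above threshold."""
    return sorted(name for name, value in duty_cycles.items() if value >= threshold)


def query_schedds():
    """Ask the collector for each schedd's name and recent duty cycle."""
    out = subprocess.run(
        ["condor_status", "-schedd", "-af", "Name", "RecentDaemonCoreDutyCycle"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_duty_cycles(out)
```

On a host with the HTCondor tools installed, `overloaded(query_schedds())` would list the access points worth a closer look.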

o) As you mentioned, the most important file to put on ssd/nvme is the job_queue.log, but the schedd also writes the user event.log to disk, so you might want to double check that the job event logs are not on a slow disk.

o) Make sure the schedd and shadow do not have D_FULLDEBUG or other very verbose flags in their DEBUG levels.

o) What version of HTCondor are you running? 23.2 has an improvement in the speed of the schedd when running with a large fd limit: https://github.com/htcondor/htcondor/pull/1907

o) When there are a lot of jobs in the queue, condor_q can eat a lot of time out of the schedd. condor_watch_q can show much of the same information as condor_q, but without bothering the schedd.
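For the debug-level item in the list above, a small sketch of how one might scan a DEBUG setting for verbose flags. The value would come from something like `condor_config_val SCHEDD_DEBUG` or `condor_config_val SHADOW_DEBUG`; the helper name `verbose_flags` and the choice of which flags count as "very verbose" (here only D_FULLDEBUG and D_ALL) are assumptions for illustration.

```python
import re

# Assumed set of flags treated as "very verbose" in this sketch.
VERBOSE_FLAGS = {"D_FULLDEBUG", "D_ALL"}


def verbose_flags(debug_value):
    """Return the verbose flags found in a DEBUG setting such as 'D_FULLDEBUG D_COMMAND'."""
    # Split on whitespace and commas, and drop any empty tokens.
    tokens = [tok for tok in re.split(r"[,\s]+", debug_value.strip()) if tok]
    return [tok for tok in tokens if tok.upper() in VERBOSE_FLAGS]
```

An empty result for both the schedd and shadow settings suggests logging verbosity is not the bottleneck.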


-greg

