
Re: [HTCondor-users] Access point scale



Hi Greg,
I have started to look into it.
I decided to start with a strong physical server: 48 cores, all NVMe storage, 768 GB of RAM, running HTCondor 23.0.3.
Submitting 25K jobs was extremely fast, and getting cores claimed was fast as well.
However, it took about 10 minutes for 10K jobs to actually start running, so I should deal with the condor_shadow first.

The interesting part is that the number of running jobs increases by about 200 at a time.
Also, the ShadowLog gets written every 10 seconds, which is roughly consistent with 10K jobs starting in 10 minutes in batches of 200.
I will keep digging.
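
As a quick sanity check of the numbers above, the arithmetic can be written out in a few lines of Python. This is only a back-of-the-envelope sketch: the figures are the approximate ones from this message, and the variable names are made up for illustration.

```python
# Approximate figures reported in the message above.
batch_size = 200          # jobs that start per burst
burst_interval_s = 10     # seconds between ShadowLog bursts
window_s = 10 * 60        # the 10-minute window that was observed

bursts = window_s // burst_interval_s        # number of bursts in the window
jobs_started = bursts * batch_size           # jobs started at this pace
rate_per_s = batch_size / burst_interval_s   # implied job start rate
seconds_for_10k = 10_000 / rate_per_s        # time to start 10K jobs at that rate

print(bursts, jobs_started, rate_per_s, seconds_for_10k)
```

At 200 jobs every 10 seconds the implied start rate is about 20 jobs per second, which is in the same range as the observed 10K jobs in roughly 10 minutes.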

David


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Dudu Handelman <duduhandelman@xxxxxxxxxxx>
Sent: 26 January 2024 18:48
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Access point scale
 
Thanks Greg. 
If I recall correctly, it's version 9.x
I will update with my findings. 

Thank you very much.
David 




From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Friday, January 26, 2024 6:31:19 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Greg Thain <gthain@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Access point scale

On 1/26/24 04:36, Dudu Handelman wrote:


Hi David:

At some point, we'll just need to profile the schedd with
bpftrace/strace to know for certain what is going on.  Without that,
though, here are a couple of issues, which you probably already know
about.  The first indication that the schedd is overloaded is that its
RecentDaemonCoreDutyCycle is approaching 1.0.  I assume your schedd is
in this neighborhood?
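
One way to keep an eye on this attribute is to print it with condor_status and flag values close to 1.0. The Python sketch below only parses text in the two-column form that a command such as `condor_status -schedd -af Name RecentDaemonCoreDutyCycle` is expected to print. The function name, the sample host names, and the 0.95 threshold are assumptions chosen for illustration, not recommended values.

```python
def overloaded_schedds(status_text, threshold=0.95):
    """Return (name, duty_cycle) pairs whose duty cycle is at or above threshold.

    status_text is expected to hold one "<Name> <RecentDaemonCoreDutyCycle>"
    pair per line. The threshold is an illustrative assumption.
    """
    hits = []
    for line in status_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, raw = parts
        try:
            duty = float(raw)
        except ValueError:
            continue  # attribute missing, e.g. printed as "undefined"
        if duty >= threshold:
            hits.append((name, duty))
    return hits


# Made-up sample text standing in for real condor_status output.
sample = """ap1.example.org 0.97
ap2.example.org 0.12
ap3.example.org undefined
"""
print(overloaded_schedds(sample))
```

In practice the sample text would be replaced by the captured output of the condor_status command on the pool in question.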

o) As you mentioned, the most important file to put on ssd/nvme is the
job_queue.log, but the schedd also writes the user event log to disk, so
you might want to double check that the job event logs are not on a slow
disk.

o) Make sure the schedd and shadow do not have D_FULLDEBUG or other very
verbose flags in their DEBUG levels.

o) What version of HTCondor are you running?  23.2 has an improvement in
the speed of the schedd when running with a large fd limit:
https://github.com/htcondor/htcondor/pull/1907

o) When there are a lot of jobs in the queue, condor_q can eat a lot of
time out of the schedd.  condor_watch_q can show much of the same
information as condor_q, but without bothering the schedd.
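
For the debug-level point above, a small check along the following lines could be run against the values reported by `condor_config_val SCHEDD_DEBUG` and `condor_config_val SHADOW_DEBUG`. This is a rough sketch: the helper name and the sample values are made up, and treating D_ALL as "very verbose" alongside D_FULLDEBUG is an assumption, not something stated in this thread.

```python
import re

# Assumption: which flags count as "very verbose" for this check.
VERBOSE_FLAGS = {"D_FULLDEBUG", "D_ALL"}


def verbose_flags(debug_value):
    """Return the verbose flags found in a *_DEBUG setting string."""
    tokens = re.split(r"[\s,]+", debug_value.strip())
    return sorted(t for t in tokens if t.upper() in VERBOSE_FLAGS)


# Made-up example values; real ones would come from condor_config_val.
settings = {
    "SCHEDD_DEBUG": "D_PID D_FULLDEBUG",
    "SHADOW_DEBUG": "D_CAT",
}
for knob, value in settings.items():
    found = verbose_flags(value)
    print(knob, "verbose" if found else "ok", found)
```

Any knob reported as verbose is a candidate for trimming back before measuring job start rates again.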

-greg
