
Re: [HTCondor-users] Scalability of condor_credd



The number of execute nodes is not really the issue; the credd only has work to do when a job starts up, so the real question is how many job starts per second occur across the whole pool.

If jobs typically take 1 hour, then a 5,000-node pool would be starting roughly 1.4 jobs per second on average, which the credd should be able to handle without difficulty.
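
(That is just steady-state arithmetic: 5,000 concurrently running jobs / 3,600 seconds per job ≈ 1.4 job starts per second.)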

-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Hitchen, Greg (IM&T, Kensington WA) <Greg.Hitchen@xxxxxxxx>
Sent: Thursday, July 29, 2021 3:07 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Scalability of condor_credd
 

Hi All

 

I’ve been experimenting with using condor_credd to allow run_as_owner in submit files.
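
For context, the submit side is nothing exotic; stripped down, the submit files look something like the following (the executable and file names are just placeholders, the key line is run_as_owner):

universe     = vanilla
executable   = myjob.exe
output       = myjob.out
error        = myjob.err
log          = myjob.log
run_as_owner = true
load_profile = true
queue

Each user also stores their Windows password once with condor_store_cred add, so the credd has a credential to hand out when a job starts.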

 

I’ve tested this successfully on a small test pool: Linux central manager (8.8.13), Windows submit node, Windows condor_credd node, and Windows execute nodes (all 8.8.12). The submit nodes and condor_credd node are running Windows Server 2016; the submit nodes are 8-core with 32 GB RAM and the condor_credd node is 4-core with 16 GB RAM (test credd node; we would probably go 8-core with 32 GB RAM for production).
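
For completeness, the submit and execute nodes in the test pool just point at the credd and allow run-as-owner jobs; the relevant condor_config lines are roughly the following (hostname is a placeholder):

# where daemons fetch the stored passwords from
CREDD_HOST = credd-node.example.com
# cache a fetched password locally so the credd is not contacted on every start
CREDD_CACHE_LOCALLY = True
# allow the starter to run jobs as the submitting user instead of a slot account
STARTER_ALLOW_RUNAS_OWNER = True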

 

All works OK.

 

I was then able to make it work across two pools using the one condor_credd node, by making the condor_credd node report to both pools, i.e.

CONDOR_HOST = test-pool-cm, other-pool-cm

in condor_config on the condor_credd node.
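
In case it helps anyone else, the credd node's condor_config ends up looking roughly like this (treat it as a sketch rather than a recipe):

# advertise to both pools' central managers
CONDOR_HOST = test-pool-cm, other-pool-cm
# dedicated credd box: run just the master and the credd
DAEMON_LIST = MASTER CREDD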

 

Our production system has 9 pools (with flocking enabled across all of them), with a total of over 2,000 machines and 10,000+ slots/cores.

 

We typically have a maximum of ~5,000 cores available at any one time (user activity, machines off overnight, etc.), and therefore a maximum of ~5,000 single-core jobs running simultaneously.

 

Does anyone have a feel for how the single condor_credd node would handle this?

OK?

Sluggish?

Curl up and die?

 

Thanks for any help/advice/comments/suggestions.

 

Cheers

 

Greg