
Re: [HTCondor-users] Scalability of condor_credd



The number of execute nodes is not really the issue; the credd only has work to do when a job starts up, so the real question is how many job starts per second occur across the whole pool.

If jobs typically take 1 hour, then a 5,000-node pool would be starting roughly 1.4 jobs per second on average, which the credd should be able to handle without difficulty.
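
(That is just steady-state arithmetic: 5,000 concurrently running jobs / 3,600 seconds per job ≈ 1.4 job starts per second.)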

-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Hitchen, Greg (IM&T, Kensington WA) <Greg.Hitchen@xxxxxxxx>
Sent: Thursday, July 29, 2021 3:07 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Scalability of condor_credd
 

Hi All

 

I’ve been experimenting with using condor_credd to allow run_as_owner in submit files.
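
For context, the submit side is nothing exotic; stripped down, the submit files look something like the following (the executable and file names are just placeholders, the key line is run_as_owner):

universe     = vanilla
executable   = myjob.exe
output       = myjob.out
error        = myjob.err
log          = myjob.log
run_as_owner = true
load_profile = true
queue

Each user also stores their Windows password once with condor_store_cred add, so the credd has a credential to hand out when a job starts.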

 

I’ve tested this successfully on a small test pool: Linux central manager (8.8.13), Windows submit node, Windows condor_credd node, and Windows execute nodes (all 8.8.12). The submit nodes and condor_credd node are running Windows Server 2016; the submit nodes are 8-core with 32 GB RAM and the condor_credd node is 4-core with 16 GB RAM (test credd node; we would probably go 8-core with 32 GB RAM for production).
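
For completeness, the submit and execute nodes in the test pool just point at the credd and allow run-as-owner jobs; the relevant condor_config lines are roughly the following (hostname is a placeholder):

# where daemons fetch the stored passwords from
CREDD_HOST = credd-node.example.com
# cache a fetched password locally so the credd is not contacted on every start
CREDD_CACHE_LOCALLY = True
# allow the starter to run jobs as the submitting user instead of a slot account
STARTER_ALLOW_RUNAS_OWNER = True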

 

All works OK.

 

I was then able to make it work across two pools using the one condor_credd node, by making the condor_credd node report to both pools, i.e.

CONDOR_HOST = test-pool-cm, other-pool-cm

in condor_config on the condor_credd node.
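
In case it helps anyone else, the credd node's condor_config ends up looking roughly like this (treat it as a sketch rather than a recipe):

# advertise to both pools' central managers
CONDOR_HOST = test-pool-cm, other-pool-cm
# dedicated credd box: run just the master and the credd
DAEMON_LIST = MASTER CREDD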

 

Our production system has 9 pools (with flocking enabled across all of them), with a total of over 2,000 machines and 10,000+ slots/cores.

 

We typically have a maximum of ~5,000 cores available at any one time (user activity, machines off overnight, etc.), and therefore a maximum of ~5,000 single-core jobs running simultaneously.

 

Does anyone have a feel for how the single condor_credd node would handle this?

OK?

Sluggish?

Curl up and die?

 

Thanks for any help/advice/comments/suggestions.

 

Cheers

 

Greg