
[HTCondor-users] Access point scale



Hi all,
We have just added cores to our cluster, so a single user access point might now have 40k jobs in the running state. The jobs are short, roughly 20 minutes, some less.

I know the basics:
No swap.
No limit on file descriptors.
Use a physical server.
Use NVMe/SSD.
Sufficient cores and RAM.
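For what it's worth, this is roughly how I sanity-check those basics on the access point host (a Python sketch, Linux-specific paths assumed):

```python
import os
import resource

def access_point_basics():
    """Report the host-level basics listed above (Linux-specific paths)."""
    # Soft limit on open file descriptors for this process.
    nofile_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        with open("/proc/swaps") as f:
            # First line is a header; any further lines are active swap devices.
            swap_devices = len(f.readlines()) - 1
    except OSError:
        swap_devices = -1  # could not tell
    return {
        "open_file_soft_limit": nofile_soft,
        "active_swap_devices": swap_devices,  # want 0 on an access point
        "cores": os.cpu_count(),
    }

if __name__ == "__main__":
    for key, value in access_point_basics().items():
        print(f"{key}: {value}")
```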

I'm using shared_port, which in some cases complains that the server was too busy to answer.
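For reference, these are the knobs I've been looking at on the access point (a sketch; the values are guesses, and I'm not certain SHARED_PORT_MAX_WORKERS is the right one for the "too busy" message):

```
# condor_config.local on the access point -- a sketch, not verified at 40k scale.
USE_SHARED_PORT = True
# More forked workers accepting incoming connections on the shared port.
SHARED_PORT_MAX_WORKERS = 200
# More workers answering condor_q queries against the schedd.
SCHEDD_QUERY_WORKERS = 16
```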

Sometimes condor_q is not responding.

But the main issue is that while condor_q shows 25k running jobs, condor_q -run shows that only 15k of them have a slot.
Which means the resources are claimed but not yet running a job.
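To quantify that gap, I'm counting it roughly like this (a sketch against job ClassAds from `condor_q -json`; attribute names are from the job ClassAd, and my assumption is that RemoteHost only appears once the job has really landed on a slot):

```python
def claimed_vs_running(ads):
    """Given a list of job ClassAds (as dicts), count jobs in the running
    state (JobStatus == 2) vs. those that also report a RemoteHost --
    assuming RemoteHost only appears once the job is really on a slot."""
    running = [ad for ad in ads if ad.get("JobStatus") == 2]
    with_slot = [ad for ad in running if ad.get("RemoteHost")]
    return len(running), len(with_slot)

if __name__ == "__main__":
    # In practice, feed this the parsed output of: condor_q -json
    # The sample ads below are fabricated for illustration.
    sample = [
        {"ClusterId": 1, "JobStatus": 2, "RemoteHost": "slot1@node01"},
        {"ClusterId": 2, "JobStatus": 2},  # "running" but no slot reported yet
        {"ClusterId": 3, "JobStatus": 1},  # idle
    ]
    running, with_slot = claimed_vs_running(sample)
    print(f"running: {running}, with a slot: {with_slot}")
```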

Because the jobs are short-running, the pool never uses all of the resources it has claimed.

Some extra information:
Using the Docker universe.
Using shared storage.
Trying to minimize file transfers.
Not streaming output or error.
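For completeness, the submit files are shaped roughly like this (image name, paths, and the queue count are placeholders):

```
# Sketch of the submit description implied above.
universe              = docker
docker_image          = myorg/worker:latest
executable            = /shared/run.sh
should_transfer_files = NO     # shared storage instead of file transfer
stream_output         = False
stream_error          = False
queue 1000
```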

What would improve performance? Please share from your experience.

Many thanks 
David 



