
Re: [HTCondor-users] Access point scale



cool, thanks in advance :)

I remember now that removing anything related to slot-weight calculation took some load off the sched as well, and even more off the EP/startd ...
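For reference, this is roughly the kind of change I mean (a minimal sketch from memory - double-check the knob names against the manual before using them):

    # negotiator: stop folding slot weights into matchmaking/accounting
    NEGOTIATOR_USE_SLOT_WEIGHTS = False
    # EP/startd: make the slot weight a trivial constant
    SLOT_WEIGHT = 1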

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


Von: "Dudu Handelman" <duduhandelman@xxxxxxxxxxx>
An: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Gesendet: Freitag, 26. Januar 2024 12:36:14
Betreff: Re: [HTCondor-users] Access point scale

Thanks Christoph. 
Will verify and update with findings. 

Thanks again 





From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Beyer, Christoph <christoph.beyer@xxxxxxx>
Sent: Friday, January 26, 2024 1:32:27 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Access point scale

Hi Dudu,

I am probably not a big help, but I can tell you that a powerful sched can hold these kinds of job numbers, although the condor design is not optimized for that.

From my experience, the main bottleneck on the AP is the state transaction file 'JOB_QUEUE_LOG'. If you have not done so already, put it on a fast SSD - it helps a lot.
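Something along these lines in the sched's config (the path is just an example, of course):

    # move the schedd's transaction log off the default spool onto local SSD/NVMe
    JOB_QUEUE_LOG = /nvme/condor/spool/job_queue.log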

Also, shared storage is usually a nuisance, especially for the log files, which are constantly written by the shadows. Every running job has a shadow that keeps an open file handle on the individual job log file.

If that location is on a shared filesystem it will cause grief!

We ended up running native GPFS on the scheds in order to get decent responsiveness and overall performance, as most of our users use it as a logfile location ...
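If native GPFS (or similar) is not an option for you, a sketch of the alternative (the path is made up) is to have users point the job event log at local disk in the submit file, so the shadows write locally:

    # submit file: keep the per-job event log off the shared filesystem
    log = /scratch/condor-logs/$(Cluster).$(Process).log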


Maybe this helps a little bit - I would be interested in any knowledge you gain on this, too!

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


Von: "Dudu Handelman" <duduhandelman@xxxxxxxxxxx>
An: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Gesendet: Freitag, 26. Januar 2024 11:36:18
Betreff: [HTCondor-users] Access point scale

Hi All.
We have just added some cores to our cluster, so now a single-user access point might have 40k jobs in a running state. The jobs are short,
probably 20 minutes; some are less.

I know the basics:
No swap.
No limit on file descriptors (see the sketch after this list).
Use a physical server.
Use NVMe/SSD.
Sufficient cores and RAM.
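For the file descriptor part, this is roughly what I have in place (a sketch only - the values are mine rather than recommendations, and the knob names should be checked against the manual):

    # condor config: have the master raise the fd limits for the daemons
    MAX_FILE_DESCRIPTORS = 65536
    # the schedd is the hungriest daemon on a busy access point
    SCHEDD_MAX_FILE_DESCRIPTORS = 65536

plus the usual ulimit / systemd LimitNOFILE settings on the host itself.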

I'm using shared port, which complains in some cases that the server was too busy to answer.

Sometimes condor_q is not responding.
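For reference, these are the two knobs I have been experimenting with for the busy shared port and the unresponsive condor_q (the values are guesses, not tuned numbers):

    # shared port daemon: allow more parallel connection hand-offs
    SHARED_PORT_MAX_WORKERS = 100
    # schedd: more forked workers to answer condor_q while the schedd is busy
    SCHEDD_QUERY_WORKERS = 16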

But the main issue is that while condor_q shows 25k running jobs, condor_q -run shows that only 15k jobs have a slot,
which means that resources are claimed but the jobs are not running yet.

Because the jobs are short-running, it never uses all the resources it has claimed.
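One thing I am wondering about for such short jobs is the claim lifetime - whether letting the startds keep claims alive longer would let the next job reuse an already-claimed slot instead of it sitting idle. A sketch of what I mean (the value is a guess):

    # EP/startd config: keep a claim for up to an hour so short jobs can reuse it
    CLAIM_WORKLIFE = 3600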

Some extra information:
Using the docker universe.
Using shared storage.
Trying to minimize file transfers.
Not streaming output or error files.

What will improve the performance? Please share from your experience.

Many thanks 
David 






_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/