
Re: [HTCondor-users] Access point scale



cool, thanks in advance :)

I remember now that removing anything related to slot-weight calculation took some load off the sched as well, and even more off the EP/startd ...
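For reference, this is roughly the kind of change I mean (a minimal sketch from memory - double-check the knob names against the manual before using them):

    # negotiator: stop folding slot weights into matchmaking/accounting
    NEGOTIATOR_USE_SLOT_WEIGHTS = False
    # EP/startd: make the slot weight a trivial constant
    SLOT_WEIGHT = 1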

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


Von: "Dudu Handelman" <duduhandelman@xxxxxxxxxxx>
An: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Gesendet: Freitag, 26. Januar 2024 12:36:14
Betreff: Re: [HTCondor-users] Access point scale

Thanks Christoph. 
Will verify and update with findings. 

Thanks again 





From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Beyer, Christoph <christoph.beyer@xxxxxxx>
Sent: Friday, January 26, 2024 1:32:27 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Access point scale

Hi Dudu,

I am probably not a big help, but I can tell you that a powerful sched can hold these kinds of job numbers, although the condor design is not optimized for that.

From my experience, the main bottleneck on the AP is the state transaction file 'JOB_QUEUE_LOG'. If you have not done so already, put it on a fast SSD - it helps a lot.
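Something along these lines in the sched's config (the path is just an example, of course):

    # move the schedd's transaction log off the default spool onto local SSD/NVMe
    JOB_QUEUE_LOG = /nvme/condor/spool/job_queue.log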

Also, shared storage is usually a nuisance, especially for the log files, which are constantly written by the shadows. Every running job has a shadow that keeps an open file handle on the individual job log file.

If that location is on a shared filesystem it will cause grief!

We ended up running native GPFS on the scheds in order to get decent responsiveness and overall performance, as most of our users use it as a logfile location ...
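If native GPFS (or similar) is not an option for you, a sketch of the alternative (the path is made up) is to have users point the job event log at local disk in the submit file, so the shadows write locally:

    # submit file: keep the per-job event log off the shared filesystem
    log = /scratch/condor-logs/$(Cluster).$(Process).log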


Maybe this helps a little bit - I would be interested in any knowledge you gain on this, too!

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


Von: "Dudu Handelman" <duduhandelman@xxxxxxxxxxx>
An: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Gesendet: Freitag, 26. Januar 2024 11:36:18
Betreff: [HTCondor-users] Access point scale

Hi All.
We have just added some cores to our cluster, so now a single-user access point might have 40k jobs in a running state. The jobs are short,
probably 20 minutes; some are less.

I know the basics:
No swap.
No limit on file descriptors (see the sketch after this list).
Use a physical server.
Use NVMe/SSD.
Sufficient cores and RAM.
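For the file descriptor part, this is roughly what I have in place (a sketch only - the values are mine rather than recommendations, and the knob names should be checked against the manual):

    # condor config: have the master raise the fd limits for the daemons
    MAX_FILE_DESCRIPTORS = 65536
    # the schedd is the hungriest daemon on a busy access point
    SCHEDD_MAX_FILE_DESCRIPTORS = 65536

plus the usual ulimit / systemd LimitNOFILE settings on the host itself.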

I'm using shared port, which complains in some cases that the server was too busy to answer.

Sometimes condor_q is not responding.
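For reference, these are the two knobs I have been experimenting with for the busy shared port and the unresponsive condor_q (the values are guesses, not tuned numbers):

    # shared port daemon: allow more parallel connection hand-offs
    SHARED_PORT_MAX_WORKERS = 100
    # schedd: more forked workers to answer condor_q while the schedd is busy
    SCHEDD_QUERY_WORKERS = 16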

But the main issue is that while condor_q shows 25k running jobs, condor_q -run shows that only 15k jobs have a slot,
which means that resources are claimed but the jobs are not running yet.

Because the jobs are short-running, it never uses all the resources it has claimed.
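One thing I am wondering about for such short jobs is the claim lifetime - whether letting the startds keep claims alive longer would let the next job reuse an already-claimed slot instead of it sitting idle. A sketch of what I mean (the value is a guess):

    # EP/startd config: keep a claim for up to an hour so short jobs can reuse it
    CLAIM_WORKLIFE = 3600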

Some extra information:
Using the docker universe.
Using shared storage.
Trying to minimize file transfers.
Not streaming output or error files.

What will improve the performance? Please share from your experience.

Many thanks 
David 






_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/