
[HTCondor-users] Access point scale



Hi all,
We have just added cores to our cluster, so a single user access point might now have 40k jobs in the running state. The jobs are short, roughly 20 minutes, some less.

I know the basics:
No swap.
No limit on file descriptors.
Use a physical server.
Use NVMe/SSD.
Sufficient cores and RAM.
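For what it's worth, this is roughly how I sanity-check those basics on the access point host (a Python sketch, Linux-specific paths assumed):

```python
import os
import resource

def access_point_basics():
    """Report the host-level basics listed above (Linux-specific paths)."""
    # Soft limit on open file descriptors for this process.
    nofile_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        with open("/proc/swaps") as f:
            # First line is a header; any further lines are active swap devices.
            swap_devices = len(f.readlines()) - 1
    except OSError:
        swap_devices = -1  # could not tell
    return {
        "open_file_soft_limit": nofile_soft,
        "active_swap_devices": swap_devices,  # want 0 on an access point
        "cores": os.cpu_count(),
    }

if __name__ == "__main__":
    for key, value in access_point_basics().items():
        print(f"{key}: {value}")
```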

I'm using shared_port, which in some cases complains that the server was too busy to answer.
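For reference, these are the knobs I've been looking at on the access point (a sketch; the values are guesses, and I'm not certain SHARED_PORT_MAX_WORKERS is the right one for the "too busy" message):

```
# condor_config.local on the access point -- a sketch, not verified at 40k scale.
USE_SHARED_PORT = True
# More forked workers accepting incoming connections on the shared port.
SHARED_PORT_MAX_WORKERS = 200
# More workers answering condor_q queries against the schedd.
SCHEDD_QUERY_WORKERS = 16
```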

Sometimes condor_q is not responding.

But the main issue is that while condor_q shows 25k running jobs, condor_q -run shows that only 15k of them have a slot.
Which means the resources are claimed but not yet running a job.
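To quantify that gap, I'm counting it roughly like this (a sketch against job ClassAds from `condor_q -json`; attribute names are from the job ClassAd, and my assumption is that RemoteHost only appears once the job has really landed on a slot):

```python
def claimed_vs_running(ads):
    """Given a list of job ClassAds (as dicts), count jobs in the running
    state (JobStatus == 2) vs. those that also report a RemoteHost --
    assuming RemoteHost only appears once the job is really on a slot."""
    running = [ad for ad in ads if ad.get("JobStatus") == 2]
    with_slot = [ad for ad in running if ad.get("RemoteHost")]
    return len(running), len(with_slot)

if __name__ == "__main__":
    # In practice, feed this the parsed output of: condor_q -json
    # The sample ads below are fabricated for illustration.
    sample = [
        {"ClusterId": 1, "JobStatus": 2, "RemoteHost": "slot1@node01"},
        {"ClusterId": 2, "JobStatus": 2},  # "running" but no slot reported yet
        {"ClusterId": 3, "JobStatus": 1},  # idle
    ]
    running, with_slot = claimed_vs_running(sample)
    print(f"running: {running}, with a slot: {with_slot}")
```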

Because the jobs are short-running, the pool never uses all of the resources it has claimed.

Some extra information:
Using the Docker universe.
Using shared storage.
Trying to minimize file transfers.
Not streaming output or error.
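For completeness, the submit files are shaped roughly like this (image name, paths, and the queue count are placeholders):

```
# Sketch of the submit description implied above.
universe              = docker
docker_image          = myorg/worker:latest
executable            = /shared/run.sh
should_transfer_files = NO     # shared storage instead of file transfer
stream_output         = False
stream_error          = False
queue 1000
```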

What would improve performance? Please share from your experience.

Many thanks 
David 



