Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] preempting many jobs before maxjobretirementtime is reached

Date: Tue, 26 May 2020 08:33:08 -0500
From: Tim Theisen <tim@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] preempting many jobs before maxjobretirementtime is reached

Hi Carsten,

We recently discovered that people run into issues once a submit host has more than 32k running jobs. It turns out that systemd enforces a limit of 32k on both tasks and open files for HTCondor. We changed the condor.service unit file to set 'TasksMax' and 'LimitNOFILE' to infinity. That change will appear in the next release.
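
To see whether a given submit host is still subject to these limits, the output of `systemctl show condor.service -p TasksMax -p LimitNOFILE` can be inspected. Below is a minimal Python sketch, assuming that output is already available as text. The function name `limits_below` and the default threshold of 32768 are illustrative assumptions, not part of HTCondor or systemd.

```python
# Values systemd may print for "no limit"; older versions show UINT64_MAX
# instead of the word "infinity".
UNLIMITED = {"infinity", str(2**64 - 1)}


def limits_below(show_output, threshold=32768):
    """Return {property: value} for TasksMax/LimitNOFILE at or below threshold.

    show_output is the text printed by `systemctl show ... -p TasksMax -p LimitNOFILE`.
    """
    low = {}
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if not sep or key not in ("TasksMax", "LimitNOFILE"):
            continue
        value = value.strip()
        if value in UNLIMITED:
            continue
        if value.isdigit() and int(value) <= threshold:
            low[key] = int(value)
    return low
```

An empty result suggests neither limit is in the 32k range; any entry returned is a candidate for a unit-file override.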

Perhaps that could be the source of your problem?

..Tim

On 5/25/20 12:19 AM, Carsten Aulbert wrote:

Hi again,

Brian and I looked at this off-list and, so far, it seems we hit a TCP
limit on the host, though it is not yet clear to me what the problem is.

Most telling are lines like

05/21/20 19:19:42 (2076445.0) (2492275): attempt to connect to <10.20.30.17:14867> failed: Connection timed out (connect errno = 110). Will keep trying for 300 total seconds (269 to go).

from the ShadowLog where the shadows cannot connect to the schedd on the
very same host anymore...
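
To get a feel for how widespread these failures are, the ShadowLog can be scanned for such lines. The following is a rough Python sketch; the regular expression is modelled only on the excerpt quoted here, so other HTCondor versions may format the message differently, and the function name `count_connect_failures` is made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the quoted ShadowLog excerpt (an assumption, not an
# official log format). \s+ also tolerates a line wrap before the address.
CONNECT_FAILED = re.compile(
    r"attempt to connect to\s+<(?P<addr>[^>]+)>\s+failed:.*"
    r"\(connect errno = (?P<errno>\d+)\)"
)


def count_connect_failures(log_text):
    """Count failed connection attempts per (address, errno) pair."""
    return Counter(
        (m["addr"], int(m["errno"])) for m in CONNECT_FAILED.finditer(log_text)
    )
```

Calling it on the contents of the ShadowLog gives one counter entry per target address and errno, which makes it easy to see whether all timeouts point at the local schedd.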

The current work-around for us is to use

MAX_JOBS_RUNNING        = 30000

as we saw this happening when we had about 40k shadow processes on the host.
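
For tracking how close a host gets to that number, the shadow processes can be counted directly. This is a small Linux-only Python sketch that scans `/proc`; it assumes the shadows show up under the process name `condor_shadow`, and the function name `count_processes` and the `proc_root` parameter are illustrative.

```python
import os


def count_processes(name="condor_shadow", proc_root="/proc"):
    """Count processes whose /proc/<pid>/comm equals name (Linux only)."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "sys"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                if fh.read().strip() == name:
                    count += 1
        except OSError:
            continue  # process exited (or is unreadable) while scanning
    return count
```

Logging `count_processes()` periodically next to the ShadowLog timestamps would show at what shadow count the connection timeouts begin.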

We have tried playing with the usual sysctl suspects, e.g.

net.core.somaxconn
net.core.netdev_max_backlog
net.ipv4.ip_local_port_range
net.ipv4.tcp_fin_timeout
net.ipv4.tcp_max_syn_backlog

but to no avail :(
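
When comparing hosts, it helps to record the values actually in effect. A minimal Python sketch that reads the settings listed above from `/proc/sys`, assuming a Linux host; the helper name `read_sysctl` and the `root` parameter are illustrative.

```python
import os

# The settings named in this message.
SYSCTLS = [
    "net.core.somaxconn",
    "net.core.netdev_max_backlog",
    "net.ipv4.ip_local_port_range",
    "net.ipv4.tcp_fin_timeout",
    "net.ipv4.tcp_max_syn_backlog",
]


def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a dotted sysctl name as a string."""
    # net.core.somaxconn maps to /proc/sys/net/core/somaxconn
    path = os.path.join(root, *name.split("."))
    with open(path) as fh:
        return fh.read().strip()
```

Looping over `SYSCTLS` and printing `read_sysctl(name)` for each gives a snapshot that can be attached to a report like this one.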

If anyone runs submit hosts on Linux with more than 40k active shadows,
please let us know what we are missing ;-)

Cheers

Carsten

-- 
Tim Theisen
Release Manager
HTCondor & Open Science Grid
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin - Madison
4261 Computer Sciences and Statistics
1210 W Dayton St
Madison, WI 53706-1685
+1 608 265 5736
