Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [HTCondor-users] preempting many jobs before maxjobretirementtime is reached

Date: Mon, 25 May 2020 07:19:16 +0200
From: Carsten Aulbert <carsten.aulbert@xxxxxxxxxx>
Subject: Re: [HTCondor-users] preempting many jobs before maxjobretirementtime is reached

Hi again,

Brian and I looked at this off-list, and so far it seems we hit a TCP
limit on the host, though it is not yet clear to me what the problem is.

Most telling are lines like

05/21/20 19:19:42 (2076445.0) (2492275): attempt to connect to
<10.20.30.17:14867> failed: Connection timed out (connect errno = 110).
 Will keep trying for 300 total seconds (269 to go).

from the ShadowLog, where the shadows can no longer connect to the
schedd on the very same host...
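
A minimal sketch for tallying such failures per target address, assuming
the ShadowLog entries look like the excerpt above. The regular expression
and the function name count_connect_failures are illustrative assumptions
modeled on that one excerpt, not part of HTCondor, and other log formats
may need a different pattern.

```python
import re
from collections import Counter

# Pattern modeled on the single ShadowLog excerpt quoted above (assumption:
# other HTCondor versions or settings may format these lines differently).
CONNECT_FAIL = re.compile(
    r"\((?P<job>\d+\.\d+)\)\s+\((?P<pid>\d+)\):\s+attempt to connect to\s+"
    r"<(?P<addr>[\d.]+:\d+)>\s+failed:\s+(?P<reason>[^()]+?)\s+"
    r"\(connect errno = (?P<errno>\d+)\)"
)


def count_connect_failures(log_text):
    """Count failed connection attempts per (target address, errno)."""
    # Collapse whitespace so entries wrapped across lines (as in the
    # email excerpt) still match as one record.
    flat = " ".join(log_text.split())
    counts = Counter()
    for match in CONNECT_FAIL.finditer(flat):
        counts[(match.group("addr"), int(match.group("errno")))] += 1
    return counts
```

Calling count_connect_failures() on the contents of a ShadowLog file
returns a Counter keyed by (address, errno), which makes it easy to see
whether the timeouts concentrate on the local schedd address.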

The current work-around for us is to use

MAX_JOBS_RUNNING        = 30000

as we saw this happening when we had about 40k shadow processes on the host.

We have tried playing with the usual sysctl suspects, e.g.

net.core.somaxconn
net.core.netdev_max_backlog
net.ipv4.ip_local_port_range
net.ipv4.tcp_fin_timeout
net.ipv4.tcp_max_syn_backlog

but to no avail :(
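
For comparing hosts, a small sketch that reads the current values of the
settings listed above straight from /proc/sys (on Linux, a sysctl name
maps to a path by replacing the dots with slashes). The function name
read_sysctls and its root parameter are illustrative assumptions; the
parameter only exists so the function can be pointed at another
directory.

```python
from pathlib import Path

# The sysctl names listed in the message above.
SYSCTLS = [
    "net.core.somaxconn",
    "net.core.netdev_max_backlog",
    "net.ipv4.ip_local_port_range",
    "net.ipv4.tcp_fin_timeout",
    "net.ipv4.tcp_max_syn_backlog",
]


def read_sysctls(names=SYSCTLS, root="/proc/sys"):
    """Return {name: list of value fields, or None if unreadable}."""
    values = {}
    for name in names:
        # net.core.somaxconn -> <root>/net/core/somaxconn
        path = Path(root, *name.split("."))
        try:
            values[name] = path.read_text().split()
        except OSError:
            # Missing on this kernel or not readable by this user.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_sysctls().items():
        print(name, "=", value)
```

Running this on the submit host before and after a change shows whether
the new values actually took effect.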

If anyone has submit hosts on Linux beyond 40k active shadows, please let
us know what we are missing ;-)

Cheers

Carsten

-- 
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185

Follow-Ups:
- Re: [HTCondor-users] preempting many jobs before maxjobretirementtime is reached
  - From: Tim Theisen

References:
- [HTCondor-users] preempting many jobs before maxjobretirementtime is reached
  - From: Carsten Aulbert
- Re: [HTCondor-users] preempting many jobs before maxjobretirementtime is reached
  - From: Bockelman, Brian
