[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] preempting many jobs before maxjobretirementtime is reached



Hi again,

Brian and I looked at this off-list and so far, it seems we hit a TCP
limit on the host, though it is not yet clear to me, what the problem is.

Most telling are lines like

05/21/20 19:19:42 (2076445.0) (2492275): attempt to connect to
<10.20.30.17:14867> failed: Connection timed out (connect errno = 110).
 Will keep trying for 300 total seconds (269 to go).

from the ShadowLog where the shadows cannot connect to the schedd on the
very same host anymore...

The current work-around for us is to use

MAX_JOBS_RUNNING        = 30000

as we saw this happening when we had about 40k shadow processes on the host.

We have tried playing with the usual sysctl suspects, e.g.

net.core.somaxconn
net.core.netdev_max_backlog
net.ipv4.ip_local_port_range
net.ipv4.tcp_fin_timeout
net.ipv4.tcp_max_syn_backlog

but to no avail :(

If anyone has submit hosts on Linux beyond 40k active shadow, please let
us know what we are missing ;-)

Cheers

Carsten

-- 
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
CallinstraÃe 38, 30167 Hannover, Germany
Phone: +49 511 762 17185

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature