
Re: [HTCondor-users] preempting many jobs before maxjobretirementtime is reached

Hi again,

Brian and I looked at this off-list, and so far it seems we hit a TCP
limit on the host, though it is not yet clear to me what the problem is.

Most telling are lines like

05/21/20 19:19:42 (2076445.0) (2492275): attempt to connect to
<> failed: Connection timed out (connect errno = 110).
 Will keep trying for 300 total seconds (269 to go).

from the ShadowLog where the shadows cannot connect to the schedd on the
very same host anymore...
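To get a feeling for how often this happens, the timeouts can be counted per minute. The following is only a minimal sketch, not something we ran for this thread: the function name `count_connect_timeouts` is made up for illustration, and it assumes that each entry in the real ShadowLog sits on a single line (the wrapping above comes from the mail) and starts with the `MM/DD/YY HH:MM:SS` timestamp shown above.

```python
import re
from collections import Counter

# Assumption: one log entry per line, starting with "MM/DD/YY HH:MM:SS".
# The pattern captures the timestamp down to the minute and only accepts
# lines that report errno 110 (ETIMEDOUT) on connect.
TIMEOUT_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}):\d{2} .*\(connect errno = 110\)"
)


def count_connect_timeouts(lines):
    """Return a Counter mapping 'MM/DD/YY HH:MM' to the number of
    connect timeouts logged in that minute."""
    per_minute = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match:
            per_minute[match.group(1)] += 1
    return per_minute


def print_report(path):
    """Print one 'minute count' line per minute that saw timeouts.

    `path` is whatever your ShadowLog location is (hypothetical here).
    """
    with open(path, errors="replace") as log:
        for minute, n in count_connect_timeouts(log).items():
            print(minute, n)
```

Calling `print_report()` with the path to the ShadowLog then shows whether the timeouts start at a certain number of running jobs or are spread evenly over time.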

The current work-around for us is to use

MAX_JOBS_RUNNING        = 30000

as we saw this happening when we had about 40k shadow processes on the host.

We have tried playing with the usual sysctl suspects, e.g.


but to no avail :(
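For anyone who wants to compare numbers with us: one way to see where the sockets pile up is to count them per TCP state from `/proc/net/tcp` and `/proc/net/tcp6`. This is just a rough sketch under the assumption of a reasonably recent Linux kernel; the helper names `count_tcp_states` and `read_tcp_tables` are made up for illustration, and the state codes are the ones the kernel prints in the `st` column.

```python
from collections import Counter

# Hex state codes as printed in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(table_text):
    """Count sockets per TCP state in the text of a /proc/net/tcp table."""
    counts = Counter()
    for line in table_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated rows
        code = fields[3].upper()
        # Unknown codes are kept as-is rather than dropped.
        counts[TCP_STATES.get(code, code)] += 1
    return counts


def read_tcp_tables(paths=("/proc/net/tcp", "/proc/net/tcp6")):
    """Sum the per-state counts over the IPv4 and IPv6 tables."""
    total = Counter()
    for path in paths:
        try:
            with open(path) as table:
                total += count_tcp_states(table.read())
        except OSError:
            pass  # table not present on this host
    return total
```

Sockets in `SYN_SENT` are connection attempts that have not been answered yet, so a large count there while the shadows log timeouts would at least fit the picture above.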

If anyone has submit hosts on Linux running beyond 40k active shadows,
please let us know what we are missing ;-)



Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185
