Hi again,

Brian and I looked at this off-list, and so far it seems we hit a TCP
limit on the host, though it is not yet clear to me what the problem is.

Most telling are lines like

05/21/20 19:19:42 (2076445.0) (2492275): attempt to connect to
<> failed: Connection timed out (connect errno = 110).
 Will keep trying for 300 total seconds (269 to go).

from the ShadowLog, where the shadows can no longer connect to the schedd
on the very same host...

The current work-around for us is to use

MAX_JOBS_RUNNING        = 30000

as we saw this happening when we had about 40k shadow processes on the host.

We have tried playing with the usual sysctl suspects, e.g.


but to no avail :(
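
In case someone wants to compare numbers from their own submit host, here
is a minimal sketch of how one could summarise TCP socket states and the
ephemeral port range on Linux. This is only an illustration, not what we
actually ran: the function names are made up for this mail, and it assumes
the standard /proc/net/tcp layout with the state code in the fourth column.
A large pile of SYN_SENT sockets would at least be consistent with the
connect timeouts (errno 110) quoted above.

```python
#!/usr/bin/env python3
"""Summarise TCP socket states and the ephemeral port range (Linux only)."""
from collections import Counter
from pathlib import Path

# State codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp[6]."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) > 3:
            # Unknown codes are reported as-is rather than dropped.
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


def parse_port_range(text):
    """Return (low, high, number_of_ports) from ip_local_port_range."""
    low, high = (int(x) for x in text.split())
    return low, high, high - low + 1


if __name__ == "__main__":
    totals = Counter()
    for name in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(name)
        if path.exists():
            totals += count_states(path.read_text())
    for state, n in totals.most_common():
        print(f"{state:12s} {n}")

    rng = Path("/proc/sys/net/ipv4/ip_local_port_range")
    if rng.exists():
        low, high, size = parse_port_range(rng.read_text())
        print(f"ephemeral ports: {low}-{high} ({size} ports)")
```

Comparing the per-state counts against the number of ephemeral ports
should at least show whether local port exhaustion is a plausible suspect.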

If anyone has submit hosts on Linux beyond 40k active shadows, please let
us know what we are missing ;-)



