We recently discovered that people run into issues when more than 32k jobs are running on a submit host. It turns out that systemd enforces a limit of 32k on tasks and open files for the HTCondor service. We changed the condor.service unit file to set 'TasksMax' and 'LimitNOFILE' to infinity. That change will appear in the next release.
Perhaps that is the source of your problem?
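
For anyone who wants to try this before the release, a minimal sketch of a local override might look like the following. The two settings are the ones named above; the drop-in path and file name are assumptions based on standard systemd conventions, not something taken from the HTCondor packaging.

```ini
# Assumed location: /etc/systemd/system/condor.service.d/override.conf
# (hypothetical file name; any *.conf in the drop-in directory is read)
[Service]
# Lift the per-service cap on the number of tasks (processes and threads)
TasksMax=infinity
# Lift the per-process cap on open file descriptors
LimitNOFILE=infinity
```

After creating the file, run `systemctl daemon-reload` and restart the condor service so the new limits take effect. Note that the effective open-file limit may still be capped by kernel settings on the host.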
Hi again,

Brian and I looked at this off-list. So far it seems we hit a TCP limit on the host, though it is not yet clear to me what exactly the problem is. Most telling are lines like

  05/21/20 19:19:42 (2076445.0) (2492275): attempt to connect to <10.20.30.17:14867> failed: Connection timed out (connect errno = 110). Will keep trying for 300 total seconds (269 to go).

from the ShadowLog, where the shadows can no longer connect to the schedd on the very same host.

Our current work-around is to set MAX_JOBS_RUNNING = 30000, as we saw this happening when we had about 40k shadow processes on the host.

We have tried playing with the usual sysctl suspects, e.g.

  net.core.somaxconn
  net.core.netdev_max_backlog
  net.ipv4.ip_local_port_range
  net.ipv4.tcp_fin_timeout
  net.ipv4.tcp_max_syn_backlog

but to no avail :(

If anyone has submit hosts on Linux running beyond 40k active shadows, please let us know what we are missing ;-)

Cheers
Carsten
--
Tim Theisen
Release Manager
HTCondor & Open Science Grid
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin - Madison
4261 Computer Sciences and Statistics
1210 W Dayton St
Madison, WI 53706-1685
+1 608 265 5736