
Re: [HTCondor-users] preempting many jobs before maxjobretirementtime is reached

Hi Carsten,

We recently discovered that sites run into issues with more than 32k running jobs on a submit host. It turns out that systemd enforces a limit of 32k tasks and open files on the HTCondor service. We changed the condor.service unit file to set 'TasksMax' and 'LimitNOFILE' to infinity; that change will appear in the next release.
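For anyone who needs the fix before the next release ships, the same override can be applied locally as a systemd drop-in rather than by editing the packaged unit file. This is a sketch; the drop-in path is the systemd convention, not taken from the release:

```
# /etc/systemd/system/condor.service.d/limits.conf  (conventional drop-in path)
[Service]
TasksMax=infinity
LimitNOFILE=infinity
```

After creating the file, run `systemctl daemon-reload` and restart the condor service; `systemctl show condor -p TasksMax -p LimitNOFILE` should then report the new values.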

Perhaps that could be the source of your problem?


On 5/25/20 12:19 AM, Carsten Aulbert wrote:
Hi again,

Brian and I looked at this off-list and so far it seems we hit a TCP
limit on the host, though it is not yet clear to me what the exact problem is.

Most telling are lines like

05/21/20 19:19:42 (2076445.0) (2492275): attempt to connect to
<> failed: Connection timed out (connect errno = 110).
 Will keep trying for 300 total seconds (269 to go).

from the ShadowLog where the shadows cannot connect to the schedd on the
very same host anymore...

The current work-around for us is to use

MAX_JOBS_RUNNING        = 30000

as we saw this happening when we had about 40k shadow processes on the host.
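Since each shadow needs at least one TCP connection back to the schedd, ~40k shadows implies roughly 40k sockets plus their file descriptors, which is where limits start to bite. A rough way to see how close a submit host is to that pressure (a generic Linux sketch, not an HTCondor tool; paths are standard procfs):

```shell
#!/bin/sh
# Rough check of fd/socket pressure on a submit host.

# Count running condor_shadow processes (prints 0 on a non-HTCondor host).
shadows=$(pgrep -c condor_shadow || true)

# System-wide file handles: /proc/sys/fs/file-nr holds "allocated free max".
read -r allocated _free maximum < /proc/sys/fs/file-nr

echo "shadows=$shadows allocated_fds=$allocated max_fds=$maximum"
```

If `allocated` is anywhere near `max_fds`, or the shadow count times a few descriptors each approaches the per-service LimitNOFILE, that points at the same ceiling Tim describes above.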

We have tried playing with the usual sysctl suspects, e.g.


but to no avail :(
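The list of sysctls tried did not survive in the archive, so for context only, these are commonly tuned TCP/fd knobs in this kind of situation; they are generic guesses, not necessarily what was actually tried here:

```
# /etc/sysctl.d/99-submit-tuning.conf  (generic example values)
net.ipv4.ip_local_port_range = 1024 65535   # widen ephemeral port range
net.ipv4.tcp_max_syn_backlog = 8192         # allow more half-open connections
net.core.somaxconn = 4096                   # deepen the accept backlog
fs.file-max = 2097152                       # raise system-wide fd ceiling
```

Note that none of these help if the ceiling is the per-service systemd limit rather than a kernel-wide one, which matches the symptom of hitting a wall around 32k-40k shadows.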

If anyone runs Linux submit hosts with more than 40k active shadows, please
let us know what we are missing ;-)



HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
Tim Theisen
Release Manager
HTCondor & Open Science Grid
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin - Madison
4261 Computer Sciences and Statistics
1210 W Dayton St
Madison, WI 53706-1685
+1 608 265 5736