
[HTCondor-users] Increasing max open files for condor_schedd



How do I increase the maximum number of open file descriptors for condor_schedd version 9.0.15 running on Scientific Linux 7.9?

I am able to increase the value for condor_master with a systemd drop-in:

[root@ldas-pcdev10 ~]# cat /etc/systemd/system/condor.service.d/filelimit.conf 
[Service]
LimitNOFILE=65536

[root@ldas-pcdev10 ~]# systemctl daemon-reload
[root@ldas-pcdev10 ~]# systemctl restart condor
[root@ldas-pcdev10 ~]# cat /proc/`pgrep condor_master`/limits | egrep -e "Limit|files"
Limit                     Soft Limit           Hard Limit           Units     
Max open files            65536                65536                files     

However, a condor_schedd process started by condor_master inherits only the hard limit; its soft limit stays at 4096:

[root@ldas-pcdev10 ~]# cat /proc/`pgrep condor_schedd`/limits | egrep -e "Limit|files"
Limit                     Soft Limit           Hard Limit           Units     
Max open files            4096                 65536                files  
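For comparing the two daemons side by side, the soft and hard values can be pulled out of the limits table with a small helper. This is only a sketch: the function name nofile_limits is made up for illustration, and it assumes the /proc/<pid>/limits column layout shown above.

```shell
# Hypothetical helper (not part of HTCondor): print the soft and hard
# "Max open files" values from limits-formatted text read on stdin.
nofile_limits() {
  awk '/^Max open files/ { print $4, $5 }'
}

# Demo using the condor_schedd values quoted above.
nofile_limits <<'EOF'
Limit                     Soft Limit           Hard Limit           Units
Max open files            4096                 65536                files
EOF
# prints: 4096 65536

# Against a live process (assumes exactly one matching PID):
#   nofile_limits < /proc/"$(pgrep condor_schedd)"/limits
```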

Note that raising the limit on the running condor_schedd process with prlimit seems to come too late to avoid the problem I am trying to solve:

> 08/03/22 11:42:43 (59432875.0) (108176): ERROR "Error from slot1@glidein_1823669_374176660@node505.cluster.ldas.cit: Failed to transfer files" at line 583 in file /var/lib/condor/execute/slot1/dir_118052/userdir/.tmpSE9flx/BUILD/condor-9.0.13/src/condor_shadow.V6.1/pseudo_ops.cpp
> 08/03/22 11:42:43 (59445968.2) (108181): Request to transfer files for 59445968.2 (/tmp/x509up_u40348) was rejected by schedd at <131.215.113.204:10633>: file descriptor safety level exceeded:  limit 3277,  registered socket count 3278,  fd 1795
> 08/03/22 11:42:43 (59445968.2) (108181): Sending NO GoAhead for 10.9.3.46 to receive /tmp/x509up_u40348.
> 08/03/22 11:42:43 (59445968.2) (108181): Request to transfer files for 59445968.2 (/tmp/x509up_u40348) was rejected by schedd at <131.215.113.204:10633>: file descriptor safety level exceeded:  limit 3277,  registered socket count 3278,  fd 1795
> 08/03/22 11:42:43 (59445968.2) (108181): File transfer failed (status=0).

As Todd Miller pointed out, 3277/4096 is 80%, so this is presumably an internal Condor limit computed early at process start time.
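The arithmetic behind that 80% observation can be checked quickly. This is a sketch only: the 0.8 factor is inferred from the log message, how Condor actually computes or rounds the value is an assumption, and the function name is hypothetical.

```shell
# Hypothetical helper: 80% of a given soft limit, to one decimal place.
# The 0.8 factor is inferred from "limit 3277" with a 4096 soft limit;
# it is not taken from the HTCondor source.
fd_safety_estimate() {
  awk -v soft="$1" 'BEGIN { printf "%.1f\n", soft * 0.8 }'
}

fd_safety_estimate 4096    # prints 3276.8, which rounds to the logged 3277
fd_safety_estimate 65536   # prints 52428.8
```

If the same ratio applied to a 65536 soft limit, the safety level would sit near 52429, far above the 3278 registered sockets in the log.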

P.S. It might be worth documenting the answer on the following wiki page, which already discusses Condor file descriptor limits but doesn't cover this use case:
https://htcondor.org/wiki-archive/pages/LinuxTuning/

Thanks.

--
Stuart Anderson
sba@xxxxxxxxxxx