
Re: [Condor-users] Trouble getting file descriptor tunings to apply to condor processes when a machine reboots



You can check by hand which limits were actually applied to the daemon at startup by inspecting:

cat /proc/`/sbin/pidof condor_schedd`/limits
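If you only care about the file descriptor limit, the "Max open files" row is the one to read. A minimal sketch, assuming the usual /proc/<pid>/limits column layout (limit name, soft limit, hard limit, units); the function name max_open_files is made up for illustration:

```shell
# Print the soft and hard "Max open files" values from a limits file.
# $1 is the path to a /proc/<pid>/limits file.
max_open_files() {
    # Fields: $1..$3 are "Max open files", $4 is the soft limit, $5 the hard limit.
    awk '/^Max open files/ { print $4, $5 }' "$1"
}

# Example use against a running schedd (assumes a single condor_schedd process):
# max_open_files "/proc/$(/sbin/pidof condor_schedd)/limits"
```

Comparing this output right after boot and again after a manual restart of Condor should show whether the two startups really get different limits.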

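Note that fs.file-max in /etc/sysctl.conf is the system-wide ceiling on open files, not the per-process limit. RHEL5 init scripts use "runuser" to start daemons like Condor, and /etc/pam.d/runuser invokes pam_limits.so to apply per-user limits. So you probably need to set your per-process limits in /etc/security/limits.conf: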

condor soft nofile 65535
condor hard nofile 65535
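To confirm the entries are actually in place, something like the following could be used. This is only a sketch: the function name nofile_entries is hypothetical, and it assumes the standard four-column limits.conf layout (domain, type, item, value) with the daemon running as the "condor" user.

```shell
# List the nofile entries for one user in a limits.conf-style file.
# $1 is the user name, $2 is the path to the limits file.
nofile_entries() {
    # Commented-out lines start with "#", so they never match the user name.
    awk -v u="$1" '$1 == u && $3 == "nofile" { print $2, $4 }' "$2"
}

# Example use:
# nofile_entries condor /etc/security/limits.conf
#
# As root, the limit that runuser hands to the condor user can be checked with
# (assumes /bin/bash exists and the condor account is present):
# runuser -s /bin/bash condor -c 'ulimit -n'
```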

-- Lans Carstensen

Ian Chesal wrote:

I have a strange problem with a RHEL 5 scheduler box. I've applied the usual file descriptor tuning settings to this box via /etc/sysctl.conf.

On boot they appear to be applied before system services are started:


May 18 07:51:52 hostname sysctl: net.ipv4.ip_forward = 0
May 18 07:51:52 hostname sysctl: net.ipv4.conf.default.rp_filter = 1
May 18 07:51:52 hostname sysctl: net.ipv4.conf.default.accept_source_route = 0
May 18 07:51:52 hostname sysctl: kernel.sysrq = 0
May 18 07:51:52 hostname sysctl: kernel.core_uses_pid = 1
May 18 07:51:52 hostname sysctl: kernel.pid_max = 4194303
May 18 07:51:52 hostname sysctl: fs.file-max = 262144
May 18 07:51:52 hostname sysctl: net.ipv4.ip_local_port_range = 1024 65535
May 18 07:51:52 hostname network: Setting network parameters: succeeded
May 18 07:51:52 hostname network: Bringing up loopback interface: succeeded
May 18 07:51:57 hostname ifup: Enslaving eth0 to bond0
May 18 07:51:57 hostname ifup: Enslaving eth1 to bond0
May 18 07:51:57 hostname network: Bringing up interface bond0: succeeded
May 18 07:52:17 hostname hpsmhd: smhstart startup succeeded
May 18 07:52:17 hostname condor: Starting up Condor
May 18 07:52:17 hostname rc: Starting condor:  succeeded
May 18 07:52:17 hostname crond: crond startup succeeded


But the scheduler on the box, after boot, will still hit file descriptor limits before it's even close to running as many jobs as it can handle:


5/18 08:12:52 Return from Handler <to startd <10.10.10.242:4208>>
5/18 08:12:52 Starting add_shadow_birthdate(1287113.13)
5/18 08:12:52 Started shadow for job 1287113.13 on "<10.10.10.35:1188>", (shadow pid = 19732)

**** PANIC -- OUT OF FILE DESCRIPTORS at line 781 in dprintf.c


The strange thing is: restarting Condor at this point fixes the problem. The scheduler can grow running jobs well beyond the point where it hit that file descriptor limit the first time. It's as if the file descriptor settings weren't in place when the Condor processes were started up on boot.

Has anyone else ever run into a problem like this before?

Regards,
- Ian

--
Ian Chesal
ichesal@xxxxxxxxxxxxxxxxxx
http://www.cyclecomputing.com/

