
Re: [HTCondor-users] no ccbid currently registered.



In addition to the always helpful advice from Dan below: besides the per-process file descriptor limits, also consider the system-wide maximum number of file descriptors (the limit across all processes). Some distros set this rather low, which matters especially if your central manager also serves as a submit machine running thousands of jobs.

For the system-wide limit, I'd suggest increasing it to 1,000,000. Our submit machines at UW-Madison are configured to use 6,400,000 (they are quite busy). You can change the limit temporarily (it resets at the next reboot) by running this command as root:

echo 1000000 >/proc/sys/fs/file-max

You can make the change permanent by setting this in /etc/sysctl.conf (this may vary depending on the Linux distro you are using):

fs.file-max = 1000000
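
To see how close a machine is to the system-wide limit, the kernel reports current usage in /proc/sys/fs/file-nr as three fields: allocated handles, allocated-but-unused handles, and the maximum. The following is a minimal sketch for summarizing that line; the helper name fd_usage is made up for illustration and is not part of HTCondor or the kernel tools.

```shell
#!/bin/sh
# fd_usage: summarize one line in /proc/sys/fs/file-nr format.
# Fields: <allocated> <allocated but unused> <system-wide maximum>
# (fd_usage is a hypothetical helper name used only for this sketch.)
fd_usage() {
    echo "$1" | awk '{
        used = $1 - $2
        printf "%d of %d in use (%d%%)\n", used, $3, int(used * 100 / $3)
    }'
}

# Usage on a live Linux system:
#   fd_usage "$(cat /proc/sys/fs/file-nr)"
```

Running it before and after raising fs.file-max should show the same in-use count against the new maximum.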

You need about 55 kernel file descriptors per running job on a submit machine, largely due to the number of shared libraries pulled into the condor_shadow process. At that rate 5000 jobs need roughly 275,000 descriptors, so 1,000,000 should give you plenty of headroom.
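
The sizing arithmetic above can be sketched as a small script. It assumes the rough figure of 55 descriptors per running job quoted here and ignores descriptors held by everything else on the machine, so treat the result as a lower bound. The helper names required_fds and headroom_ok are made up for illustration.

```shell
#!/bin/sh
# Rough estimate of kernel file descriptors needed on a submit machine.
# FDS_PER_JOB is the approximate per-job figure quoted in this message;
# required_fds and headroom_ok are hypothetical helper names.
FDS_PER_JOB=55

# required_fds NJOBS: print the estimated descriptor count for NJOBS running jobs.
required_fds() {
    echo $(( $1 * FDS_PER_JOB ))
}

# headroom_ok NJOBS LIMIT: succeed if the estimate is below LIMIT.
headroom_ok() {
    [ "$(required_fds "$1")" -lt "$2" ]
}

njobs=5000
limit=1000000
echo "$njobs jobs need about $(required_fds "$njobs") descriptors (limit $limit)"
```

For 5000 jobs this reports about 275000 descriptors, well under a limit of 1000000.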

regards
Todd

On 2/8/2013 2:14 PM, Dan Bradley wrote:
Hi Stephen,

To answer the question of why the schedd cannot connect to a target
daemon that is no longer registered with CCB, it may help to look in the
target daemon's log file, if you can locate it.  If the daemon is still
running at the time when it is not registered with CCB, you should see a
log message that says it became disconnected from CCB and you should
also see periodic attempts to reconnect to CCB.  The log message showing
the disconnect from CCB may help understand why this is happening.  If,
on the other hand, the daemon is not alive, then we need to understand
why.  The log file may help with that too.

Regarding the exhaustion of file descriptors: if condor is started as
root (the default for an rpm installation), the best way to configure
the maximum number of file descriptors available to the collector is to
use something like the following configuration setting in the HTCondor
config file:

COLLECTOR_MAX_FILE_DESCRIPTORS = 10000

When the collector starts up, you will see a line in the log file that
looks like this:

"Setting maximum file descriptors to 10000."

If condor is started as root, it can set its limit higher than the
default hard limit.  If it is not started as root, then it can only
decrease the limit.  I recommend using this configuration setting,
rather than trying to set the per-process default, because some
mechanisms for setting the per-process default (e.g. PAM settings) are
not necessarily applied to condor processes, and, anyway, the
consequences of having a huge file descriptor limit for all processes
may not be good.  For example, many processes use more memory when the
file descriptor limit is high.  For a process such as the condor_shadow,
this may add up to a lot of memory, since there may be many instances of
the shadow process.
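
Besides the log line, the limit a running collector actually received can be read from /proc/<pid>/limits. The following is a minimal sketch that extracts the soft and hard "Max open files" values from text in that format; the helper name open_files_limit is made up for illustration, and the usage line assumes pgrep is available and that a condor_collector process is running.

```shell
#!/bin/sh
# open_files_limit: read text in /proc/<pid>/limits format on stdin and
# print the soft and hard "Max open files" values.
# (open_files_limit is a hypothetical helper name used only for this sketch.)
open_files_limit() {
    awk '/^Max open files/ { print "soft=" $4, "hard=" $5 }'
}

# Usage on a live system (assumes pgrep and a running condor_collector;
# -o picks the oldest matching process if there are several):
#   open_files_limit < "/proc/$(pgrep -o -x condor_collector)/limits"
```

With the setting above in effect, both values should match the configured number.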

--Dan

On 2/8/13 1:46 PM, Stephen Pietrowicz wrote:
Hi,

I'm seeing the following message a significant number of times in some
of the larger runs we've started to do:

02/08/13 12:22:26 CCB: rejecting request from SCHEDD
<www.xxx.yyy.zzz:50190> on <www.xxx.yyy.zzz:40460> for ccbid 6987
because no daemon is currently registered with that id (perhaps it
recently disconnected).

Eventually, we get:

**** PANIC -- OUT OF FILE DESCRIPTORS at line 175 in
/slots/01/dir_65060/userdir/src/condor_io/reli_sock.cpp

And in /var/log/messages, I'm seeing:

Feb  8 10:59:59 lsst-launch kernel: possible SYN flooding on port
9618. Sending cookies.

We had been running jobs of about 500 slots or so, and have started
trying to run 1000+ slots simultaneously.  The Collector machine and
the submit machine have both had their file descriptor limits raised
to over 400,000 per process.

Any ideas?

Steve
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685