Hi Stephen,

To understand why the schedd cannot connect to a target daemon that is no longer registered with CCB, it may help to look at the target daemon's log file, if you can locate it. If the daemon is still running while it is not registered with CCB, you should see a log message saying that it became disconnected from CCB, followed by periodic attempts to reconnect. The disconnect message may help explain why this is happening. If, on the other hand, the daemon is not alive, then we need to understand why; the log file may help with that too.

Regarding the exhaustion of file descriptors: if condor is started as root (the default for an RPM installation), the best way to configure the maximum number of file descriptors available to the collector is to use something like the following setting in the HTCondor configuration file:

COLLECTOR_MAX_FILE_DESCRIPTORS = 10000

When the collector starts up, you will see a line in the log file that looks like this:

"Setting maximum file descriptors to 10000."

If condor is started as root, it can set its limit higher than the default hard limit. If it is not started as root, then it can only lower the hard limit, so the most it can get is whatever hard limit it inherited. I recommend using this configuration setting rather than trying to set the per-process default, because some mechanisms for setting the per-process default (e.g. PAM settings) are not necessarily applied to condor processes, and, in any case, the consequences of having a huge file descriptor limit for all processes may not be good. For example, many processes use more memory when the file descriptor limit is high. For a process such as the condor_shadow, this may add up to a lot of memory, since there may be many instances of the shadow process.
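
To make the root versus non-root rule concrete, here is a rough Python sketch of the general Unix rlimit behavior. It is not HTCondor code: achievable_limit is a made-up helper for illustration, and the 10000 is just the value from the log line above. It only inspects the current limits and does not change them.

```python
import os
import resource


def achievable_limit(requested, hard, is_root):
    """Return the descriptor limit a process could end up with.

    Illustrative helper only (an assumption, not part of HTCondor).
    A privileged process may raise its hard limit, so it can get what
    it asks for (still subject to kernel-wide ceilings, which this
    sketch ignores). An unprivileged process is capped by the hard
    limit it inherited.
    """
    if is_root or hard == resource.RLIM_INFINITY:
        return requested
    return min(requested, hard)


if __name__ == "__main__":
    # Read the limits this process inherited from its parent.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    is_root = os.geteuid() == 0
    target = achievable_limit(10000, hard, is_root)
    print("soft=%s hard=%s root=%s -> achievable=%s"
          % (soft, hard, is_root, target))
```

Running this as the same user that starts the daemons should show whether a request for 10000 descriptors could be honored or would be capped by the inherited hard limit.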


On 2/8/13 1:46 PM, Stephen Pietrowicz wrote:

I'm seeing the following message a significant number of times in some of the larger runs we've started to do:

02/08/13 12:22:26 CCB: rejecting request from SCHEDD <www.xxx.yyy.zzz:50190> on <www.xxx.yyy.zzz:40460> for ccbid 6987 because no daemon is currently registered with that id (perhaps it recently disconnected).

Eventually, we get:

**** PANIC -- OUT OF FILE DESCRIPTORS at line 175 in /slots/01/dir_65060/userdir/src/condor_io/reli_sock.cpp

And in /var/log/messages, I'm seeing:

Feb  8 10:59:59 lsst-launch kernel: possible SYN flooding on port 9618. Sending cookies.

We had been running jobs of about 500 slots or so, and have started to try and run at 1000+ slots simultaneously.  The Collector machine and the submit machine both have up-ed the number of file descriptors to over 400,000 per process.

Any ideas?

