[Condor-users] Panic: "out of file descriptors"


I asked about this back in early February, and though we've been
trying a number of different tactics to solve it, the problem is
still with us.  The Condor schedd panics on many of our systems,
claiming it is out of file descriptors.  These systems are all
running Condor 6.7.14 and 6.7.17.

At first, we suspected it was a Solaris 10 issue.  We've integrated
Condor into our Jumpstart environment, and written Solaris 10 SMF
service descriptions for it.  Nothing seemed amiss, but we also ran
into some MPI universe issues, so we downgraded our cluster to
Solaris 9 (it's trivial to flip our systems back and forth for
testing).  That didn't help.

We thought that maybe there was a part of Condor that wasn't
respecting the [LOWPORT,HIGHPORT] range, so we killed the local
firewall on these systems.  That didn't make a difference.

Then we thought that maybe "out of file descriptors" might reflect
being unable to obtain a socket, so we widened the port range between
LOWPORT and HIGHPORT.  That didn't help.  Eventually, we undefined
LOWPORT and HIGHPORT, and Condor can now grab any port number it
wants.  That still didn't help.
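
For reference, the port-range experiments amounted to configuration
along these lines (the values here are illustrative, not our exact
settings):

```
# condor_config excerpt -- illustrative values only
LOWPORT  = 9600
HIGHPORT = 9999
```

We widened that range, and eventually commented both knobs out and
ran condor_reconfig so the daemons could bind any ephemeral port.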

As a shot in the dark (perhaps file descriptors are being inherited
across fork() by child processes), we've even changed things like
the maximum number of processes per user (now at 1024) and limited
the maximum number of Condor shadow processes (now down to 500 on
a system with 2GB of RAM).  None of these changes seems to have
helped.
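
For anyone wanting to reproduce the measurement, here's a minimal
sketch of how one can watch descriptor usage; the condor_schedd
pgrep line is illustrative, and the snippet defaults to the current
shell so it runs anywhere with a /proc filesystem (Solaris or Linux):

```shell
# Sketch: compare a process's open descriptors against the soft limit.
# Point PID at the schedd, e.g.  PID=$(pgrep -x condor_schedd)
PID=${PID:-$$}
echo "soft fd limit: $(ulimit -n)"
echo "open fds for $PID: $(ls /proc/$PID/fd | wc -l)"
# On Solaris, "pfiles $PID" lists each descriptor in detail.
```

Watching that count climb toward the limit while the queue fills
would at least confirm it's a genuine descriptor leak rather than a
misleading error string.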

Can anyone give me some insight into why this is happening to the
Condor schedd?  At the least, what leads us into this state? (heck,
I'd be happy to just read source code at this point!)  Here's a
snip from the log.  If running with other debug flags would help,
just let me know.  I'm also quite willing to try other things, or
share other information.  Our users regularly queue hundreds to
thousands of serial jobs that run here and flock to another pool;
getting this cleaned up would be wonderful.

3/30 09:56:03 (pid:2679)  --- End of Daemon object info ---
3/30 09:56:03 (pid:2679) Sock::bind - _state is not correct
3/30 09:56:03 (pid:2679) SafeSock::connect bind() failed: _state = 0
3/30 09:56:03 (pid:2679) Can't connect to startd at <>
3/30 09:56:03 (pid:2679) Match record (<>, 44651, 0) deleted
3/30 09:56:03 (pid:2679) New Daemon obj (startd) name: "NULL", pool: "NULL", addr: "<>"
**** PANIC -- OUT OF FILE DESCRIPTORS at line 726 in dprintf.c

We're also seeing some oddities with MPI universe jobs (I realize
this is going away in favor of the new Parallel universe).  So far,
the behavior changes depending on whether the dedicated scheduler
is a Solaris 10 or a Solaris 9 machine; that's surprising, and for
now I suspect our own configuration.  Frankly, I've
backburnered that problem until I get this schedd issue under
control, and I'm only mentioning it now in case there's an immediate
recognition of Solaris 10 issues under Condor.

Any assistance or ideas would be *so* appreciated!


Bob Krzaczek, Chester F. Carlson Center for Imaging Science, RIT
phone +1-585-4757196, email krz@xxxxxxxxxxx, icbm N43.0859 W77.6776