Re: [Condor-users] Panic: "out of file descriptors"



On Thu March 30 2006 9:25 am, Bob Krzaczek wrote:
> Hello,
Hello,

> I asked about this back in early February, and though we've been
> trying a number of different tactics to solve the problem, so far
> it stays with us.  The Condor schedd panics on many of our systems,
> claiming it is out of file descriptors.  These systems are all
> running Condor 6.7.17 and 6.7.14.

Solaris's libc has a "feature" that can cause Condor grief.  Under Solaris, 
'FILE *' file handles can only operate on file descriptors less than 256.

In Condor, we have some special code which uses a magic fcntl() call to "move" 
the socket FDs to 256 or higher to work around this problem, and it certainly 
does help.  However, I'm speculating that we're not doing the same thing for 
pipes.  Since the schedd creates pipes for talking to its shadows, I'm 
guessing that this is what's causing the stdio calls to run out of low 
descriptors.
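
For illustration only, here is a minimal Python sketch of the "move the 
descriptor out of the low range" idea, applied to both ends of a pipe.  This is 
not Condor's actual code: it assumes the fcntl() command involved is F_DUPFD 
with a minimum descriptor of 256, and the helper names move_fd_high and 
make_high_pipe are made up for this example.  It also assumes the process's 
open-file limit is comfortably above 256.

```python
import fcntl
import os

LOW_FD_LIMIT = 256  # stdio on Solaris can only use descriptors below this


def move_fd_high(fd, min_fd=LOW_FD_LIMIT):
    """Duplicate fd onto the lowest free descriptor >= min_fd, close the original.

    Hypothetical helper; assumes F_DUPFD is the fcntl() command in question.
    """
    if fd >= min_fd:
        return fd  # already outside the low range, nothing to do
    # F_DUPFD returns the lowest unused descriptor that is >= min_fd.
    # Note that it clears the close-on-exec flag on the new descriptor.
    new_fd = fcntl.fcntl(fd, fcntl.F_DUPFD, min_fd)
    os.close(fd)  # free the low slot for stdio to use
    return new_fd


def make_high_pipe():
    """Create a pipe and move both ends out of the low descriptor range."""
    read_fd, write_fd = os.pipe()
    return move_fd_high(read_fd), move_fd_high(write_fd)
```

The point of closing the original descriptor is that the low slot becomes 
available again for fopen() and friends, while the pipe keeps working through 
the new, higher-numbered descriptor.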

From your description, it appears that Sun hasn't fixed this behavior in 
Solaris 10.  :-(

> Can anyone give me some insight into why this is happening to the
> Condor schedd?  At the least, what leads us into this state? (heck,
> I'd be happy to just read source code at this point!)  Here's a
> snip from the log.  If running with other debug flags would help,
> just let me know.  I'm also quite willing to try other things, or
> share other information.  Our users regularly queue hundreds to
> thousands of serial jobs that run here and flock to another pool;
> getting this cleaned up would be wonderful.

Does the above help you to understand what's going wrong?

> 3/30 09:56:03 (pid:2679)  --- End of Daemon object info ---
> 3/30 09:56:03 (pid:2679) Sock::bind - _state is not correct
> 3/30 09:56:03 (pid:2679) SafeSock::connect bind() failed: _state = 0
> 3/30 09:56:03 (pid:2679) Can't connect to startd at <129.21.37.234:56067>
> 3/30 09:56:03 (pid:2679) Match record (<129.21.37.234:56067>, 44651, 0) deleted
> 3/30 09:56:03 (pid:2679) New Daemon obj (startd) name: "NULL", pool: "NULL", addr: "<129.21.37.138:32780>"
> **** PANIC -- OUT OF FILE DESCRIPTORS at line 726 in dprintf.c

I don't understand why the bind() here fails, however.

We need to understand what's going on here.

> Any assistance or ideas would be *so* appreciated!

Well, I can tell you that Linux's stdio is much more sane in this respect.  I 
can also tell you that at least the panic problems are most likely not caused 
by a general "ran out of file descriptors", but, as I said above, instead by 
"ran out of low file descriptors".
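
One rough way to tell the two conditions apart is to count how many of the 
descriptors below 256 are in use in a process.  The Python sketch below shows 
the idea; the helper name count_open_low_fds is hypothetical, and it only 
inspects the process it runs in, so it is a diagnostic illustration rather than 
something that can be pointed at a running schedd.

```python
import os

LOW_FD_LIMIT = 256  # descriptors below this are the ones stdio can use on Solaris


def count_open_low_fds(limit=LOW_FD_LIMIT):
    """Return how many descriptors in the range [0, limit) are currently open."""
    in_use = 0
    for fd in range(limit):
        try:
            os.fstat(fd)  # raises OSError (EBADF) if fd is not open
        except OSError:
            continue
        in_use += 1
    return in_use


if __name__ == "__main__":
    used = count_open_low_fds()
    print(f"{used} of {LOW_FD_LIMIT} low descriptors in use")
```

If that count is at or near 256 while the process is still well under its 
overall descriptor limit, the shortage is in the low range only.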

I'm going to create a problem ticket about this; an informal group of us here 
have talked about several possible solutions, but we're not quite sure what 
course to follow yet.  Sigh.

> Cheers,
> Bob

-Nick

-- 
           <<< The answer is out there, Neo. >>>
 /`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
 \    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
 |_*_|   608-265-5761                    Department of Computer Sciences