
Re: [Condor-users] Out of file descriptors problem

On Thu, 2006-04-20 at 14:21 -0500, Todd Tannenbaum wrote:
> As for 16k descriptors, with over 1000 jobs simultaneously running 
> (and thus over 1000 shadow processes), that only leaves 16 
> descriptors per shadow
No, I don't think that's the problem. The problem is that during the
evening Condor is happy to use most of the 1,300 nodes with no problems.
At midnight all (I think) of the Windows nodes are rebooted (don't ask;
it's to do with patching Windows). But what we have been finding for the
past few days is that in the morning Condor is running (on the server
and nodes) but that only about 20 or so nodes are being used, whereas
there are over a thousand available. Previously (nearly) all the nodes
would have been used as expected.

The only indication of a problem we can find is the odd file dumped in
the log directory stating that it is out of file descriptors. However,
the file's timestamp is midday, nowhere near midnight, so I'm not even
convinced that it has anything to do with the problem.

The same has happened today - we currently have just 16 nodes being used
out of just under 1,000. Nothing has been logged about file descriptors
though, so I suspect that may be a red herring.
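
For reference, the arithmetic behind the quoted "16 descriptors per
shadow" estimate can be checked with a short sketch. It assumes "16k"
means 16 * 1024 descriptors shared evenly among the shadow processes; the
function name is hypothetical and is not part of Condor.

```python
def descriptors_per_process(total_descriptors, process_count):
    """Return the whole number of descriptors each process gets
    if the total is shared evenly (an illustrative assumption)."""
    if process_count <= 0:
        raise ValueError("process_count must be positive")
    return total_descriptors // process_count


if __name__ == "__main__":
    # Assumption: "16k" is read as 16 * 1024 = 16384 descriptors.
    total = 16 * 1024
    for shadows in (1000, 1300):
        share = descriptors_per_process(total, shadows)
        print(f"{shadows} shadows: {share} descriptors each")
```

This only restates the estimate; it says nothing about how many
descriptors a shadow actually holds open at any given time.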


John Horne, University of Plymouth, UK  Tel: +44 (0)1752 233914
E-mail: John.Horne@xxxxxxxxxxxxxx       Fax: +44 (0)1752 233839