
[Condor-users] Schedd out of file descriptors



We're running Condor 6.7.14 on mostly SPARC Solaris 9 systems (with
a few Solaris 10 systems running the 2.9 executables), and recently
I've been getting bursts of the errors shown below after some of our
users run a large job.

I'll start to hunt down just what jobs do and do not cause this,
as well as other relevant information that might be needed (we have
some smaller machines that limit how many processes they can
successfully maintain), but in the meantime I thought I'd ask: is
this file descriptor leak a known problem, and was it fixed in
6.7.16?  (I didn't see mention of it in the release notes).  Or
does the message in the log file maybe reflect something other than
what it says?
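
To help tell a real leak from a limit that is simply too low, here is a rough sketch of how the schedd's descriptor usage could be watched over time. It assumes the per-process /proc/<pid>/fd directory is readable (it exists on Solaris and Linux); the helper names are made up for illustration and nothing here is part of Condor. Note that the limit it prints is the one inherited by the script itself, which may differ from the schedd's.

```python
import os
import resource
import sys

def count_open_fds(pid):
    """Count the entries under /proc/<pid>/fd, one per open descriptor."""
    return len(os.listdir(f"/proc/{pid}/fd"))

def fd_headroom(pid, soft_limit):
    """Return how many descriptors remain before pid would hit soft_limit."""
    return soft_limit - count_open_fds(pid)

if __name__ == "__main__":
    # Assumption: the schedd's pid is passed on the command line,
    # e.g. 24017 as in the log excerpt; default to this script's own pid.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    # This is the limit of the calling process, not necessarily the schedd's.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"pid {pid}: {count_open_fds(pid)} open, soft limit here {soft}")
```

Run periodically while a large job is being scheduled; a count that climbs and never drops back would point toward a leak rather than a transient burst.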

I've left some debug output enabled from earlier, when I was debugging
Condor on systems that use ipfilter for local firewalling. I see a
lot of "nulls" below, and I'm wondering whether a configuration issue
is haunting me as well. (That would be odd to learn, because a lot
of our users are having success with this new Condor flock.) But,
obviously, it's the last line below that's the most troubling, and
the reason I'm writing.

*** Last 20 line(s) of file SchedLog:
2/11 03:51:51 (pid:24017) Sock::bind - _state is not correct
2/11 03:51:51 (pid:24017) Couldn't initiate connection to <129.21.37.110:32780>
2/11 03:51:51 (pid:24017) Destroying Daemon object:
2/11 03:51:51 (pid:24017) Type: 4 (startd), Name: (null), Addr: <129.21.37.110:32780>
2/11 03:51:51 (pid:24017) FullHost: (null), Host: (null), Pool: (null), Port: -1
2/11 03:51:51 (pid:24017) IsLocal: N, IdStr: (null), Error: (null)
2/11 03:51:51 (pid:24017)  --- End of Daemon object info ---
2/11 03:51:51 (pid:24017) Sock::bind - _state is not correct
2/11 03:51:51 (pid:24017) SafeSock::connect bind() failed: _state = 0
2/11 03:51:51 (pid:24017) Can't connect to startd at <129.21.37.110:32780>
2/11 03:51:51 (pid:24017) Match record (<129.21.37.110:32780>, 10256, 0) deleted
2/11 03:51:51 (pid:24017) New Daemon obj (startd) name: "NULL", pool: "NULL", addr: "<129.21.37.111:32780>"
2/11 03:51:51 (pid:24017) Sock::bind - _state is not correct
2/11 03:51:51 (pid:24017) Couldn't initiate connection to <129.21.37.111:32780>
2/11 03:51:51 (pid:24017) Destroying Daemon object:
2/11 03:51:51 (pid:24017) Type: 4 (startd), Name: (null), Addr: <129.21.37.111:32780>
2/11 03:51:51 (pid:24017) FullHost: (null), Host: (null), Pool: (null), Port: -1
2/11 03:51:51 (pid:24017) IsLocal: N, IdStr: (null), Error: (null)
2/11 03:51:51 (pid:24017)  --- End of Daemon object info ---
**** PANIC -- OUT OF FILE DESCRIPTORS at line 334 in sock.C
*** End of file SchedLog
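
To see which startds the failed connections cluster around before the panic, a small sketch like the following could tally them from the log. The regular expression is based only on the "Couldn't initiate connection to <addr>" lines in the excerpt above, and the function name and log path are placeholders of mine.

```python
import re
import sys
from collections import Counter

# Matches lines such as:
#   ... Couldn't initiate connection to <129.21.37.110:32780>
FAILED = re.compile(r"Couldn't initiate connection to <([\d.]+:\d+)>")

def failed_connections(lines):
    """Tally failed connection attempts per startd address."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: the path to SchedLog is given as the first argument.
    with open(sys.argv[1]) as log:
        for addr, n in failed_connections(log).most_common():
            print(f"{n:6d}  {addr}")
```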

Thanks for any insights,
Bob

-- 
Bob Krzaczek, Chester F. Carlson Center for Imaging Science, RIT
phone +1-585-4757196, email krz@xxxxxxxxxxx, icbm N43.0859 W77.6776