
Re: [Condor-users] "file descriptors" problem again



Thanks for this. Will report back when I have a
chance to try it.

-ian.

--On 04 September 2006 15:48 -0500 Steven Timm <timm@xxxxxxxx> wrote:

There is a known bug in Condor 6.7.20 and 6.8.0 with non-blocking output.
There are two settings,
NONBLOCKING_COLLECTOR_UPDATE
NEGOTIATOR_USE_NONBLOCKING_STARTD_CONTACT
which, if set to FALSE, will cause the negotiator to crash periodically
because it runs out of file descriptors, but if left at their default of
TRUE will cause the schedd and collector to crash.

The bug is supposed to be fixed in 6.8.1. The Condor team gave me
pre-release builds of the schedd, collector, and negotiator, which fixed
the problem at our site. If you have to pick your poison, set them to
FALSE, since Condor can easily recover from a negotiator crash.
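
As a rough sketch of the workaround described above, the two settings
could be placed in the central manager's local configuration file. The
file name is an assumption and is not stated in this thread; whether a
reconfig is enough or the daemons need a restart should be checked
against the manual for your version.

```
# condor_config.local on the central manager (file location is an assumption)
# Workaround for the non-blocking output bug in 6.7.20 / 6.8.0.
# With FALSE the negotiator may still crash when it runs out of file
# descriptors, but Condor recovers from that more easily than from a
# schedd or collector crash.
NONBLOCKING_COLLECTOR_UPDATE = FALSE
NEGOTIATOR_USE_NONBLOCKING_STARTD_CONTACT = FALSE
```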

Steve


------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx
http://home.fnal.gov/~timm/ Fermilab Computing Div/Core Support Services
Dept./Scientific Computing Section Assistant Group Leader, Farms and
Clustered Systems Group
Lead of Computing Farms Team

On Mon, 4 Sep 2006, Dr Ian C. Smith wrote:

Hi,

I recently upgraded to Condor 6.8.0 on our central manager in order to
fix a problem with Condor. See:

https://lists.cs.wisc.edu/archive/condor-users/2006-August/msg00039.shtml

This solved the problem, but instead I started to see exactly the
same "out of file descriptors" errors as reported in

https://lists.cs.wisc.edu/archive/condor-users/2006-April/msg00191.shtml

The symptoms are the same: after the daily reboot of the Windows
execution hosts, a large number of them sit idle even though there is a
big (20,000) queue of jobs waiting to run. When I went back to 6.6.9 the
problem disappeared.

I'm wondering whether, as has been suggested, the "out of file
descriptors" message is a red herring: the OS is the same (Solaris 8) and
none of the limits have been changed. At most there are around 100
vanilla universe jobs running concurrently. The default limit (ulimit
-n) is 256 (although I understand that this is per process).
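
To see what limit a process actually inherits, a minimal Python sketch
along these lines can report it. This is only an illustration: it shows
the limit for the Python process itself (assumed to be started from the
same environment as the Condor daemons), not for a running daemon, and
the helper name fd_limits is made up for this example.

```python
import resource


def fd_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = fd_limits()
    # The soft limit is what "ulimit -n" reports. It applies to each
    # process separately, so every daemon gets its own allowance.
    print("soft limit:", soft)
    print("hard limit:", hard)
```

Because the limit is per process, a daemon that holds many connections
open at once can hit it even when the number of running jobs is small.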

Any ideas about this? Would a diff(1) of the two code bases show up
anything? I could move Condor-G to another host to get around the first
problem, but I'm more concerned that the Windows central manager is going
to get stuck with an out-of-date version of Condor.

cheers,

-ian.

-----------------------------------
Dr Ian C. Smith,
e-Science team,
University of Liverpool
Computing Services Department

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR
