Re: [Condor-users] Strange schedd crash (exit status 44)

1 general comment:

whenever a condor daemon exits with status 44, it means it failed to
write to its log file.  that's why the log message isn't so
helpful. ;) whenever it does this, the daemon will try to open a file
in your LOG directory, called "dprintf_failure.[DAEMON-NAME]" (for
example, "$(LOG)/dprintf_failure.SCHEDD") and write some info about
what failed into it.  but, if the daemon couldn't write to its log
file, chances are it won't be able to write to the log directory,
either. :(

so, i'm guessing the disk is filling up on your submit machine.  at
least, the partition that the SPOOL + LOG directories are on.  it's
possible that heavy load is making this worse, or there are other
things going on, but that's what status 44 means to me...  however, i
hardly ever touch the windows port, and have never once started up
condor daemons on a windows machine. ;)
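if you want to check that theory, here's a rough sketch of the diagnosis
above (not part of condor, and the LOG/SPOOL paths are just assumptions --
substitute whatever your config actually says):

```python
# not condor code -- just a quick sketch of the status-44 diagnosis.
# the LOG and SPOOL paths below are assumptions; use your own settings.
import glob
import os
import shutil

LOG = "/var/condor/log"      # assumption: your $(LOG) directory
SPOOL = "/var/condor/spool"  # assumption: your $(SPOOL) directory

def percent_full(path):
    """how full is the filesystem holding `path`, as a percentage?"""
    usage = shutil.disk_usage(path)  # (total, used, free) in bytes
    return 100.0 * usage.used / usage.total

for d in (LOG, SPOOL):
    if os.path.isdir(d):
        print(f"{d}: partition is {percent_full(d):.0f}% full")

# a leftover dprintf_failure.* file means a daemon already hit status 44
for f in glob.glob(os.path.join(LOG, "dprintf_failure.*")):
    print("found:", f)
```

if the partition is at (or near) 100%, there's your answer.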

now, on to some more specific questions from this thread...

On Thu, 25 Nov 2004 12:02:42 -0500  "Ian Chesal" wrote:

> I think we need to hear from the Condor team here: what's up with
> Windows? Are you guys aware of these issues?

i'm the wrong person to answer this, but i'll make sure our windows
expert sees this thread and replies.

> > If the user constantly runs condor_q (or someone else runs condor_q
> > -global) they can seriously affect the schedd.
> > 
> > It is a vicious circle where the user goes "Why is it so 
> > slow? What's going on?" 
> > <runs condor_q>
> > "That's bad! I will watch this kettle till it boils"
> > <runs condor_q repeatedly> 

yes, that's evil and wrong.  we're sorry.  the schedd should be
multi-threaded in some way.  it could fork to handle condor_q
requests, for example.  there are a ton of places where it opens a TCP
connection to somewhere (usually a startd) and that connect() can
block until it hits a timeout.  there are other known problems with
schedd scalability, as well.  this is all true of both the *nix and
windows ports.  on the unix side, we're pushing single schedds to
manage ~5000 jobs running simultaneously, and that's with a lot of
fine tuning and tricks that only work in some environments.  we're
planning to address some of these schedd scalability limitations in
the very near term, possibly even having some of them fixed in 6.7.3.
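to make the fork idea concrete, here's a toy sketch (in python, and
definitely not the real schedd) of handing each query off to a child
process, so anything that blocks -- like a connect() waiting out its
timeout -- only stalls that one child, not the whole daemon:

```python
# toy sketch of fork-per-query -- not the real schedd.  the parent forks
# a child for each incoming query; a blocking call in the child (e.g. a
# connect() that hangs until timeout) never stalls the parent's loop.
import os

def service_query(qid):
    """hypothetical stand-in for answering one condor_q connection."""
    print(f"child {os.getpid()} answering query {qid}", flush=True)

def handle_queries(n):
    """fork one child per query; return how many children exited cleanly."""
    pids = []
    for qid in range(n):
        pid = os.fork()
        if pid == 0:        # child: free to block without hurting anyone
            service_query(qid)
            os._exit(0)     # skip normal interpreter shutdown in the child
        pids.append(pid)    # parent: back to its event loop immediately
    ok = 0
    for pid in pids:        # reap the children so they don't go zombie
        _, status = os.waitpid(pid, 0)
        if os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0:
            ok += 1
    return ok

handle_queries(3)           # pretend three condor_q requests arrive
```

of course, forking has its own costs (each child starts with a copy of
the daemon's state, and you pay a fork per query), which is part of why
this isn't a trivial change.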

> > The issue is that the batch system does not need to talk to a 
> > central machine to be told to then talk to the execute 
> > machine, nor bother to repeatedly stroke the executor to keep 
> > it happy.
> >
> > I think that sufficient people run in a tightly coupled and 
> > dedicated environment to make it worthwhile making the 
> > negotiation process more pluggable to allow us to exploit 
> > this (making the negotiator more intelligent but the startd 
> > more stupid or vice versa)...

to some extent, that's what the "MPI" universe already does (and it'll
get much better in the near future with a generic, more usable
"parallel" universe).  but, point well taken.  it's something we've
been arguing about for years. ;) in theory, there's already pluggable
negotiation, in that each schedd does its own decentralized
scheduling.  you're more than welcome to write your own schedd and
have it talk to an existing condor pool. ;) (yeah, right).  in
practice, no one ever does this (for obvious reasons).  so, it's
something we'll have to continue to deal with.

> > that said I'll go for stability over features every time at 
> > the moment!

right.  then use the stable release. ;)

> Hear, hear! Hopefully there'll be an early Christmas present from the
> Condor team in the form of a 6.8.x stable branch. Fingers crossed...

don't hold your breath.  y'all should be happy if we have 6.8.x out by
condor week in march!  i won't even promise that much. ;) that said,
6.7.3 should be out well before the new year, if not also a 6.7.4 or
more.  and, we'll do what we can to speed up the schedd in the 6.7.x
series (along with a bunch of other cool stuff we're adding...).