Re: [Condor-users] Strange schedd crash (exit status 44)

On Sun, 28 Nov 2004 23:55:21 -0600, Derek Wright <wright@xxxxxxxxxxx> wrote:
> 1 general comment:
> whenever a condor daemon exits with status 44, it means it failed to
> write to its log file.

Right ho, a useful piece of info, thanks.

> so, i'm guessing the disk is filling up on your submit machine.  at
> least, the partition that the SPOOL + LOG directories are on. 

12 GB free :) though there may be some other nasty aspect of Windows causing it.
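
For anyone wanting to rule out the disk theory quickly, here is a minimal Python sketch. It assumes you pass in whatever directories your LOG and SPOOL settings point at; the names `report_free_space` and `_can_write` and the 1 GB threshold are made up for illustration and are not part of Condor. It checks free space and also tries a small write, since a failed log write can have causes other than a full disk.

```python
import shutil
import tempfile


def _can_write(directory):
    """Try to create and write a small temporary file in the directory."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(b"probe")
            handle.flush()
        return True
    except OSError:
        return False


def report_free_space(directories, min_free_bytes=1024 ** 3):
    """Return {directory: (free_bytes, writable, ok)} for each directory.

    `ok` is True only when the directory is writable and has at least
    `min_free_bytes` free (threshold is an arbitrary assumption).
    Raises OSError if a directory does not exist.
    """
    results = {}
    for directory in directories:
        free = shutil.disk_usage(directory).free
        writable = _can_write(directory)
        results[directory] = (free, writable, writable and free >= min_free_bytes)
    return results
```

Call it with the two directory paths from your own configuration and look for any entry where the last field is False.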

Assuming Windows daemons, and jobs with a job log plus input, output and
error files, the total number of open files would be?

master = 1 (master log)
schedd = 3 (scheddlog, history, job_queue)

per job (shadow):
in, out, error = 3? (guessing not, unless you are streaming)
ShadowLog = 1? One file but many writers: does the schedd handle this for
them, or do they gain a lock each time?
job log = 1 (or is it only opened as required?)

Well under any Windows limits, I should think. I will have a think about
anything else...
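
The tally above can be turned into a quick back-of-the-envelope calculation. This is only a sketch of the guesses in this message (1 file for the master, 3 for the schedd, and per running job one ShadowLog handle, one job log handle, plus 3 more if in/out/error are streamed). Whether Condor really holds all of these open at once is exactly the open question, and `estimate_open_files` is a made-up name.

```python
MASTER_FILES = 1  # master log (guess from this thread)
SCHEDD_FILES = 3  # schedd log, history, job_queue (guess from this thread)


def estimate_open_files(running_jobs, streaming=False):
    """Rough upper bound on open files, using the guesses in this thread."""
    per_job = 2  # ShadowLog + job log, assuming both are held open
    if streaming:
        per_job += 3  # in, out, error, assumed open only when streaming
    return MASTER_FILES + SCHEDD_FILES + running_jobs * per_job
```

For example, 100 running jobs comes to 204 files without streaming and 504 with it.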
> > > <runs condor_q repeatedly>
> yes, that's evil and wrong.  we're sorry.  the schedd should be
> multi-threaded in some way.  

That's what dev releases are for :¬)
> to some extent, that's what the "MPI" universe already does (and it'll
> get much better in the near future with a generic, more usable
> "parallel" universe).  but, point well taken.  it's something we've
> been arguing about for years. ;) in theory, there's already pluggable
> negotiation, in that each schedd does its own decentralized
> scheduling.  you're more than welcome to write your own schedd and
> have it talk to an existing condor pool. ;) (yeah, right).

Pop the source and over-the-wire specs out and I'd consider it.
Really. I'm sure I could convince my bosses to allow folding back into
the main branch (if you could handle my code :¬)

> > > that said I'll go for stability over features every time at
> > > the moment!
> right.  then use the stable release. ;)

Fair point; I have rolled back the submitters' machines already.