
Re: [Condor-users] Strange schedd crash (exit status 44)



On Thu, 25 Nov 2004 11:13:01 -0500, Ian Chesal <ichesal@xxxxxxxxxx> wrote:
> > [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of matthew hope
> 
> This is a 6.7.2 installation and unfortunately we need that
> MaxJobRetirementTime feature for condor to exist in our environment.
> Stuck between a rock and hard place we may be here.

If you keep the submitters at 6.6 and the startd's (and possibly the
negotiator?) at 6.7.2 then the retirement should work (modulo jobs not
being able to choose to use less retirement time)

caveat: this is working from the docs not from having tried it
 
> > The startd/master/negotiator/collector appear to have no
> > serious issues in 6.7.2
> 
> I have to strongly disagree here. We've continually grappled with schedd
> and startd crashes in our 6.7.2 installation and we still have issues

I didn't say the schedd was stable :)

I've not seen a single issue with the startd's apart from them ignoring
the claim timeout (though this seems to happen when the schedd's die
horrifically)

when coupled with 6.6 schedd's they behave fine, YMMV though...

shadows and schedd's both have serious stability issues

> That
> being said all of the issues we're experiencing are limited to Windows.
> Our handful of Linux machines are fine.

We're Windows only, so I couldn't say whether it is a Windows-only issue...

> > It definitely appears to be a load issue, ensure your queue
> > doesn't get too big

> Okay, what's "too big"? In our current in-house system we keep 1k+ jobs
> queued up from any single user at any time. Is that "too big" for
> Condor? That's the kind of queueing capabilities our users require.

my finger-in-the-air estimate, from looking at the times it happens, is
around 100; whether that counts as 100 jobs or 100 clusters is tricky to
gauge since at the time there were no more than 2 jobs in each cluster
 
> In first crash case I reported two days ago the user had ~200 jobs
> queued up and in the second crash case it was ~15 jobs.

not so hot - is your submit machine under heavy load as well?

> Does this seem like an unreasonable queue size to you? I expect users
> would approach the 1000+ queued jobs from any windows machine in our
> system

As far as I'm concerned, for the market it is aimed at, the thing should
be able to handle queues in the thousands at a minimum (though if they
are single job clusters your negotiator will not like it one bit, but
that's a different issue)...

just my 2 pence though