RE: [Condor-users] Strange schedd crash (exit status 44)

More below.


> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of matthew hope
> Sent: November 25, 2004 11:26 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Strange schedd crash (exit status 44)
> On Thu, 25 Nov 2004 11:13:01 -0500, Ian Chesal 
> <ichesal@xxxxxxxxxx> wrote:
> > > [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of matthew hope
> > 
> > This is a 6.7.2 installation and unfortunately we need that
> > MaxJobRetirementTime feature for Condor to exist in our environment.
> > Stuck between a rock and a hard place we may be here.
> If you keep the submitters at 6.6 and the startd's (and possibly
> negotiator?) at 6.7.2 then the retirement should work (modulo 
> not being able to choose to use less)
> caveat: this is working from the docs not from having tried it

Okay. I'll have to try this out. It seems like complicated surgery to
perform though: more than just the schedd would have to be replaced, no?
Wouldn't condor_q and condor_submit also need to be reverted to 6.6.x?

> > > The startd/master/negotiator/collector appear to have no serious 
> > > issues in 6.7.2
> > 
> > I have to strongly disagree here. We've continually grappled with
> > schedd and startd crashes in our 6.7.2 installation and we still have
> > issues
> I didn't say the schedd was stable :)
> I've not seen a single issue with the startd's apart from
> ignoring claim timeout (though this seems to happen when the
> schedd's die horrifically)
> when coupled with 6.6 schedd's they behave fine, YMMV though...
> shadows and schedd's both have serious stability issues

Agreed. All of our troubles have been with schedd and shadows. To the
condor team then: when can we expect these to stabilize?

> > That being said all of the issues we're experiencing are limited
> > to Windows. Our handful of Linux machines are fine.
> We're Windows only, so I couldn't say if it was a Windows-only issue...
> > > It definitely appears to be a load issue, ensure your queue
> > > doesn't get too big
> > Okay, what's "too big"? In our current in-house system we keep 1k+
> > jobs queued up from any single user at any time. Is that "too big"
> > for Condor? That's the kind of queueing capabilities our users require.
> my finger in the air from looking at the times it happens is
> around 100, whether that counts as 100 jobs or 100 clusters
> is tricky to gauge since at the time there were no more than 2
> jobs in each cluster
> > In the first crash case I reported two days ago the user had ~200 jobs
> > queued up and in the second crash case it was ~15 jobs.
> not so hot - is your submit machine under heavy load as well?

Having talked with both engineers, I would say yes. The machines were
probably running processes unrelated to Condor for the users and it is
likely the load was high. That being said, these are all dual-processor
1 GHz PIII machines with 2 GB or more of RAM. Not state of the art but
certainly not underpowered. The jobs are vanilla jobs that transfer one
small 1k file to the client when they start and then transfer back ~30k
worth of captured STDOUT when the jobs complete.

> > Does this seem like an unreasonable queue size to you? I expect users
> > would approach the 1000+ queued jobs from any windows machine in our
> > system
> As far as I'm concerned the thing should be able to handle 
> queues in the thousands at a minimum (though if they are 
> single job clusters your negotiator will not like it one bit 
> but that's a different issue) for the market it is aimed at...
> just my 2 pence though

I'm with you here: our main batch system is a very simple, in-house
system that can handle queue loads in the tens of thousands of jobs
running on a single-CPU 866 MHz PIII with 256 MB of RAM. Granted, it
hasn't the complexity of Condor, but we're not exploiting all the
interactive capabilities or file transfer capabilities of Condor. The
bulk of our job file transfer is handled by the job scripts we run, not
by Condor, and not to the machine that submitted the job.

> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> http://lists.cs.wisc.edu/mailman/listinfo/condor-users