
RE: [Condor-users] Strange schedd crash (exit status 44)



> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of matthew hope
> Sent: November 25, 2004 11:56 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Strange schedd crash (exit status 44)
> 
> On Thu, 25 Nov 2004 11:41:22 -0500, Ian Chesal 
> <ichesal@xxxxxxxxxx> wrote:
> > > From: On Behalf Of matthew hope
> > >
> > > If you keep the submitters at 6.6 and the startd's (and possibly
> > > negotiator?) at 6.7.2 then the retirement should work (modulo not 
> > > being able to choose to use less)
> > >
> > > caveat: this is working from the docs not from having tried it
> > 
> > Okay. I'll have to try this out. It seems like a complicated surgery
> > to perform though: more than just schedd would have to be replaced, no?
> > Wouldn't condor_q and condor_submit also need to be reverted to the
> > 6.6.x binaries?
> 
> I meant keep the entire submit machine at 6.6, i.e. uninstall / reinstall
>  
> > > shadows and schedd's both have serious stability issues
> > 
> > Agreed. All of our troubles have been with schedd and shadows. To the
> > condor team then: when can we expect these to stabilize?
> 
> I understand that the team are looking into it but that the 
> Thanksgiving holiday in the US is inevitably going to add some delay.
> In fairness it is a dev release, I am thinking I was too 
> hasty in assuming it would scale to high loads (but how can I 
> test prod load without using the prod system :).

Ahh. Forgot about those darn US holidays.

> Has anyone else got a high load (>100 execute nodes) with a 
> small number of submit points keeping the whole thing highly loaded?
> 
> If they have on *nix land but not windows this would indicate 
> strongly that it was a windows port issue.

I have successfully submitted 200+ jobs from my Linux machine that were
targeting our Windows startd machines and the schedd and shadows ran
without any problems. But there was limited concurrency because I too am
not willing to roll this out to our production machines to test a
production load.

I think we need to hear from the Condor team here: what's up with
Windows? Are you guys aware of these issues?

> > > not so hot - is your submit machine under heavy load as well?
> > 
> > Talking with both the engineers I would say, yes. The machines were
> > probably running processes unrelated to Condor for the users and it is
> > likely the load was high. That being said, these are all dual processor
> > 1 GHz PIII machines with 2GB or more of RAM. Not state of the art but
> > certainly not under powered. The jobs are vanilla jobs that transfer
> > one small 1k file to the client when they start and then transfer back
> > ~30k worth of captured STDOUT when the jobs complete.
> 
> If the user constantly runs condor_q (or someone else runs condor_q
> -global) they can seriously affect the schedd.
> 
> It is a vicious circle where the user goes "Why is it so
> slow? What's going on?"
> <runs condor_q>
> "That's bad! I will watch this kettle till it boils"
> <runs condor_q repeatedly> 

Agreed, but this is not likely the cause in our system. In the case of
the first crash it was well before anyone was in the office.
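
For pools where repeated condor_q polling is a factor, one way to break
the circle is to cache the output so the schedd is queried at most once
per interval. The following is only a minimal sketch: the ThrottledQuery
class, the run_condor_q helper and the 60 second interval are
hypothetical names and values chosen for illustration, not part of
Condor, and it assumes a plain condor_q is on the PATH.

```python
import subprocess
import time


class ThrottledQuery:
    """Cache a command's output so it runs at most once per interval."""

    def __init__(self, run, min_interval=60.0, clock=time.monotonic):
        self._run = run                    # callable that returns the output
        self._min_interval = min_interval  # seconds between real queries
        self._clock = clock                # injectable clock, eases testing
        self._last_time = None
        self._last_output = None

    def get(self):
        now = self._clock()
        stale = (self._last_time is None
                 or now - self._last_time >= self._min_interval)
        if stale:
            # Only here do we actually bother the schedd.
            self._last_output = self._run()
            self._last_time = now
        return self._last_output


def run_condor_q():
    # Assumption: condor_q with no arguments, found on PATH.
    result = subprocess.run(["condor_q"], capture_output=True, text=True)
    return result.stdout
```

A monitoring script would build one ThrottledQuery(run_condor_q) and
call get() as often as it likes; only the first call in each interval
reaches the schedd.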

> > I'm with you here: our main batch system is a very simple, in-house
> > system that can handle queue loads in the tens of thousands of jobs
> > running on a single CPU 866 MHz PIII with 256 MB of RAM. Granted it
> > hasn't the complexity of Condor but we're not exploiting all the
> > interactive capabilities or file transfer capabilities of Condor. The
> > bulk of our job file transfer is handled by the job scripts we run,
> > not by Condor, and not to the machine that submitted the job.
> 
> The issue is that the batch system does not need to talk to a
> central machine to be told to then talk to the execute
> machine, nor bother to repeatedly stroke the executor to keep
> it happy.
> 
> I think that sufficient people run in a tightly coupled and 
> dedicated environment to make it worthwhile making the 
> negotiation process more pluggable to allow us to exploit 
> this (making the negotiator more intelligent but the startd 
> more stupid or vice versa)...
> 
> that said I'll go for stability over features every time at 
> the moment!

Hear, hear! Hopefully there'll be an early Christmas present from the
Condor team in the form of a 6.8.x stable branch. Fingers crossed...

Ian