
Re: [Condor-users] Strange schedd crash (exit status 44)



On Thu, 25 Nov 2004 11:41:22 -0500, Ian Chesal <ichesal@xxxxxxxxxx> wrote:
> > From: On Behalf Of matthew hope
> >
> > If you keep the submitters at 6.6 and the startd's (and possibly
> > negotiator?) at 6.7.2 then the retirement should work (modulo
> > not being able to choose to use less)
> >
> > caveat: this is working from the docs not from having tried it
> 
> Okay. I'll have to try this out. It seems like a complicated surgery to
> perform though: more than just schedd would have to be replaced, no?
> Wouldn't condor_q and condor_submit also need to be reverted to the
> 6.6.x binaries?

I meant keeping the entire submit machine at 6.6, i.e. uninstall 6.7 and
reinstall 6.6.
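
Working from the docs rather than anything I have tested, the retirement
behaviour should be driven from the execute side, so something like this
in the 6.7.2 startds' local config ought to be all that is needed (the
macro name is from my reading of the 6.7 manual; the value is only an
example):

    # local config on the 6.7.2 execute machines - untested sketch
    # let a running job finish out for up to an hour before it is kicked
    MaxJobRetirementTime = 3600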
 
> > shadows and schedd's both have serious stability issues
> 
> Agreed. All of our troubles have been with schedd and shadows. To the
> condor team then: when can we expect these to stabilize?

I understand that the team are looking into it, but the Thanksgiving
holiday in the US is inevitably going to add some delay. In fairness it
is a development release; I think I was too hasty in assuming it would
scale to high loads (but how can I test a production load without using
the production system? :)

Has anyone else got a large pool (>100 execute nodes) with a small
number of submit points keeping the whole thing heavily loaded?

If they have in *nix land but not on Windows, that would strongly
suggest a Windows port issue.

> > not so hot - is your submit machine under heavy load as well?
> 
> Talking with both the engineers, I would say yes. The machines were
> probably running processes unrelated to Condor for the users and it is
> likely the load was high. That being said, these are all dual processor
> 1 GHz PIII machines with 2GB or more of RAM. Not state of the art but
> certainly not under powered. The jobs are vanilla jobs that transfer one
> small 1k file to the client when they start and then transfer back ~30k
> worth of captured STDOUT when the jobs complete.
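For what it is worth, that sort of job is about as light as a vanilla
job gets; the submit description would look roughly like this (file
names invented for illustration):

    # vanilla job that ships one small input file and gets STDOUT back
    universe                = vanilla
    executable              = run_test.bat
    # the ~1k input file sent to the execute machine
    transfer_input_files    = params.txt
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    # ~30k of captured STDOUT comes back here on completion
    output                  = test.out
    log                     = test.log
    queue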

If a user constantly runs condor_q (or someone else runs condor_q
-global), they can seriously affect the schedd.

It is a vicious circle where the user goes
"Why is it so slow? What's going on?"
<runs condor_q>
"That's bad! I will watch this kettle till it boils"
<runs condor_q repeatedly>
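
If they really must watch the queue, one slow poll is far kinder to the
schedd than hitting it by hand every few seconds; even something as
crude as this (just a sketch) helps:

    # poll the local schedd once every five minutes instead of hammering it
    while true; do
        condor_q
        sleep 300
    done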


> I'm with you here: our main batch system is a very simple, in-house
> system that can handle queue loads in the tens of thousands of jobs
> running on a single CPU 866 MHz PIII with 256 MB of RAM. Granted it
> hasn't the complexity of Condor but we're not exploiting all the
> interactive capabilities or file transfer capabilities of Condor. The
> bulk of our job file transfer is handled by the job scripts we run, not
> by Condor, and not to the machine that submitted the job.

The issue is that your batch system does not need to talk to a central
machine just to be told to then talk to the execute machine, nor does
it have to repeatedly stroke the executor to keep it happy.

I think enough people run in a tightly coupled, dedicated environment
to make it worthwhile to make the negotiation process more pluggable,
so we could exploit this (making the negotiator more intelligent and
the startd more stupid, or vice versa)...

That said, I'll go for stability over features every time at the moment!

Matt