RE: [Condor-users] Strange schedd crash (exit status 44)



Answered in-line below.

Thanks.

Ian

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of matthew hope
> Sent: November 25, 2004 4:58 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Strange schedd crash (exit status 44)
> 
> Note that rolling back the submitter machine to 6.6.x version 
> appears to cure the problem.
> 
> So if you can handle losing the new features for a while for 
> stability you might want to do that.

This is a 6.7.2 installation and unfortunately we need that
MaxJobRetirementTime feature for Condor to exist in our environment.
Stuck between a rock and a hard place we may be here.
 
> The startd/master/negotiator/collector appear to have no 
> serious issues in 6.7.2

I have to strongly disagree here. We've continually grappled with schedd
and startd crashes in our 6.7.2 installation, and we still have issues
with preening killing off all the processes on our Windows machines. And
there's a static memory leak in the schedd that causes its memory usage
to slowly climb even if it's not currently involved in scheduling. That
being said, all of the issues we're experiencing are limited to Windows.
Our handful of Linux machines are fine.

> It definitely appears to be a load issue, ensure your queue 
> doesn't get too big and that you are not submitting from a 
> machine with active startd's (set num_virtual_machines = 0 
> and reconfig to sort this fast)

Okay, what's "too big"? In our current in-house system we keep 1k+ jobs
queued up from any single user at any time. Is that "too big" for
Condor? That's the kind of queueing capability our users require.

In the first crash case I reported two days ago the user had ~200 jobs
queued up, and in the second crash case it was ~15 jobs. There are only
20 startds in our system now, so there were no more than 20 shadows on
either of these machines at any given time. Both machines have startds,
but the policy is configured to only run jobs after work hours, and these
crashes happen both after and during work hours.

Does this seem like an unreasonable queue size to you? I expect users
would approach 1000+ queued jobs from any Windows machine in our
system.
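
For reference, the suggestion quoted above (no active startd slots on
the submit machine) would amount to something like the following. This
is only a sketch: the file name and location are assumptions and vary by
install, and NUM_VIRTUAL_MACHINES is the setting named in the quoted
message.

```
# Local Condor config on the submit machine
# (hypothetical location; use whatever local config file your install reads)

# Advertise no virtual machines, so the startd on this machine runs no jobs
NUM_VIRTUAL_MACHINES = 0
```

The quoted message says a reconfig (condor_reconfig on that machine) is
enough to apply this; whether the startd also needs a restart is not
something confirmed in this thread.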

> Matt
> 
> On Wed, 24 Nov 2004 17:33:35 -0500, Ian Chesal 
> <ichesal@xxxxxxxxxx> wrote:
> > I checked both submission machines that have experienced this crash 
> > and neither of them had core files present. And nothing that ended in 
> > *.dprintf could be located.
> > 
> > Ian
> > 
> > 
> > 
> > > -----Original Message-----
> > > From: condor-users-bounces@xxxxxxxxxxx 
> > > [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Nick Partner
> > > Sent: November 24, 2004 5:23 PM
> > > To: 'Condor-Users Mail List'
> > > Subject: RE: [Condor-users] Strange schedd crash (exit status 44)
> > >
> > > Hi,
> > >
> > > When the schedd crashed, were there any core files in the log 
> > > directory?  In particular, was there a schedd.dprintf core file there?
> > >
> > > Nick
> > >
> > > -----Original Message-----
> > > From: condor-users-bounces@xxxxxxxxxxx 
> > > [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
> > > Sent: 24 November 2004 20:54
> > > To: Condor-Users Mail List
> > > Subject: RE: [Condor-users] Strange schedd crash (exit status 44)
> > >
> > >
> > > Hmm. Well, we're running on Windows. The driving script is a Perl 
> > > script wrapped as a bat file. It's not that the jobs are dying. 
> > > That doesn't bother me. That's our problem. It's that the shadow 
> > > dies and then takes down the schedd process with it. That shouldn't 
> > > happen.
> > >
> > > Ian
> > >
> > >
> > > _______________________________________________
> > > Condor-users mailing list
> > > Condor-users@xxxxxxxxxxx
> > > http://lists.cs.wisc.edu/mailman/listinfo/condor-users
> > >
> > 