
RE: [Condor-users] Submit host has stopped working on production system



> > I've been trying to make a small config change to
> > our condor submit host and it looks as though it
> > has brought the whole pool down and killed off all
> > of the jobs running on it after it was working fine
> > for months.
> >
> > After the condor_master and condor_schedd
> > were restarted the schedd seems to have gone
> > berserk and is taking > 98 % of the CPU. If I
> > try to run condor_q it just freezes. condor_status
> > is OK.
> 
> It's all working again now (phew !). Looks like
> the submit host was just overloaded - the schedd
> is now taking just about 25 % of the CPU. All of
> the jobs seem to be running again.

I can tell you exactly what happened - I've seen it myself.  It wasn't
your fault, and it's fixed in 6.6.10:

http://www.cs.wisc.edu/condor/manual/v6.6/8_2Stable_Release.html

"Fixed a bug that could cause the file job_queue.log in the Condor SPOOL
directory to grow unnecessarily large, thereby slowing down the startup
and/or shutdown times for the condor_schedd daemon."

Some background...

The schedd keeps persistent storage of job ads in its job_queue.log
file.  It only appends info to this log, so during normal operation it
only grows, never shrinks.  Occasionally, it's supposed to "clean" the
job queue.  That is, it stops all normal processing and goes through the
job_queue.log file tossing out old job ads that have already left the
system. 
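
To make the mechanics concrete, here is a toy sketch in Python.  It is
not Condor's actual on-disk format or code; the record layout and the
names append, replay and clean are made up for illustration.  It only
shows why an append-only log keeps growing until something rewrites it.

```python
# Toy model of an append-only job log. NOT Condor's real format:
# the (op, job_id, attrs) record layout is an assumption for illustration.

def append(log, op, job_id, attrs=None):
    """Record one operation. The log only ever grows here."""
    log.append((op, job_id, attrs))

def replay(log):
    """Rebuild the live job table by reading the whole log front to back.
    This is the work that gets slower as the log gets longer."""
    jobs = {}
    for op, job_id, attrs in log:
        if op == "new":
            jobs[job_id] = dict(attrs or {})
        elif op == "set":
            jobs.setdefault(job_id, {}).update(attrs or {})
        elif op == "destroy":
            jobs.pop(job_id, None)
    return jobs

def clean(log):
    """Rewrite the log so it holds one record per job still in the queue.
    Jobs that have left the system are dropped entirely."""
    return [("new", job_id, attrs) for job_id, attrs in replay(log).items()]
```

If 1000 jobs are submitted and 998 of them finish, the toy log holds
1998 records, while the cleaned log holds 2 and replays to the same
queue.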

Well...this "clean" operation wasn't happening automatically; it was
only occurring upon reconfig and startup of the schedd.  The result
was that if you had a long-running schedd that had processed a lot of
jobs, the job_queue.log file would grow to enormous size, and take a
long time to process upon startup.  Things would look frozen, and
administrators would (rightly) tend to panic.  :-)

Either upgrade to 6.6.10, or run 'condor_reconfig -schedd' frequently.
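
If you can't upgrade right away, something like the following could be
run from cron.  It is only a sketch: the 100 MB threshold and the
function names are my own assumptions, not anything Condor ships, so
pick a limit that suits your pool.  It reads the SPOOL directory with
condor_config_val and runs the reconfig command above only when
job_queue.log has grown past the threshold.

```python
import os
import subprocess

# Hypothetical limit; tune it for your own pool.
THRESHOLD_BYTES = 100 * 1024 * 1024

def spool_dir():
    """Ask Condor where its SPOOL directory is."""
    out = subprocess.run(["condor_config_val", "SPOOL"],
                         check=True, capture_output=True, text=True)
    return out.stdout.strip()

def reconfig_if_large(spool, threshold=THRESHOLD_BYTES, run=subprocess.run):
    """Run 'condor_reconfig -schedd' if job_queue.log is at or over threshold.

    Returns True if the reconfig was triggered, False otherwise.
    'run' can be swapped out so the check can be exercised without Condor.
    """
    path = os.path.join(spool, "job_queue.log")
    if os.path.getsize(path) < threshold:
        return False
    run(["condor_reconfig", "-schedd"], check=True)
    return True
```

A cron entry would then call reconfig_if_large(spool_dir()).  Keeping
the threshold low matters, since each clean still has to read the whole
file.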

Mike Yoder
Principal Member of Technical Staff
Direct : +1.408.321.9000
Fax    : +1.408.321.9030
Mobile : +1.408.497.7597
yoderm@xxxxxxxxxx

Optena Corporation
2860 Zanker Road, Suite 201
San Jose, CA 95134
http://www.optena.com