Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [HTCondor-users] Condor: our main submit machine is running out of memory (our status page runs condor_q)

Date: Mon, 03 Dec 2012 11:52:16 -0600
From: Dimitri Maziuk <dmaziuk@xxxxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor: our main submit machine is running out of memory (our status page runs condor_q)

On 12/03/2012 10:47 AM, Ian Cottam wrote:
> We are running 7.8.4.
> 
> The below is from a colleague, but basically when we are very busy on our
> main submit node
> (2000-3000 jobs) we see a problem when a condor_q occurs causing
> condor_schedd to fork, which, as it is fairly massive by then can cause us
> to run out of memory.

Not entirely dissimilar, but probably unrelated: sometimes when our
submit node sends jobs out to OSG, something causes condor_shadow to fork
in fork-bomb fashion -- the machine even stops answering pings long
enough for nagios to notice. Adding more memory did not make it go away;
it just made it happen very rarely. I'm unable to reproduce it, of course.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
