
Re: [HTCondor-users] /var/lib/condor/spool usage



On 2015-04-02 09:08, Richard Pieri wrote:
> On 4/1/15 8:36 PM, Dimitri Maziuk wrote:
>> Of course the other interesting question is why this submit node ran
>> just fine for a couple of years and this afternoon decided to write
>> ~100GB of spool all of a sudden. I temporarily shut down all condor
>
> IME, problems like this are typically user error like accidentally
> mailing a 600MB CD image to 150 people at a site.

Or a particular input sent the application into a tailspin, yes. Only in this case there's just a bunch of spool directories whose total went from 40GB to 100GB in about 15 minutes, with no obvious outliers. Nothing stands out.
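
For anyone who wants to run the same "any outliers?" check on their own submit node, here is a rough sketch in Python. It assumes the spool directory (/var/lib/condor/spool here) holds one subdirectory per job or cluster. The function names and the "10 times the median" threshold are made up for illustration and are not part of HTCondor.

```python
import os
import statistics
from pathlib import Path


def dir_size(path):
    """Total bytes of the files under path (symlinked directories are not descended into)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # Files can disappear while jobs are running; skip them.
                pass
    return total


def spool_usage(spool):
    """Return [(bytes, name), ...] for each immediate subdirectory, largest first."""
    sizes = [(dir_size(p), p.name) for p in Path(spool).iterdir() if p.is_dir()]
    return sorted(sizes, reverse=True)


def outliers(usage, factor=10):
    """Entries larger than `factor` times the median size (arbitrary threshold)."""
    if not usage:
        return []
    median = statistics.median(size for size, _name in usage)
    return [(size, name) for size, name in usage if size > factor * median]


def report(spool="/var/lib/condor/spool", top=20):
    """Print the `top` largest subdirectories, then anything flagged as an outlier."""
    usage = spool_usage(spool)
    for size, name in usage[:top]:
        print(f"{size / 2**30:8.2f} GiB  {name}")
    flagged = outliers(usage)
    print("outliers:", [name for _size, name in flagged] or "none")
```

Calling report() as a user that can read the spool prints the largest directories first. An empty outlier list is consistent with growth spread evenly across many jobs rather than one runaway job.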

In practice I have no control over what users submit, and I have limited time and effort to spend reading the logs. So I'll probably declare it a one-off freak occurrence, condor_rm all the jobs, and let the users keep the pieces. If this keeps happening I'll keep reducing MAX_JOBS_RUNNING until it stops. That way we won't be using the available HPC resources to the full, but we'll keep running, as opposed to being dead in the water.

Dimitri