Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Ever-increasing userprio.

Date: Mon, 06 Feb 2012 22:32:37 -0500
From: Matthew Farrellee <matt@xxxxxxxxxx>
Subject: Re: [Condor-users] Ever-increasing userprio.

On 02/01/2012 11:59 AM, Amy Bush wrote:

Let me preface this with the obligatory: I just recently took over the
care and feeding of an established condor cluster when its previous
caretaker left. I came in knowing nearly nothing and have managed to
muddle my way through most problems so far. This one has me baffled,
though, and so far searching hasn't turned up anyone reporting a similar
problem, so I come to you guys.

Background:
Yesterday someone reported condor_q failing on one of our submit nodes.
A little investigation showed the scheduler wasn't running on said node,
and was segfaulting/core dumping each time condor was restarted.

After some poking and searching, eventually I followed someone's
brute-force advice and moved the spool job_queue.log out of the way.
After doing that, I was able to start the scheduler again successfully.

You should never have to do this. Please pass along the version of
condor you are running (use condor_version) and the stack trace from
the SchedLog.

MEANwhile a user reported that he had discovered he had 909 jobs that
were in the X state, and he couldn't rm them, and it appeared he
couldn't do that because of the scheduler being down on this submit
node. Once it was back up, I successfully got rid of his X jobs.

Did you use condor_rm -forcex to get rid of them? If so, there may be
some leftover bits not cleaned up on remote machines...

However, this whole thing wreaked havoc on said user's userprio. I
manually set it back down to a lower value, and he seemed happy.

Today he reports that his userprio continues to climb, despite not
running any jobs.

I've confirmed he's not running any jobs (at least according to
'condor_q -g -submitter'), and I've confirmed that his userprio keeps
growing.

I'm not even sure what to look for to solve this.

...So you might check to see if any machines think he's still using
them: condor_status -run
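
The checks mentioned in this thread could be bundled into a small helper.
This is only a rough sketch, not a tested tool: the user name "alice" and
the function names are hypothetical, and the filter is a plain substring
match because no particular output layout for condor_status -run is
assumed here.

```python
import shutil
import subprocess


def build_checks(user):
    """Return the diagnostic commands named in this thread as argv lists."""
    return [
        ["condor_version"],                      # version to report to the list
        ["condor_q", "-g", "-submitter", user],  # jobs the schedds still hold
        ["condor_status", "-run"],               # slots that claim to be busy
    ]


def lines_mentioning(output, user):
    """Keep only the output lines that contain the user name.

    A substring match is deliberately crude: it makes no assumption
    about column layout, so expect to eyeball the result.
    """
    return [line for line in output.splitlines() if user in line]


def run_checks(user):
    """Run each command if it is on PATH; map command string to stdout."""
    results = {}
    for argv in build_checks(user):
        key = " ".join(argv)
        if shutil.which(argv[0]) is None:
            results[key] = None  # condor tools not installed on this host
            continue
        proc = subprocess.run(argv, capture_output=True, text=True)
        results[key] = proc.stdout
    return results
```

Calling run_checks("alice") on a submit node and passing the
"condor_status -run" entry through lines_mentioning would list any slots
that still report the user, which is where leftover claims would show up.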


Best,


matt
