
Re: [Condor-users] Ever-increasing userprio.

On 02/01/2012 11:59 AM, Amy Bush wrote:
> Let me preface this with the obligatory: I just recently took over the
> care and feeding of an established condor cluster when its previous
> caretaker left. I came in knowing nearly nothing and have managed to
> muddle my way through most problems so far. This one has me baffled,
> though, and so far searching hasn't turned up anyone reporting a similar
> problem, so I come to you guys.

> Yesterday someone reported condor_q failing on one of our submit nodes.
> A little investigation showed the scheduler wasn't running on said node
> and was segfaulting/core dumping each time condor was restarted.
>
> After some poking and searching, I eventually followed someone's
> brute-force advice and moved the spool's job_queue.log out of the way.
> After doing that, I was able to start the scheduler again successfully.

You should never have to do this. Please pass along the version of condor you are running (use condor_version) and the stack trace from the SchedLog.
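A minimal sketch of how those diagnostics could be gathered (paths come from condor_config_val; the exact marker the daemon writes before a stack trace may vary by version):

```shell
# Report the HTCondor version and build string.
condor_version

# Find the log directory, then pull the crash stack trace out of SchedLog.
# HTCondor daemons normally log a stack dump when they catch a fatal signal.
LOGDIR=$(condor_config_val LOG)
grep -A 20 'Stack dump' "$LOGDIR/SchedLog"
```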

> Meanwhile, a user reported that he had discovered he had 909 jobs in
> the X state and couldn't rm them, apparently because of the scheduler
> being down on this submit node. Once it was back up, I successfully
> got rid of his X jobs.

Did you condor_rm -forcex to get rid of them? If so there may be some leftover bits not cleaned up on remote machines...
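For reference, a sketch of the relevant commands (the cluster.proc value is a placeholder): -forcex yanks a job from the queue without the normal cleanup handshake, so claims and sandboxes can linger on execute machines, and condor_preen can tidy orphaned files afterward.

```shell
# Force-remove a job already in the X state, skipping normal cleanup.
condor_rm -forcex <cluster.proc>

# On the affected machine(s), remove orphaned files left in the
# spool/execute directories that no longer belong to any queued job.
condor_preen
```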

> However, this whole thing wreaked havoc on said user's userprio. I
> manually set it back down to a lower value, and he seemed happy.
>
> Today he reports that his userprio continues to climb, despite not
> running any jobs.

> I've confirmed he's not running any jobs (at least according to
> 'condor_q -g -submitter'), and I've confirmed that his userprio keeps
> climbing.
>
> I'm not even sure what to look for to solve this.

...So you might check to see if any machines think he's still using them: condor_status -run
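A sketch of that check, plus the standard condor_userprio options for inspecting and resetting the user's accumulated usage (user@domain is a placeholder; -setprio and -resetusage require administrator privileges):

```shell
# List machines with running jobs; look for the user in the RemoteUser column.
condor_status -run

# Show accumulated usage for all users -- climbing usage with no jobs
# suggests some startd still thinks it holds a claim for him.
condor_userprio -allusers -usage

# If needed, reset the user's real priority and accumulated usage.
condor_userprio -setprio user@domain 0.5
condor_userprio -resetusage user@domain
```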