
[Condor-users] Ever-increasing userprio.

Let me preface this with the obligatory: I just recently took over the
care and feeding of an established condor cluster when its previous
caretaker left. I came in knowing nearly nothing and have managed to
muddle my way through most problems so far. This one has me baffled,
though, and so far searching hasn't turned up anyone reporting a similar
problem, so I come to you guys.

Yesterday someone reported condor_q failing on one of our submit nodes.
A little investigation showed the scheduler wasn't running on said node,
and that it segfaulted and dumped core each time condor was restarted.

After some poking and searching, eventually I followed someone's
brute-force advice and moved the spool job_queue.log out of the way.
After doing that, I was able to start the scheduler again successfully.
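For anyone who finds this thread later, the brute-force recovery amounted to roughly the following — a sketch, not an exact transcript; the spool path varies per install, so I'm pulling it from the config:

```shell
# Stop the schedd so nothing is holding the job queue open
condor_off -schedd

# Find this host's spool directory (location varies per install)
SPOOL=$(condor_config_val SPOOL)

# Move the (presumably corrupt) job queue log aside rather than deleting it
mv "$SPOOL/job_queue.log" "$SPOOL/job_queue.log.bad.$(date +%Y%m%d)"

# Bring the schedd back up; it starts with a fresh job queue
condor_on -schedd
```

Note this throws away the queued jobs on that schedd, which is why it's the brute-force option.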

Meanwhile, a user reported that he had discovered 909 jobs of his stuck
in the X (removed) state, and that he couldn't condor_rm them —
apparently because the scheduler was down on this submit node. Once it
was back up, I successfully got rid of his X jobs.
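For reference, forcibly clearing the X-state jobs was along these lines (owner name hypothetical; X corresponds to JobStatus 3):

```shell
# -forcex forces removal of jobs already stuck in the removed (X) state
# without waiting for normal cleanup to complete
condor_rm -forcex -constraint 'JobStatus == 3 && Owner == "someuser"'
```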

However, this whole thing wreaked havoc on said user's userprio. I
manually set it back down to a lower value, and he seemed happy.
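The manual reset was just condor_userprio's -setprio; something like this, with the user and value made up for illustration:

```shell
# Reset the user's effective priority back toward the baseline
# (0.5 is the best/lowest possible value in condor's accounting)
condor_userprio -setprio someuser@our.domain 0.5
```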

Today he reports that his userprio continues to climb, despite not
running any jobs. 

I've confirmed he's not running any jobs (at least according to
'condor_q -g -submitter'), and I've confirmed that his userprio keeps
climbing anyway.

I'm not even sure what to look for to solve this.
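In case it helps, the checks I've been repeating look roughly like this (submitter name hypothetical):

```shell
# Confirm no jobs anywhere in the pool for this submitter
condor_q -global -submitter someuser

# Watch effective priorities and accumulated usage over time
condor_userprio -all -usage
```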

Hoping for something obvious that just grants credence to my claims of
ignorance, and will take any suggestions anyone might have.