
Re: [Condor-users] Strange claimed problems [SEC=UNCLASSIFIED]



On Thursday, 24 November, 2011 at 10:51 PM, Van de Meulen-Graaf, Zane (Contractor) wrote:

UNCLASSIFIED

Hi All,

I've been having some very strange problems with non-existent jobs claiming processors. Here's what has happened:

1) Submitted a batch of jobs (somewhere around 10000). Soon after, I realized I'd made a mistake with the executable, so I went to remove them with condor_rm. This removed all the jobs, but the ones that were currently running were only marked for removal, and showed up in the queue as such (status "X").
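For reference, the removal was a plain condor_rm against the cluster, something along these lines (the cluster id here is just a placeholder):

    condor_rm 1234
    condor_q

and the jobs that had been running showed up in the condor_q output with status "X".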

2) Fixed the executable and resubmitted the jobs. They were running rather slowly, so I checked condor_status and saw that a lot of the claimed nodes were idle, which often happens anyway for some unknown reason.

3) Let all the resubmitted jobs finish running. This was maybe a day or so later, and the first set of removed jobs were still showing up in the queue. Decided to do a condor_rm -all -forcex to really get rid of them. This worked, and condor_q was then empty.
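The cleanup step, as far as I recall it, was simply:

    condor_rm -all -forcex
    condor_q

with condor_q returning an empty queue afterwards.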

4) Go back to check condor_status. A majority of processors are -still- sitting in Claimed/Idle, even though there are no jobs! If I do a condor_status -claimed, they are all claimed by me, suggesting they're still claimed by the first batch of jobs that should have been removed/deleted.
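Roughly what I'm running to check, in case it helps (the -constraint form is just my guess at a tidier way of listing only the stuck slots; -claimed is what I actually used):

    condor_status -claimed
    condor_status -constraint 'State == "Claimed" && Activity == "Idle"'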

Note that no other jobs have been submitted/deleted other than these during this time.

I was wondering if anyone has seen this odd behaviour before, and if there is any way of fixing it (short of restarting the condor_master).


The dreaded Claimed+Idle state was a lot more prevalent in years gone by. Thankfully it's something we don't have to think about very often any more.

In the past it was generally a symptom of an over-burdened scheduler. A condor_schedd process that couldn't keep up with the job spawn rate plus user-side queries and submissions could end up leaving a bunch of condor_startds in the claimed+idle state.
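A quick, non-authoritative way to sanity-check whether the schedd is struggling is to watch its log on the submit machine while a big remove or submit is in flight (the log path is looked up rather than assumed):

    condor_config_val SCHEDD_LOG
    tail -f $(condor_config_val SCHEDD_LOG)

If the schedd is buried under removals and queries you can usually see it fall behind in there.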

I'll still see this happen, from time to time, on the 7.6.x series but it's really rare. It usually requires something extreme (and I'd say removing 10k jobs in one go is a bit on the extreme side as far as Condor is concerned) tying up the scheduler. In the 7.6.x series I have a sneaking suspicion there's some sort of deadlock condition that's leaving machines claimed+idle, but I can't reproduce it reliably…yet.

Restarting the claimed+idle startds is a sure fix. But before you do that you can try to "kick" them: issue a "condor_reconfig -full -startd" against them. I've found that in most cases this is enough to wake up the startd, have it re-evaluate its state, realize the job isn't ever going to be passed to the starter, and put itself back in the unclaimed+idle state.
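As a rough sketch, assuming one of the stuck execute nodes is named node042 (a placeholder hostname), the kick would look like:

    condor_reconfig -full -startd -name node042

and restarting the startd on that machine is the fallback if the reconfig doesn't shake it loose.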

Regards,
- Ian

---
Ian Chesal

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com
http://twitter.com/cyclecomputing