Re: [Condor-users] fetchwork vs. claim_worklife





On 4/12/11 12:46 PM, Carsten Aulbert wrote:
Hi Dan

On Tuesday 12 April 2011 16:58:52 Dan Bradley wrote:
I am puzzled about why preemption is ineffective in the case where the
work-fetch job has higher rank than the existing claim.  What version of
condor is this?

Version 7.4.4.
But I was not aware that preemption is needed to claim an idle slot.

The logs you posted showed the slot transitioning to Claimed/Idle, not Unclaimed/Idle. Therefore, the work-fetch job must preempt the claim of the schedd that is holding it.

I can't think of any reason why the schedd would hold the claim for an hour after a job completes without starting another job, other than the schedd being very busy. It would be worth looking into exactly what is going on there. One place to start is the shadow log: find the shadow for the last job that ran on the claim before the slot sat in Claimed/Idle for a long time. Did that shadow exit cleanly? In the schedd log, can you see the schedd handling the exit of that shadow? At that point it should immediately launch another job on the claim.
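To narrow down which shadow to look at, it can help to list the periods in which the slot sat in Claimed/Idle. The following is a minimal sketch only. It assumes the state changes have already been extracted by hand into time-ordered (timestamp, state, activity) tuples; that tuple layout, the function name long_idle_claims, and the 10-minute default threshold are all hypothetical and are not a Condor log format or API.

```python
from datetime import datetime, timedelta

def long_idle_claims(transitions, threshold=timedelta(minutes=10)):
    """Return (start, end) intervals where a slot stayed in Claimed/Idle
    for longer than `threshold`.

    `transitions` is a time-ordered list of (timestamp, state, activity)
    tuples. This layout is an assumption made for illustration; it is
    not something Condor emits directly.
    """
    intervals = []
    idle_since = None  # when the current Claimed/Idle period began
    for when, state, activity in transitions:
        is_claimed_idle = (state, activity) == ("Claimed", "Idle")
        if is_claimed_idle and idle_since is None:
            idle_since = when
        elif not is_claimed_idle and idle_since is not None:
            if when - idle_since > threshold:
                intervals.append((idle_since, when))
            idle_since = None
    # A Claimed/Idle period still open at the end of the list is not reported.
    return intervals

# Made-up sample data, only to show the call shape.
sample = [
    (datetime(2011, 4, 12, 9, 0), "Claimed", "Busy"),
    (datetime(2011, 4, 12, 10, 0), "Claimed", "Idle"),
    (datetime(2011, 4, 12, 11, 0), "Claimed", "Busy"),
]
print(long_idle_claims(sample))
```

The start of each reported interval is the time to look for in the shadow and schedd logs.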

I am also curious why claims are sitting in Claimed/Idle for so long
after a job finishes.  Is the schedd severely overloaded?
Not really - as far as I can tell, it is busy as usual, with < ~50% CPU time on a
single node.

The schedd is single-threaded, so the CPU can be far from saturated while the schedd still has performance problems caused by disk I/O or blocking network communication. Is the schedd responsive to condor_q queries?
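One rough way to check responsiveness is to time how long a query takes to return. The sketch below times an arbitrary command; the helper name time_command and the 60-second timeout are assumptions chosen for illustration, and what counts as "slow" depends on the pool.

```python
import subprocess
import time

def time_command(argv, timeout=60.0):
    """Run `argv` and return (elapsed_seconds, return_code).

    return_code is None if the command did not finish within `timeout`.
    Output is discarded because only the latency is of interest here.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        return time.monotonic() - start, result.returncode
    except subprocess.TimeoutExpired:
        return time.monotonic() - start, None

# Example call (requires condor_q on PATH, so it is left commented out):
# elapsed, rc = time_command(["condor_q"])
# print(f"condor_q took {elapsed:.1f}s, exit code {rc}")
```

Running this a few times while the slot is stuck in Claimed/Idle would show whether queries stall even though CPU usage looks moderate.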

--Dan