
Re: [Condor-users] Mysterious Unclaimed resources in cluster



I have a follow-up to this issue.  After additional troubleshooting, I’ve discovered that the unclaimed resources move from one machine to another, so I can rule out any ClassAd incompatibility.  One thing I have noticed is that the maximum number of Claimed resources seems to be about 85-88.  Modifying the MaxJobsRunning variable doesn’t help: setting it to 50 limits the number of running jobs to 50, but no matter how high I set it, the number of running jobs still tops out at about 85.  Currently I am the only user on the pool, and I am also the administrator.  If anyone can suggest a log file, command, or utility I can use to identify the problem, I would much appreciate it.

 

Thanks,

Eric

 

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Eric Abel
Sent: Monday, January 09, 2012 1:56 PM
To: Condor-Users Mail List
Subject: [Condor-users] Mysterious Unclaimed resources in cluster

 

Fellow condor users,

 

I have a medium-sized (~150 CPUs) Windows cluster running Condor version 7.6.1.  Recently, I have noticed that I cannot utilize all of the resources.  A number of the CPUs remain in a persistent “unclaimed” state.  The most relevant log entry I can find relating to this is in StartLog:

 

01/09/12 13:51:02 slot1: Changing state: Owner -> Unclaimed

01/09/12 13:51:02 slot2: State change: received RELEASE_CLAIM command

01/09/12 13:51:02 slot2: Changing state and activity: Claimed/Idle -> Preempting/Vacating

01/09/12 13:51:02 slot2: State change: No preempting claim, returning to owner

01/09/12 13:51:02 slot2: Changing state and activity: Preempting/Vacating -> Owner/Idle

01/09/12 13:51:02 slot2: State change: IS_OWNER is false

01/09/12 13:51:02 slot2: Changing state: Owner -> Unclaimed

 

This sequence repeats indefinitely for the resource in question.  My guess is that the RELEASE_CLAIM is the culprit, but what is the origin of the RELEASE_CLAIM?  What’s truly mysterious is that it affects only a few CPUs in a multi-core system, while the rest behave normally.  I spent a few hours combing the log files and past forum posts, but have not been able to find a suitable solution to this problem.  Has anyone encountered this before?  Any solutions?
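
One way to see which slots are caught in this loop might be to tally the RELEASE_CLAIM lines per slot across a StartLog.  The following is only a rough Python sketch, assuming every line follows the "date time slotN: message" layout of the excerpt above; the function name count_release_claims and the embedded sample are illustrative and not part of Condor.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpt above, for example:
# 01/09/12 13:51:02 slot2: State change: received RELEASE_CLAIM command
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<slot>slot\w+): (?P<message>.*)$"
)


def count_release_claims(lines):
    """Return a Counter mapping slot name to its number of RELEASE_CLAIM lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        # Lines that do not fit the assumed layout are skipped silently.
        if match and "RELEASE_CLAIM" in match.group("message"):
            counts[match.group("slot")] += 1
    return counts


# Sample input copied from the StartLog excerpt in this message.
SAMPLE = """\
01/09/12 13:51:02 slot1: Changing state: Owner -> Unclaimed
01/09/12 13:51:02 slot2: State change: received RELEASE_CLAIM command
01/09/12 13:51:02 slot2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
01/09/12 13:51:02 slot2: State change: No preempting claim, returning to owner
01/09/12 13:51:02 slot2: Changing state and activity: Preempting/Vacating -> Owner/Idle
01/09/12 13:51:02 slot2: State change: IS_OWNER is false
01/09/12 13:51:02 slot2: Changing state: Owner -> Unclaimed
""".splitlines()

if __name__ == "__main__":
    for slot, n in sorted(count_release_claims(SAMPLE).items()):
        print(f"{slot}: {n} RELEASE_CLAIM line(s)")
```

To run it against a real log, pass an open file object instead of SAMPLE; the location of StartLog depends on the local configuration, so no path is assumed here.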

 

Thanks

 

Eric