
[Condor-users] Mysterious Unclaimed resources in cluster



Fellow condor users,

 

I have a medium-sized (~150 CPUs) Windows cluster running Condor version 7.6.1.  Recently, I have noticed that I cannot utilize all of the resources: a number of the CPUs remain in a persistent “Unclaimed” state.  The most relevant log entries I can find relating to this are in StartLog:

 

01/09/12 13:51:02 slot1: Changing state: Owner -> Unclaimed
01/09/12 13:51:02 slot2: State change: received RELEASE_CLAIM command
01/09/12 13:51:02 slot2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
01/09/12 13:51:02 slot2: State change: No preempting claim, returning to owner
01/09/12 13:51:02 slot2: Changing state and activity: Preempting/Vacating -> Owner/Idle
01/09/12 13:51:02 slot2: State change: IS_OWNER is false
01/09/12 13:51:02 slot2: Changing state: Owner -> Unclaimed

 

This sequence repeats indefinitely for the resource in question.  My guess is that the RELEASE_CLAIM is the culprit, but what is the origin of the RELEASE_CLAIM?  What’s truly mysterious is that it affects only a few CPUs in a multi-core system, the rest of which behave normally.  I spent a few hours combing the log files and past forum posts, but have not been able to find a suitable solution to this problem.  Has anyone encountered this before?  Any solutions?
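
In case it helps anyone compare notes, here is a minimal Python sketch (my own helper, not part of Condor) for tallying how often each slot receives RELEASE_CLAIM in a StartLog. It assumes every line has the "date time slotN: message" layout shown in the excerpt above; the function name count_release_claims is made up for illustration, and the sample input is just the excerpt itself.

```python
import re
from collections import Counter

# Assumed line layout, taken from the StartLog excerpt in this post:
#   "01/09/12 13:51:02 slot2: State change: received RELEASE_CLAIM command"
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (slot\d+): (.*)$"
)


def count_release_claims(lines):
    """Return a Counter mapping slot name -> number of RELEASE_CLAIM commands."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and "received RELEASE_CLAIM command" in match.group(3):
            counts[match.group(2)] += 1
    return counts


# Sample input: the log excerpt quoted above.
sample = """\
01/09/12 13:51:02 slot1: Changing state: Owner -> Unclaimed
01/09/12 13:51:02 slot2: State change: received RELEASE_CLAIM command
01/09/12 13:51:02 slot2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
01/09/12 13:51:02 slot2: State change: No preempting claim, returning to owner
01/09/12 13:51:02 slot2: Changing state and activity: Preempting/Vacating -> Owner/Idle
01/09/12 13:51:02 slot2: State change: IS_OWNER is false
01/09/12 13:51:02 slot2: Changing state: Owner -> Unclaimed
""".splitlines()

release_counts = count_release_claims(sample)
for slot, n in sorted(release_counts.items()):
    print(slot, n)
```

To run it against a real log, pass an open file object instead of the sample (for example, count_release_claims(open(path)) with path pointing at your own StartLog). A slot with a much higher count than its neighbours would be the one stuck in the loop described above.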

 

Thanks

 

Eric