I have a follow-up to this issue. After additional troubleshooting, I've discovered that the set of unclaimed resources shifts from one machine to another, so I can rule out any ClassAd incompatibility. One thing I have noticed is that the maximum number of Claimed resources seems to be about 85-88. Modifying the MaxJobsRunning variable doesn't help: setting it to 50 limits the number of running jobs to 50, but no matter how far above that I set it, the number of running jobs stays at about 85. Currently I am the only user on the pool, and I am also the administrator. If anyone knows of a log file, command, or utility I can use to identify the problem, I would much appreciate it.
Fellow Condor users,
I have a medium-sized (~150 CPUs) Windows cluster running Condor version 7.6.1. Recently, I have noticed that I cannot utilize all of the resources: a number of the CPUs remain in a persistent "Unclaimed" state. The most relevant log entries I can find are in StartLog:
01/09/12 13:51:02 slot1: Changing state: Owner -> Unclaimed
01/09/12 13:51:02 slot2: State change: received RELEASE_CLAIM command
01/09/12 13:51:02 slot2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
01/09/12 13:51:02 slot2: State change: No preempting claim, returning to owner
01/09/12 13:51:02 slot2: Changing state and activity: Preempting/Vacating -> Owner/Idle
01/09/12 13:51:02 slot2: State change: IS_OWNER is false
01/09/12 13:51:02 slot2: Changing state: Owner -> Unclaimed
This sequence repeats indefinitely for the resource in question. My guess is that the RELEASE_CLAIM is the culprit, but what is the origin of the RELEASE_CLAIM? What's truly mysterious is that it affects only a few CPUs in a multi-core system, while the rest behave normally. I spent a few hours combing the log files and past forum threads, but have not been able to find a solution to this problem. Has anyone encountered this before? Any solutions?
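In case it helps anyone compare notes, here is a minimal Python sketch for tallying how often each slot receives RELEASE_CLAIM and which state transitions it goes through. It assumes every StartLog line follows the same "date time slotN: message" layout as the excerpt above; the function name summarize_startlog and the sample data are just illustrative, not anything Condor provides.

```python
import re
from collections import Counter, defaultdict

# Assumed layout, taken from the excerpt above:
#   "01/09/12 13:51:02 slot2: State change: received RELEASE_CLAIM command"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (slot\w+): (.*)$")
TRANSITION_RE = re.compile(r"^Changing state(?: and activity)?: (\S+) -> (\S+)$")


def summarize_startlog(lines):
    """Return (RELEASE_CLAIM count per slot, list of transitions per slot)."""
    release_counts = Counter()
    transitions = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that are not per-slot messages
        _timestamp, slot, message = match.groups()
        if "received RELEASE_CLAIM command" in message:
            release_counts[slot] += 1
        change = TRANSITION_RE.match(message)
        if change:
            transitions[slot].append((change.group(1), change.group(2)))
    return release_counts, transitions


# The excerpt from this post, used as sample input.
SAMPLE = """\
01/09/12 13:51:02 slot1: Changing state: Owner -> Unclaimed
01/09/12 13:51:02 slot2: State change: received RELEASE_CLAIM command
01/09/12 13:51:02 slot2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
01/09/12 13:51:02 slot2: State change: No preempting claim, returning to owner
01/09/12 13:51:02 slot2: Changing state and activity: Preempting/Vacating -> Owner/Idle
01/09/12 13:51:02 slot2: State change: IS_OWNER is false
01/09/12 13:51:02 slot2: Changing state: Owner -> Unclaimed
""".splitlines()

if __name__ == "__main__":
    counts, moves = summarize_startlog(SAMPLE)
    for slot in sorted(moves):
        print(slot, "RELEASE_CLAIM:", counts[slot], "transitions:", moves[slot])
```

To run it against a real log, replace SAMPLE with the lines read from the StartLog file on the affected machine; slots with a high RELEASE_CLAIM count relative to their neighbours would be the ones to look at first.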