[HTCondor-users] Machines in state claimed/idle forever



I'm facing a problem with machines remaining in the claimed/idle state
forever. I suspect it has something to do with my condor configuration
and hope that someone has an idea.

I'm using condor in a configuration where there's only one dedicated
scheduler for the whole pool (which also runs negotiator and collector).
All users are supposed to submit their jobs to this central scheduler.
It serves both "normal" jobs (vanilla), which use just one node, and
MPI jobs (parallel), which need to reserve multiple nodes to run.
Preemption is switched off completely:

PREEMPT = false
PREEMPTION_REQUIREMENTS = false
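
For context, the execute nodes should be tied to the central scheduler
roughly as in the manual's dedicated-scheduler example. This is only a
sketch of that part of the startd configuration; the host name is a
placeholder, not my actual machine:

```
# Placeholder host name (assumption); substitute the central submit machine.
DedicatedScheduler = "DedicatedScheduler@central.example.org"
# Advertise the attribute in each slot's ClassAd so parallel jobs can match.
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
```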

Now, when the scheduler is claiming resources for an MPI job, these
resources first go into the "claimed/idle" state until the scheduler
has accumulated enough resources to start the job. If I put the MPI job
on hold before it actually runs, the machines stay in the claimed/idle
state even though there is no longer any job to run. If I then submit a
vanilla job, it won't run on these machines either: it is rejected
because the previous claim remains active. Basically, the machines
remain in the claimed/idle state forever. I can work around the problem
by restarting condor_startd on the claimed machines, which makes them
forget about the claim. Otherwise they stay claimed/idle and are thus
blocked.
Has anyone seen this behavior? Is there any recommendation about how to
configure condor for such a setup?