
[Condor-users] claimed slots are idle




We are seeing an erratic problem on our cluster and wondered if this
rings any bells with any of you.

Summary
(0) The condor queue satisfactorily submits and runs jobs for hours or days, i.e.: jobs get queued and run to completion, then other jobs take the vacated slots.

(1) Then after some time (hours or days) we start noticing that claimed slots
aren't running jobs, i.e.:
condor_status -claimed shows a load of 0.000 on a bunch of slots (example below)
and no jobs are running on those nodes.
These slots are never released and never show up as un-claimed and never have running jobs.
Initially there will be a mixture of working ("claimed and busy") nodes
and futile ("claimed and idle") nodes, but the situation escalates to the point
that (almost?) all of the slots are claimed and idle,
and the load average on the entire cluster is near zero.
I still need to confirm the following two points:
- No shadow tasks run on Master for the claimed idle slots
- No shadow or user tasks run on the nodes associated with the claimed slots
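
One way to check the two points above is to count Condor processes in a process listing on the master and on each affected node. This is only a minimal sketch: it assumes the shadow and starter processes appear under the command names condor_shadow and condor_starter in `ps -eo pid,comm` output, and the helper names count_condor_procs and local_ps_output are made up for illustration.

```python
import subprocess


def count_condor_procs(ps_output, names=("condor_shadow", "condor_starter")):
    """Count processes by command name in `ps -eo pid,comm` style output.

    The default names are an assumption about how the shadow (submit side)
    and starter (execute side) processes show up in the listing.
    """
    counts = {name: 0 for name in names}
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 1)  # -> [pid, command]
        if len(parts) != 2:
            continue
        command = parts[1].strip()
        if command in counts:
            counts[command] += 1
    return counts


def local_ps_output():
    """Return the process listing of the local machine (hypothetical helper)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,comm"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Running count_condor_procs(local_ps_output()) on the master and on a claimed idle node, and seeing zero for both names, would be consistent with the two points listed above.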

Other Info.
* the nodes don't crash
* [root@vic ~]# condor_version
$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $
$CondorPlatform: X86_64-LINUX_RHEL5 $
* Restarting the condor daemon on the claimed idle node does not fix the problem. To fix the problem we have to stop condor on nodes, stop on master, clean spool directory, start on master, start on nodes.
* All our routing, ping, name resolution, and portscan tests from working and non-working
clients and the master look normal.
* The work dirs are on NFS.
* no abnormal loads on the NFS servers
* file and directory access on the work dirs is not compromised (ls and find run fast).
* Example from condor_status -claimed:
slot2@vic100. LINUX X86_64 0.000 some_user@vic vic.cluster
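
To track how the number of claimed idle slots grows over time, the condor_status -claimed output can be filtered for slots with near-zero load. The following is a rough sketch, not a tested tool: it assumes each slot line has the same column layout as the example line above (slot name, OS, architecture, load average, remote user, client machine), and the function name idle_claimed_slots and the 0.01 threshold are arbitrary choices for illustration.

```python
def idle_claimed_slots(status_text, threshold=0.01):
    """Return (slot, load, user) for claimed slots whose load is below threshold.

    Assumes whitespace-separated columns in the order seen in the example
    line: name, opsys, arch, load, remote user, client machine.
    """
    idle = []
    for line in status_text.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and summary rows: slot names contain '@'.
        if len(fields) < 6 or "@" not in fields[0]:
            continue
        try:
            load = float(fields[3])
        except ValueError:
            continue
        if load < threshold:
            idle.append((fields[0], load, fields[4]))
    return idle


example = "slot2@vic100. LINUX X86_64 0.000 some_user@vic vic.cluster"
for slot, load, user in idle_claimed_slots(example):
    print(slot, load, user)
```

Feeding it the saved output of condor_status -claimed at regular intervals would show when the idle claims start to accumulate.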


Darwin O.V. Alonso
dalonso@xxxxxxxxxxxxxxxx
Dept. Biochem. J558(HSB)
University of Washington
1705 NE Pacific St
Seattle WA 98195-7350