
[Condor-users] Collector's view of the world really out of sync with reality



I'm trying to debug a strange problem. We have a preemption policy that
allows jobs tagged as being from a particular group to preempt any
other job on a machine. It works well at one site, but the same policy
implemented at another site is not functioning. The preemption is not
occurring.

I have a sneaking suspicion preemption is not occurring because the
startd stats at the collector are really out of date. For the machine
that should have its job preempted, I'm seeing:

[root@sj-negotiator log]# condor_status sj-bs3066-249

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

vm1@sj-bs3066 WINNT51     INTEL  Claimed    Busy       0.000  1289  0+00:21:06
vm2@sj-bs3066 WINNT51     INTEL  Claimed    Busy       0.010   757  0+00:22:32

[root@sj-negotiator log]# condor_status -direct  sj-bs3066-249

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

vm1@sj-bs3066 WINNT51     INTEL  Claimed    Busy       0.000  1289  0+00:06:02
vm2@sj-bs3066 WINNT51     INTEL  Claimed    Busy       0.000   757  0+00:07:27


I've never encountered such different information between the startd and
the collector before. I tried restarting the collector so the startd
table would get flushed and rebuilt. But when the machine data was
available in the collector again, it was still out of date.

Since our preemption policy references the job execution time I'm pretty
sure this is what's keeping jobs locked to this machine. The job
execution time never seems to get reset for this machine. The jobs I'm
running on it now start and then sleep for exactly 10 minutes. So there
should never be a job on the machine that runs for >10 minutes. And
yet...the collector seems to think the machine has been running a job
for >20 minutes.
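
To put a number on the gap, here is a rough Python sketch that parses the
ActvtyTime values from the two condor_status listings above and compares
them against the 10-minute sleep. The function names and the
MAX_JOB_RUNTIME constant are my own for illustration and are not part of
Condor; the values are hard-coded from the output shown, not queried live.

```python
from datetime import timedelta

def parse_activity_time(text):
    """Parse a condor_status ActvtyTime string such as '0+00:21:06' (days+HH:MM:SS)."""
    days, clock = text.split("+")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(days=int(days), hours=hours, minutes=minutes, seconds=seconds)

# ActvtyTime values copied by hand from the two listings above.
collector_view = {"vm1": "0+00:21:06", "vm2": "0+00:22:32"}
direct_view = {"vm1": "0+00:06:02", "vm2": "0+00:07:27"}

# Assumption for this sketch: every test job sleeps for exactly 10 minutes.
MAX_JOB_RUNTIME = timedelta(minutes=10)

def compare(collector, direct, limit):
    """Return (slot, gap, collector_over_limit, direct_over_limit) per slot."""
    rows = []
    for slot in sorted(collector):
        c = parse_activity_time(collector[slot])
        d = parse_activity_time(direct[slot])
        rows.append((slot, c - d, c > limit, d > limit))
    return rows

for slot, gap, c_over, d_over in compare(collector_view, direct_view, MAX_JOB_RUNTIME):
    print(f"{slot}: collector is ahead of startd by {gap}; "
          f"over limit per collector={c_over}, per startd={d_over}")
```

Both VMs come out roughly 15 minutes apart, and only the collector's
numbers exceed the 10-minute limit, which fits the picture of a stale
collector record rather than a job that is actually running long.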

Any ideas on why my collector is so out of whack with reality?

- Ian

--
Ian R. Chesal <ichesal@xxxxxxxxxx>
Senior Software Engineer

Altera Corporation
Toronto Technology Center
Tel: (416) 926-8300