
[Condor-users] Trouble with a schedd getting out-of-sync with reality



(This problem relates to Condor 6.7.3 running on Windows XP.)

I'm having persistent trouble with one user's schedd reporting
misinformation about the state of running jobs from this machine. The
user has a large number of jobs scheduled (3041 jobs; 2955 idle, 86
running, 0 held). Many of the jobs have had their requirements set to
restrict them to a set of 7 machines in our pool. All seven of these
machines are dual-processor machines with 2 VMs running on each of
them.

That means no more than 14 jobs can be running at the same time.
However, when I look at the condor_q output for this machine it's
reporting that more than 14 of the restricted jobs are running
simultaneously.

What's odd is that first thing this morning, when all these jobs were
queued up, the output from condor_q was correct. It seems to have
drifted over time, so that more and more jobs are still shown as running
in the condor_q output after they have finished.

She has 4 clusters with the following requirements set on each job in
the cluster:

((VirtualMemory >= ImageSize) && (Memory =!= UNDEFINED) && (Arch ==
"INTEL" && (OpSys == "WINNT40" || OpSys == "WINNT50" || OpSys ==
"WINNT51")) && (AlteraIsDesktop =?= FALSE) && ((AlteraMachineClass ==
866)) && ((Machine == "TTC-BS866-008.altera.com" || Machine ==
"TTC-BS866-011.altera.com" || Machine == "TTC-BS866-012.altera.com" ||
Machine == "TTC-BS866-013.altera.com" || Machine ==
"TTC-BS866-014.altera.com" || Machine == "TTC-BS866-015.altera.com" ||
Machine == "TTC-BS866-016.altera.com"))) && (Disk >= DiskUsage) &&
(HasFileTransfer)

If I query those four clusters for their running jobs I get more than 14
jobs returned: 

[0] > condor_q -name ttc-bchan2.altera.priv.altera.com -const 'JobStatus==2' 134 142 135 123

 
-- Schedd: TTC-BCHAN2.altera.priv.altera.com : <137.57.142.165:1045>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE   CMD
 134.67  bchan           1/28 18:13   0+03:49:53 R  20  573.4  wrapper.bat /exper
 134.86  bchan           1/28 18:13   0+02:42:44 R  20  1207.8 wrapper.bat /exper
 135.1   bchan           1/28 18:14   0+02:41:24 R  19  548.1  wrapper.bat /exper
 135.17  bchan           1/28 18:14   0+01:51:46 R  19  241.6  wrapper.bat /exper
 135.18  bchan           1/28 18:14   0+01:51:54 R  19  500.0  wrapper.bat /exper
 135.21  bchan           1/28 18:14   0+01:32:34 R  19  500.0  wrapper.bat /exper
 135.35  bchan           1/28 18:14   0+01:02:32 R  19  704.8  wrapper.bat /exper
 135.55  bchan           1/28 18:14   0+00:18:48 R  19  500.0  wrapper.bat /exper
 135.57  bchan           1/28 18:14   0+00:18:07 R  19  500.0  wrapper.bat /exper
 135.59  bchan           1/28 18:14   0+00:35:10 R  19  489.1  wrapper.bat /exper
 135.62  bchan           1/28 18:14   0+00:07:15 R  19  500.0  wrapper.bat /exper
 135.66  bchan           1/28 18:14   0+00:01:08 R  19  500.0  wrapper.bat /exper
 135.67  bchan           1/28 18:14   0+00:01:06 R  19  500.0  wrapper.bat /exper
 135.69  bchan           1/28 18:14   0+00:00:00 R  19  500.0  wrapper.bat /exper
 135.71  bchan           1/28 18:14   0+00:15:00 R  19  500.0  wrapper.bat /exper
 142.0   bchan           1/31 11:41   0+00:16:57 R  20  900.0  wrapper.bat /exper
 142.1   bchan           1/31 11:41   0+00:08:38 R  20  700.0  wrapper.bat /exper
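
To make the discrepancy concrete, here is a rough Python sketch that
counts the rows marked R in saved condor_q output and compares the count
against the 7 x 2 = 14 VMs described earlier. The function names and the
idea of saving the output to text are my own assumptions for
illustration; this is not part of Condor.

```python
import re

# Rows of condor_q output start with a job ID of the form <cluster>.<proc>.
JOB_ID = re.compile(r"^\d+\.\d+$")

# Assumption: 7 machines with 2 VMs each, as described in this message.
MACHINES = 7
VMS_PER_MACHINE = 2
CAPACITY = MACHINES * VMS_PER_MACHINE


def count_running(condor_q_text):
    """Count rows whose ST column is 'R' in plain condor_q output."""
    running = 0
    for line in condor_q_text.splitlines():
        fields = line.split()
        # Expected columns: ID OWNER SUBMITTED (date, time) RUN_TIME ST ...
        if len(fields) >= 6 and JOB_ID.match(fields[0]) and fields[5] == "R":
            running += 1
    return running


def excess_running(condor_q_text, capacity=CAPACITY):
    """Number of 'running' rows beyond what the restricted VMs can hold."""
    return max(0, count_running(condor_q_text) - capacity)
```

Fed the 17 rows listed above, excess_running would report 3 jobs more
than the 14 VMs can hold.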

That output cannot be right. There are only 14 VMs available across
those 7 machines:

[0] > condor_status -const 'Machine=="TTC-BS866-008.altera.com"||Machine=="TTC-BS866-011.altera.com"||Machine=="TTC-BS866-012.altera.com"||Machine=="TTC-BS866-013.altera.com"||Machine=="TTC-BS866-014.altera.com"||Machine=="TTC-BS866-015.altera.com"||Machine=="TTC-BS866-016.altera.com"'
 
Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

vm1@TTC-BS866 WINNT51     INTEL  Claimed    Retiring   0.980  1023  [?????]
vm2@TTC-BS866 WINNT51     INTEL  Claimed    Retiring   1.020  1023  [?????]
vm1@TTC-BS866 WINNT51     INTEL  Claimed    Retiring   1.660  1023  [?????]
vm2@TTC-BS866 WINNT51     INTEL  Claimed    Retiring   0.370  1023  [?????]
vm1@TTC-BS866 WINNT51     INTEL  Claimed    Busy       0.830  1023  [?????]
vm2@TTC-BS866 WINNT51     INTEL  Claimed    Busy       1.190  1023  0+00:13:42
vm1@TTC-BS866 WINNT51     INTEL  Claimed    Retiring   1.570  1023  0+00:14:33
vm2@TTC-BS866 WINNT51     INTEL  Claimed    Retiring   1.520  1023  0+00:17:17
vm1@TTC-BS866 WINNT51     INTEL  Claimed    Busy       1.060  1023  0+00:14:14
vm2@TTC-BS866 WINNT51     INTEL  Claimed    Busy       1.050  1023  [?????]
vm1@TTC-BS866 WINNT51     INTEL  Claimed    Busy       1.010  1023  [?????]
vm2@TTC-BS866 WINNT51     INTEL  Claimed    Retiring   1.040  1023  0+00:15:52
vm1@TTC-BS866 WINNT51     INTEL  Claimed    Busy       0.500  1023  0+00:12:07
vm2@TTC-BS866 WINNT51     INTEL  Claimed    Idle       0.240  1023  [?????]
 
                     Machines Owner Claimed Unclaimed Matched Preempting
 
       INTEL/WINNT51       14     0      14         0       0          0
 
               Total       14     0      14         0       0          0
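
For comparing the two views, a small Python sketch like the following
could tally the Activity column of saved condor_status output. It
assumes each VM row starts with a name containing "@" (as in the listing
above) and that Activity is the fifth whitespace-separated column; the
function name is hypothetical.

```python
from collections import Counter


def activity_counts(condor_status_text):
    """Tally the Activity column for each VM row of condor_status output."""
    counts = Counter()
    for line in condor_status_text.splitlines():
        fields = line.split()
        # Expected columns: Name OpSys Arch State Activity LoadAv Mem ActvtyTime
        if len(fields) >= 5 and "@" in fields[0]:
            counts[fields[4]] += 1
    return counts
```

For the listing above this would tally 7 Retiring, 6 Busy and 1 Idle,
which accounts for all 14 VMs and leaves no room for the extra jobs that
condor_q shows as running.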

How can this stale information be corrected? How did her schedd get into
such an inconsistent state? This really has my users freaked out.

- Ian