
[Condor-users] slot1 resources disappear after a few days.



Hi All:

 

I have a new Condor pool in which every machine runs v7.0.1 on Windows. After a day or two, the slot1 resources stop showing up in condor_status output. Here is a sample:

 

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@xxxxxxxxxxxx WINNT51    INTEL  Owner     Idle     0.030  1023  0+04:32:59
slot2@xxxxxxxxxxxx WINNT51    INTEL  Owner     Idle     0.000  1023  0+04:33:00
slot2@xxxxxxxxxxxx WINNT51    INTEL  Owner     Idle     0.000  1534  0+04:35:05
slot2@xxxxxxxxxxxx WINNT52    INTEL  Unclaimed Idle     0.000  1006  5+14:26:38
slot2@xxxxxxxxxxxx WINNT52    INTEL  Unclaimed Idle     0.000  1006  0+02:25:07
slot2@xxxxxxxxxxxx WINNT52    INTEL  Unclaimed Idle     0.000  1006  0+02:25:05
slot2@xxxxxxxxxxxx WINNT52    INTEL  Unclaimed Idle     0.000  1006  0+02:25:05
slot2@xxxxxxxxxxxx WINNT52    INTEL  Unclaimed Idle     0.000  1006  0+02:25:07

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

       INTEL/WINNT51     3     3       0         0       0          0        0
       INTEL/WINNT52     5     0       0         5       0          0        0

               Total     8     3       0         5       0          0        0

 

As you can see, only one slot1 is listed, and the totals do not count the missing slot1 resources either. Running condor_reconfig is enough to bring slot1 back, but only for a while. The StartLog on an affected machine looks like:
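To find the affected machines without reading the listing by eye, something like the following could be run over saved condor_status output. It is only a minimal sketch: the function name `missing_slots` and the host names in the usage note are made up for illustration, and it assumes each machine is expected to advertise exactly slot1 and slot2.

```python
import re
from collections import defaultdict

# Matches a slot name such as "slot1@host.example.org" at the start of a line.
SLOT_LINE = re.compile(r"^slot(\d+)@(\S+)")

def missing_slots(status_text, expected=(1, 2)):
    """Return {host: [missing slot numbers]} for every host that
    advertises some, but not all, of the expected slots."""
    seen = defaultdict(set)
    for line in status_text.splitlines():
        m = SLOT_LINE.match(line.strip())
        if m:
            seen[m.group(2)].add(int(m.group(1)))
    wanted = set(expected)
    return {host: sorted(wanted - slots)
            for host, slots in seen.items()
            if wanted - slots}
```

For example, given text containing slot1@hostA, slot2@hostA and slot2@hostB (hypothetical names), the function reports that hostB is missing slot 1. A machine that has dropped out of condor_status entirely would not be reported, since the sketch only sees hosts that still advertise at least one slot.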

 

3/17 12:03:18 ******************************************************
3/17 12:03:18 ** condor_startd.exe (CONDOR_STARTD) STARTING UP
3/17 12:03:18 ** C:\condor\bin\condor_startd.exe
3/17 12:03:18 ** $CondorVersion: 7.0.1 Feb 27 2008 BuildID: 76180 $
3/17 12:03:18 ** $CondorPlatform: INTEL-WINNT50 $
3/17 12:03:18 ** PID = 1880
3/17 12:03:18 ** Log last touched 3/17 11:01:32
3/17 12:03:18 ******************************************************
3/17 12:03:18 Using config source: C:\condor\condor_config
3/17 12:03:18 Using local config sources:
3/17 12:03:18    C:\condor\condor_config.local
3/17 12:03:18 DaemonCore: Command Socket at <x.x.x.x:1071>
3/17 12:03:18 MachAttributes::publish: failed to get Windows version information
3/17 12:03:24 slot1: New machine resource allocated
3/17 12:03:24 slot2: New machine resource allocated
3/17 12:03:29 About to run initial benchmarks.
3/17 12:03:33 Completed initial benchmarks.
.
.  slot2 continues to run benchmarks; slot1 never runs benchmarks …
.
3/17 12:03:33 slot2: State change: IS_OWNER is false
3/17 12:03:33 slot2: Changing state: Owner -> Unclaimed
3/17 12:03:33 slot1: State change: IS_OWNER is false
3/17 12:03:33 slot1: Changing state: Owner -> Unclaimed
3/17 16:03:33 State change: RunBenchmarks is TRUE
3/17 16:03:33 slot2: Changing activity: Idle -> Benchmarking
3/17 16:03:36 State change: benchmarks completed
3/17 16:03:36 slot2: Changing activity: Benchmarking -> Idle
3/17 20:03:36 State change: RunBenchmarks is TRUE
3/17 20:03:36 slot2: Changing activity: Idle -> Benchmarking
3/17 20:03:39 State change: benchmarks completed
.
.  reconfig sent; slot1 begins to run benchmarks in lieu of slot2
.  slot1 reappears in condor_status for a while …
.
3/22 21:50:06 Got SIGHUP.  Re-reading config files.
3/23 00:10:06 State change: RunBenchmarks is TRUE
3/23 00:10:06 slot1: Changing activity: Idle -> Benchmarking
3/23 00:10:10 State change: benchmarks completed
3/23 00:10:10 slot1: Changing activity: Benchmarking -> Idle
3/23 04:10:10 State change: RunBenchmarks is TRUE
3/23 04:10:10 slot1: Changing activity: Idle -> Benchmarking
3/23 04:10:14 State change: benchmarks completed
3/23 04:10:14 slot1: Changing activity: Benchmarking -> Idle
.
.  slot1 benchmarks continue but slot1 is no longer visible in condor_status …
.
3/28 04:12:18 slot1: Changing activity: Benchmarking -> Idle
3/28 08:12:19 State change: RunBenchmarks is TRUE
3/28 08:12:19 slot1: Changing activity: Idle -> Benchmarking
3/28 08:12:22 State change: benchmarks completed
3/28 08:12:22 slot1: Changing activity: Benchmarking -> Idle

<end>
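The pattern in the log (slot2 benchmarking before the reconfig, slot1 afterwards) can be checked over a whole StartLog with a short script. This is a rough sketch that assumes the line format shown in the excerpt; the names `benchmark_runs` and `runs_per_slot` are hypothetical helpers, not part of Condor.

```python
import re
from collections import Counter

# Matches lines such as:
#   3/17 16:03:33 slot2: Changing activity: Idle -> Benchmarking
BENCH_START = re.compile(
    r"^(\d+/\d+) (\d+:\d+:\d+) (slot\d+): "
    r"Changing activity: Idle -> Benchmarking"
)

def benchmark_runs(log_text):
    """Return (date, time, slot) for every benchmark start, in log order."""
    runs = []
    for line in log_text.splitlines():
        m = BENCH_START.match(line.strip())
        if m:
            runs.append(m.groups())
    return runs

def runs_per_slot(log_text):
    """Count how many benchmark runs each slot started."""
    return Counter(slot for _, _, slot in benchmark_runs(log_text))
```

Comparing the output of `benchmark_runs` with the time of the "Got SIGHUP" line would show whether the switch from slot2 to slot1 always coincides with a reconfig.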

 

Any ideas?

 

-Bryan