
Re: [Condor-users] slot1 resources disappear after a few days.



John,

 

I was hoping that the UDP issues had been resolved by now; my previous Condor 6.6.x pool was using TCP updates because of this issue.   Slot2 never seems to be affected by this… do you still think UDP updates are to blame?  I suppose it doesn’t hurt to give it a try.

 

Thanks,

 

Bryan

 

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Kewley, J (John)
Sent: Friday, March 28, 2008 1:16 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] slot1 resources disappear after a few days.

 

I think this is the known problem with UDP updates for Windows machines in general.
A fair few sites have mentioned problems like this in the past, when whole
machines used to vanish. Now that there are more and more multi-slot machines,
it appears that some of the slots report OK and some don't.

 

If you check previous posts in this forum you'll see some suggestions from the
Condor team, but I think the only thing that I found to work was enabling
TCP rather than UDP for the ClassAd heartbeat.
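
For readers who want to try this, a minimal sketch of the configuration change might look like the fragment below. It assumes the UPDATE_COLLECTOR_WITH_TCP and COLLECTOR_SOCKET_CACHE_SIZE settings described in the Condor manual for this era; the cache size of 1000 is only an illustrative value and should be sized for your own pool.

```
# On each machine that sends updates to the collector
# (condor_config or condor_config.local):
UPDATE_COLLECTOR_WITH_TCP = True

# On the central manager, so the collector keeps a cache of TCP sockets.
# 1000 is an illustrative value, not a recommendation from this thread.
COLLECTOR_SOCKET_CACHE_SIZE = 1000
```

The daemons have to re-read their configuration before the change takes effect. Whether a condor_reconfig is enough or a restart is needed is not something this thread confirms.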

 

Cheers

 

JK

 


From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of carl langlois
Sent: Friday, March 28, 2008 3:51 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] slot1 resources disappear after a few days.

Hi Bryan,

Do you have any core.something.WIN32 files in your log directory? I had a similar problem where some slots disappeared from the pool at some point, and I noticed a core file in the log directory, but I don't know why it happened.


Carl


On Fri, Mar 28, 2008 at 11:02 AM, Bryan S. Maher <Bryan.Maher@xxxxxxxxxx> wrote:

Hi All:

 

I have a new Condor pool uniformly running v7.0.1 on Windows.   After a day or two the slot1 resources fail to show up when issuing a condor_status command.  Here is sample output:

 

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@xxxxxxxxxxx. WINNT51    INTEL  Owner     Idle     0.030  1023  0+04:32:59
slot2@xxxxxxxxxxx. WINNT51    INTEL  Owner     Idle     0.000  1023  0+04:33:00
slot2@xxxxxxxxxxxx WINNT51    INTEL  Owner     Idle     0.000  1534  0+04:35:05
slot2@xxxxxxxxxxxx WINNT52    INTEL  Unclaimed Idle     0.000  1006  5+14:26:38
slot2@xxxxxxxxxxxx WINNT52    INTEL  Unclaimed Idle     0.000  1006  0+02:25:07
slot2@xxxxxxxxxxxx WINNT52    INTEL  Unclaimed Idle     0.000  1006  0+02:25:05
slot2@xxxxxxxxxxxx WINNT52    INTEL  Unclaimed Idle     0.000  1006  0+02:25:05
slot2@xxxxxxxxxxxx WINNT52    INTEL  Unclaimed Idle     0.000  1006  0+02:25:07

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

       INTEL/WINNT51     3     3       0         0       0          0        0
       INTEL/WINNT52     5     0       0         5       0          0        0

               Total     8     3       0         5       0          0        0

 

As you can see, even the totals fail to count the slot1 resources.  A condor_reconfig is sufficient to bring slot1 back to life.   The StartLog on an affected machine looks like:

 

3/17 12:03:18 ******************************************************
3/17 12:03:18 ** condor_startd.exe (CONDOR_STARTD) STARTING UP
3/17 12:03:18 ** C:\condor\bin\condor_startd.exe
3/17 12:03:18 ** $CondorVersion: 7.0.1 Feb 27 2008 BuildID: 76180 $
3/17 12:03:18 ** $CondorPlatform: INTEL-WINNT50 $
3/17 12:03:18 ** PID = 1880
3/17 12:03:18 ** Log last touched 3/17 11:01:32
3/17 12:03:18 ******************************************************
3/17 12:03:18 Using config source: C:\condor\condor_config
3/17 12:03:18 Using local config sources:
3/17 12:03:18    C:\condor\condor_config.local
3/17 12:03:18 DaemonCore: Command Socket at <x.x.x.x:1071>
3/17 12:03:18 MachAttributes::publish: failed to get Windows version information
3/17 12:03:24 slot1: New machine resource allocated
3/17 12:03:24 slot2: New machine resource allocated
3/17 12:03:29 About to run initial benchmarks.
3/17 12:03:33 Completed initial benchmarks.
.
.  slot2 continues to run benchmarks, slot1 never runs benchmarks …
.
3/17 12:03:33 slot2: State change: IS_OWNER is false
3/17 12:03:33 slot2: Changing state: Owner -> Unclaimed
3/17 12:03:33 slot1: State change: IS_OWNER is false
3/17 12:03:33 slot1: Changing state: Owner -> Unclaimed
3/17 16:03:33 State change: RunBenchmarks is TRUE
3/17 16:03:33 slot2: Changing activity: Idle -> Benchmarking
3/17 16:03:36 State change: benchmarks completed
3/17 16:03:36 slot2: Changing activity: Benchmarking -> Idle
3/17 20:03:36 State change: RunBenchmarks is TRUE
3/17 20:03:36 slot2: Changing activity: Idle -> Benchmarking
3/17 20:03:39 State change: benchmarks completed
.
.  reconfig sent, slot1 begins to run benchmarks in lieu of slot2
.  slot1 reappears in condor_status for a while …
.
3/22 21:50:06 Got SIGHUP.  Re-reading config files.
3/23 00:10:06 State change: RunBenchmarks is TRUE
3/23 00:10:06 slot1: Changing activity: Idle -> Benchmarking
3/23 00:10:10 State change: benchmarks completed
3/23 00:10:10 slot1: Changing activity: Benchmarking -> Idle
3/23 04:10:10 State change: RunBenchmarks is TRUE
3/23 04:10:10 slot1: Changing activity: Idle -> Benchmarking
3/23 04:10:14 State change: benchmarks completed
3/23 04:10:14 slot1: Changing activity: Benchmarking -> Idle
.
.  slot1 benchmarks continue but slot1 is no longer visible in condor_status …
.
3/28 04:12:18 slot1: Changing activity: Benchmarking -> Idle
3/28 08:12:19 State change: RunBenchmarks is TRUE
3/28 08:12:19 slot1: Changing activity: Idle -> Benchmarking
3/28 08:12:22 State change: benchmarks completed
3/28 08:12:22 slot1: Changing activity: Benchmarking -> Idle
<end>

 

Any ideas?

 

-Bryan

 

 

