
Re: [Condor-users] Did something change in the 7.2.2 Windows release?- I can't see startds any more



Ian,

Does your collector log say anything strange? Specifically, look for a line containing "REJECTION". Some code was added to reject incompatible ads from newer (7.3+) daemons that are using CCB. If that is getting triggered in your case, then something has gone wrong.
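
If the collector log is large, a small script can do the search. The following is only a sketch: the function name and the example path are assumptions for illustration, and it does nothing more than a plain substring match on "REJECTION", since the exact wording of the log message is not given here.

```python
def find_rejections(log_path, needle="REJECTION"):
    """Return (line_number, text) pairs for log lines containing `needle`.

    `log_path` is whatever file your collector writes its log to; the
    name and location depend on your configuration.
    """
    hits = []
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(log_path, "r", errors="replace") as log:
        for lineno, line in enumerate(log, start=1):
            if needle in line:
                hits.append((lineno, line.rstrip("\n")))
    return hits


def report(log_path):
    """Print every matching line, or a note that nothing matched."""
    hits = find_rejections(log_path)
    if not hits:
        print(f"No REJECTION lines found in {log_path}")
    for lineno, text in hits:
        print(f"{log_path}:{lineno}: {text}")
```

Calling report("CollectorLog") with the real path to the collector's log (the name used here is a placeholder) prints each matching line with its line number.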

--Dan

Ian Chesal wrote:
I just upgraded my binaries from the 7.2.2 pre-release binaries to the
7.2.2 officially released binaries and my two Windows machines in my
test pool are no longer working correctly. They're failing to send the
startd ads to my collector. The daemons start up fine and I see the
correct processes:

D:\arc\condor\log>ps -ef |grep condor_
  SYSTEM    852    696  0 07:59:47 con  0:00 d:\arc\condor\bin\condor_master.exe
  SYSTEM   1256    852  0 07:59:48 con  0:00 condor_procd.exe -A //./pipe/procd_pipe -L d:/arc/condor/log/ProcdLog -K d:/arc/condor/bin\condor_softkill.exe
  SYSTEM    204    852  0 07:59:48 con  0:08 condor_startd.exe -f

And the StartLog is sane:

4/16 07:59:48 ******************************************************
4/16 07:59:48 ** condor_startd.exe (CONDOR_STARTD) STARTING UP
4/16 07:59:48 ** d:\arc\condor\bin\condor_startd.exe
4/16 07:59:48 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
4/16 07:59:48 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
4/16 07:59:48 ** $CondorVersion: 7.2.2 Apr  9 2009 BuildID: 145189 $
4/16 07:59:48 ** $CondorPlatform: INTEL-WINNT50 $
4/16 07:59:48 ** PID = 204
4/16 07:59:48 ** Log last touched time unavailable (No such file or directory)
4/16 07:59:48 ******************************************************
4/16 07:59:48 Using config source: \\sv129\arc\condor\condor_config
4/16 07:59:48 Using local config sources:
4/16 07:59:48    \\sv129\arc\condor/condor_config.basic
4/16 07:59:48    \\sv129\arc\condor/os/condor_config.WINNT51
4/16 07:59:48    \\sv129\arc\condor/site/condor_config.SJDEV
4/16 07:59:48    \\sv129\arc\condor/machine/condor_config.sj-bs3400-272
4/16 07:59:48    \\sv129\arc\condor/machine/condor_config.sj-bs3400-272.WINNT51
4/16 07:59:48    \\sv129\arc\condor/patch/condor_config.sj-bs3400-272
4/16 07:59:48    \\sv129\arc\condor/patch/condor_config.sj-bs3400-272.WINNT51
4/16 07:59:48    \\sv129\arc\condor/cycleserver/sj-bs3400-272.config
4/16 07:59:48 DaemonCore: Command Socket at <137.57.203.81:4807>
4/16 07:59:48 slot1: New machine resource of type 2 allocated
4/16 07:59:48 slot2: New machine resource of type 3 allocated
4/16 07:59:53 About to run initial benchmarks.
4/16 08:00:01 Completed initial benchmarks.
4/16 08:00:01 Cron: Initializing job 'update' (d:/arc/scripts/hooks/update_hooks_and_modules.bat)
4/16 08:00:01 Executable is a batch script, so executing C:\WINDOWS\system32\cmd.exe  /Q /C "d:/arc/scripts/hooks/update_hooks_and_modules.bat"  update
4/16 08:00:01 slot2: State change: IS_OWNER is false
4/16 08:00:01 slot2: Changing state: Owner -> Unclaimed
4/16 08:00:01 Executable is a batch script, so executing C:\WINDOWS\system32\cmd.exe  /Q /C "d:/arc/scripts/hooks/arc_job_fetch.bat"
4/16 08:00:01 slot1: State change: IS_OWNER is false
4/16 08:00:01 slot1: Changing state: Owner -> Unclaimed
4/16 08:00:01 Executable is a batch script, so executing C:\WINDOWS\system32\cmd.exe  /Q /C "d:/arc/scripts/hooks/arc_job_fetch.bat"
4/16 08:00:02 Calling pipe Handler <Guarantee all data written to pipe> for Pipe end=65539 <DC stdin pipe>
4/16 08:00:02 Return from pipe Handler
4/16 08:00:02 Calling pipe Handler <Guarantee all data written to pipe> for Pipe end=65541 <DC stdin pipe>
4/16 08:00:02 Return from pipe Handler
4/16 08:00:02 Received UDP command 60011 (DC_NOP) from <137.57.203.81:4810>, access level READ
4/16 08:00:02 Calling HandleReq <handle_nop()> (0)
4/16 08:00:02 Return from HandleReq <handle_nop()> (handler: 0.000s, sec: 0.031s)

I can talk to the collector *from* the machine:

D:\arc\condor\log>condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

sqal08.altera.com  LINUX      INTEL  Unclaimed Idle     0.020  2026  0+03:42:04
sv129.altera.com   LINUX      INTEL  Owner     Idle     0.290  2024  1+17:34:16
slot1@sj-bs3400-31 LINUX      X86_64 Unclaimed Idle     0.230  1224  0+03:11:05
slot2@sj-bs3400-31 LINUX      X86_64 Unclaimed Idle     0.000  1224  0+19:11:52
slot3@sj-bs3400-31 LINUX      X86_64 Unclaimed Idle     0.000   750  0+19:11:53
slot4@sj-bs3400-31 LINUX      X86_64 Unclaimed Idle     0.000   750  0+19:11:54
slot1@sj-bs3400-31 LINUX      X86_64 Unclaimed Idle     0.370  1224  0+03:34:07
slot2@sj-bs3400-31 LINUX      X86_64 Unclaimed Idle     0.000  1224  0+19:34:54
slot3@sj-bs3400-31 LINUX      X86_64 Unclaimed Idle     0.000   750  0+19:34:55
slot4@sj-bs3400-31 LINUX      X86_64 Unclaimed Idle     0.000   750  0+19:34:56
slot1@sqal64-36-te LINUX      X86_64 Unclaimed Idle     0.160  1264  0+03:32:08
slot2@sqal64-36-te LINUX      X86_64 Unclaimed Idle     0.000   742  0+19:32:36
slot1@sqal64-37-te LINUX      X86_64 Unclaimed Idle     0.520  1224  0+03:25:10
slot2@sqal64-37-te LINUX      X86_64 Unclaimed Idle     0.000  1224  0+19:25:52
slot3@sqal64-37-te LINUX      X86_64 Unclaimed Idle     0.000   750  0+19:25:52
slot4@sqal64-37-te LINUX      X86_64 Unclaimed Idle     0.000   750  0+19:25:53

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX     2     1       0         1       0          0        0
        X86_64/LINUX    14     0       0        14       0          0        0

               Total    16     1       0        15       0          0        0

But you see no startd entries from my Windows machines in the
collector's view of the world:

D:\arc\condor\log>condor_status -any

MyType               TargetType           Name

DaemonMaster         None                 sj-bs3400-272.altera.com
DaemonMaster         None                 sj-bs3400-279.altera.com
DaemonMaster         None                 sj-bs3400-311.altera.com
Machine              Job                  sqal08.altera.com
Machine              Job                  sv129.altera.com
Machine              Job                  slot1@xxxxxxxxxxxxxxxxxxxxxxxx
Machine              Job                  slot2@xxxxxxxxxxxxxxxxxxxxxxxx
Machine              Job                  slot3@xxxxxxxxxxxxxxxxxxxxxxxx
Machine              Job                  slot4@xxxxxxxxxxxxxxxxxxxxxxxx
DaemonMaster         None                 sj-bs3400-312.altera.com
Machine              Job                  slot1@xxxxxxxxxxxxxxxxxxxxxxxx
Machine              Job                  slot2@xxxxxxxxxxxxxxxxxxxxxxxx
Machine              Job                  slot3@xxxxxxxxxxxxxxxxxxxxxxxx
Machine              Job                  slot4@xxxxxxxxxxxxxxxxxxxxxxxx
DaemonMaster         None                 sqal08.altera.com
Machine              Job                  slot1@xxxxxxxxxxxxxxxxxxxxxxxx
Machine              Job                  slot2@xxxxxxxxxxxxxxxxxxxxxxxx
DaemonMaster         None                 sqal64-36-test.altera.com
Machine              Job                  slot1@xxxxxxxxxxxxxxxxxxxxxxxx
Machine              Job                  slot2@xxxxxxxxxxxxxxxxxxxxxxxx
Machine              Job                  slot3@xxxxxxxxxxxxxxxxxxxxxxxx
Machine              Job                  slot4@xxxxxxxxxxxxxxxxxxxxxxxx
DaemonMaster         None                 sqal64-37-test.altera.com
Negotiator           None                 sv129.altera.com
DaemonMaster         None                 sv129.altera.com

And -direct returns nothing; the command times out:

D:\arc\condor\log>condor_status -direct localhost -debug
4/16 08:11:31 condor_read(): timeout reading 5 bytes from <137.57.203.81:4807>.
4/16 08:11:31 IO: Failed to read packet header

If I swap the official release binaries I just downloaded (the .zip
bundle, BTW) back out for the 7.2.2 pre-release binaries I had,
everything functions perfectly:

D:\tmp>condor_status -direct localhost

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@sj-bs3400-27 WINNT51    INTEL  Unclaimed Idle     0.470  2257  0+00:01:54
slot2@sj-bs3400-27 WINNT51    INTEL  Unclaimed Idle     0.000  1325  0+00:01:54

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

       INTEL/WINNT51     2     0       0         2       0          0        0

               Total     2     0       0         2       0          0        0

The other odd thing I noticed is running 'net stop condor' fails to kill
Condor off on the machine. I have to kill the condor_* processes
manually.

The pre-release binaries I was testing were:

D:\tmp>condor_version
$CondorVersion: 7.2.2 Mar 20 2009 BuildID: none PRE-RELEASE-UWCS $
$CondorPlatform: INTEL-WINNT50 $

So something after March 20th? I've reverted to the pre-release binaries
on my Windows machines for now.
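
To make the comparison between the two builds concrete, the version banners quoted above can be parsed and their build dates compared. This is only a sketch: the regular expression is an assumption based on the two $CondorVersion strings in this message (not on a documented format), the function name is made up for illustration, and the %b month parsing assumes an English locale.

```python
import re
from datetime import datetime

# Pattern inferred from the two banners in this message; other builds
# may format the string differently.
VERSION_RE = re.compile(
    r"\$CondorVersion:\s+(?P<version>\d+\.\d+\.\d+)\s+"
    r"(?P<date>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{4})\s+"
    r"BuildID:\s+(?P<build_id>\S+)"
)


def parse_condor_version(banner):
    """Return (version, build_date, build_id) from a $CondorVersion banner."""
    match = VERSION_RE.search(banner)
    if match is None:
        raise ValueError(f"unrecognized version banner: {banner!r}")
    # Collapse the padded day ("Apr  9 2009") before handing it to strptime.
    date_text = " ".join(match.group("date").split())
    build_date = datetime.strptime(date_text, "%b %d %Y").date()
    return match.group("version"), build_date, match.group("build_id")


# The two banners exactly as they appear in the StartLog and in the
# condor_version output above.
release = parse_condor_version(
    "$CondorVersion: 7.2.2 Apr  9 2009 BuildID: 145189 $"
)
prerelease = parse_condor_version(
    "$CondorVersion: 7.2.2 Mar 20 2009 BuildID: none PRE-RELEASE-UWCS $"
)

print("release:    ", release)
print("pre-release:", prerelease)
print("days between builds:", (release[1] - prerelease[1]).days)
```

Both builds report the same version number, so the build date and BuildID are the only fields in the banner that tell them apart.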

- Ian
