
[Condor-users] Shadow Exception on Windows 2003 server



I recently installed Condor 6.6.6 on a Windows 2003 server, and when I tried
to run the "printname" example I got a shadow exception. After I submit the
name.sub file with condor_submit, the job stays running for about 20 seconds
and then goes back to idle in condor_q. The execution log file records the
following message multiple times:

001 (025.000.000) 08/09 11:13:17 Job executing on host: <128.32.62.44:2650>
...
007 (025.000.000) 08/09 11:13:17 Shadow exception!
	Can no longer talk to condor_starter on execute machine
(128.32.62.44)
	0  -  Run Bytes Sent By Job
	98  -  Run Bytes Received By Job

In addition, the shadow.log file records

8/9 11:13:14 ******************************************************
8/9 11:13:14 ** condor_shadow (CONDOR_SHADOW) STARTING UP
8/9 11:13:14 ** C:\Condor\bin\condor_shadow.exe
8/9 11:13:14 ** $CondorVersion: 6.6.6 Jul 26 2004 $
8/9 11:13:14 ** $CondorPlatform: INTEL-WINNT40 $
8/9 11:13:14 ** PID = 820
8/9 11:13:14 ******************************************************
8/9 11:13:14 Using config file: C:\Condor\condor_config
8/9 11:13:14 Using local config files: C:\Condor/condor_config.local
8/9 11:13:14 DaemonCore: Command Socket at <128.32.62.44:4544>
8/9 11:13:14 Initializing a VANILLA shadow
8/9 11:13:14 (24.0) (2628): Request to run on <128.32.62.220:1149> was
ACCEPTED
8/9 11:13:15 Initializing a VANILLA shadow
8/9 11:13:15 (25.0) (820): Request to run on <128.32.62.44:2650> was
ACCEPTED
8/9 11:13:17 (25.0) (820): condor_read(): recv() returned -1, errno = 10054,
assuming failure.
8/9 11:13:17 (25.0) (820): ERROR "Can no longer talk to condor_starter on
execute machine (128.32.62.44)" at line 63 in file
..\src\condor_shadow.V6.1\NTreceivers.C
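
For reference, errno 10054 in the recv() line above is the Winsock code WSAECONNRESET (connection reset by peer), which suggests the starter side closed the socket rather than the shadow timing out. The following is a minimal Python sketch for pulling such codes out of a shadow log and labelling them. The helper name explain_recv_errors and the small lookup table are illustrative assumptions, not part of Condor, and the table covers only a few common Winsock codes.

```python
import re

# Partial table of Winsock error codes (assumption: only a few common ones listed).
WINSOCK_ERRORS = {
    10053: "WSAECONNABORTED: connection aborted by local software",
    10054: "WSAECONNRESET: connection reset by peer",
    10060: "WSAETIMEDOUT: connection timed out",
    10061: "WSAECONNREFUSED: connection refused",
}

# Matches the condor_read() failure line format seen in the shadow log above.
ERRNO_RE = re.compile(r"recv\(\) returned -?\d+, errno = (\d+)")


def explain_recv_errors(log_text):
    """Return a list of (errno, description) for each recv() failure found."""
    results = []
    for match in ERRNO_RE.finditer(log_text):
        code = int(match.group(1))
        results.append((code, WINSOCK_ERRORS.get(code, "unknown Winsock error")))
    return results


# Hypothetical usage with the line quoted from the shadow log.
sample = (
    "8/9 11:13:17 (25.0) (820): condor_read(): recv() returned -1, "
    "errno = 10054,\nassuming failure."
)
for code, meaning in explain_recv_errors(sample):
    print(code, meaning)
```

In practice the sample string would be replaced by the contents of the shadow log file read from disk.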


The Starter.log file records

8/9 11:13:15 DaemonCore: Command received via TCP from host
<128.32.62.44:4557>
8/9 11:13:15 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling
handler (command_activate_claim)
8/9 11:13:15 vm2: Got activate_claim request from shadow
(<128.32.62.44:4557>)
8/9 11:13:15 vm2: Remote job ID is 25.0
8/9 11:13:15 vm2: Got universe "VANILLA" (5) from request classad
8/9 11:13:15 vm2: State change: claim-activation protocol successful
8/9 11:13:15 vm2: Changing activity: Idle -> Busy
8/9 11:13:17 DaemonCore: Command received via UDP from host
<128.32.62.44:4568>
8/9 11:13:17 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling
handler (HandleProcessExitCommand())
8/9 11:13:17 Starter pid 1336 exited with status 4
8/9 11:13:17 vm2: State change: starter exited
8/9 11:13:17 vm2: Changing activity: Busy -> Idle
8/9 11:13:17 DaemonCore: Command received via UDP from host
<128.32.62.44:4569>
8/9 11:13:17 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_handler)
8/9 11:13:17 vm2: State change: received RELEASE_CLAIM command
8/9 11:13:17 vm2: Changing state and activity: Claimed/Idle ->
Preempting/Vacating
8/9 11:13:17 vm2: State change: No preempting claim, returning to owner
8/9 11:13:17 vm2: Changing state and activity: Preempting/Vacating ->
Owner/Idle
8/9 11:13:17 vm2: State change: IS_OWNER is false
8/9 11:13:17 vm2: Changing state: Owner -> Unclaimed
8/9 11:13:17 DaemonCore: Command received via UDP from host
<128.32.62.44:4570>
8/9 11:13:17 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_handler)
8/9 11:13:17 Error: can't find resource with capability
(<128.32.62.44:2650>#3668732430)
8/9 11:13:17 DaemonCore: Command received via UDP from host
<128.32.62.44:4572>
8/9 11:13:17 DaemonCore: received command 60014 (DC_INVALIDATE_KEY), calling
handler (handle_invalidate_key())
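
The "Starter pid 1336 exited with status 4" line at 11:13:17 carries the same timestamp as the recv() failure in the shadow log, so the starter exit looks like the event to investigate. Below is a small Python sketch, under the assumption that the log keeps the line format shown above, for listing starter exits with their timestamps so they can be matched against shadow errors. The function name starter_exits is hypothetical, and the sketch does not interpret what a given status value means.

```python
import re

# Matches lines such as: "8/9 11:13:17 Starter pid 1336 exited with status 4"
STARTER_EXIT_RE = re.compile(
    r"^(\d+/\d+ \d+:\d+:\d+) Starter pid (\d+) exited with status (\d+)",
    re.MULTILINE,
)


def starter_exits(log_text):
    """Return (timestamp, pid, status) for every starter exit in the log text."""
    return [
        (timestamp, int(pid), int(status))
        for timestamp, pid, status in STARTER_EXIT_RE.findall(log_text)
    ]


# Hypothetical usage with two lines copied from the log above.
sample_log = (
    "8/9 11:13:15 vm2: Changing activity: Idle -> Busy\n"
    "8/9 11:13:17 Starter pid 1336 exited with status 4\n"
)
for timestamp, pid, status in starter_exits(sample_log):
    print(timestamp, pid, status)
```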

Has anyone else had a similar problem on MS Windows 2003 Server? Is there a
workaround for it? Thanks.


Chen Chang