
[Condor-users] Shadow exception errors



Hi 

We have been setting up and experimenting with Condor for a while
and now have some "real" users on board using the system.

One user has submitted a number of jobs that keep trying to start,
fail and start again. There are shadow exception problems and eviction
problems. Concentrating on just the shadow exception problems for now,
I have included logs from the submitting machine and from 2 different
execute machines.

What problem is likely to cause these types of error messages?

The first example involves flocking to a different pool at a different
site. The second involves a job in the same pool, but on machines that
are still at a physically different site. In both cases hardware
firewalls (PIX units) sit in between, but we have set HIGHPORT and
LOWPORT in the configs and enabled TCP/UDP for the 9000-10000 port range.
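
One way to sanity-check the port-range setup described above is to pull
every <ip:port> address out of the logs and confirm that each port lies
inside 9000-10000. The following is only a rough sketch: the function name
and the two sample lines are chosen for illustration, and it assumes
addresses always appear in the <ip:port> form seen in the logs below.

```python
import re

# Matches the <ip:port> addresses that appear in the Condor logs.
ADDR_RE = re.compile(r"<(\d{1,3}(?:\.\d{1,3}){3}):(\d+)>")


def ports_outside_range(lines, low=9000, high=10000):
    """Return (ip, port) pairs whose port falls outside [low, high]."""
    bad = []
    for line in lines:
        for ip, port in ADDR_RE.findall(line):
            if not low <= int(port) <= high:
                bad.append((ip, int(port)))
    return bad


# Two lines copied from the shadow log; in practice, read the whole log file.
sample = [
    "2/13 10:54:09 DaemonCore: Command Socket at <130.155.67.83:9091>",
    "2/13 10:54:32 (7.0) (1268): Request to run on <130.116.147.52:9590> was",
]
print(ports_outside_range(sample))
```

An empty list means every address found uses a port inside the range the
firewalls allow. It does not prove the firewalls actually pass that
traffic; that still has to be tested between the sites.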

Thanks.

Cheers

Greg

SHADOW LOG OF SUBMITTING MACHINE

2/13 10:54:09 ******************************************************
2/13 10:54:09 ** condor_shadow (CONDOR_SHADOW) STARTING UP
2/13 10:54:09 ** C:\Condor\bin\condor_shadow.exe
2/13 10:54:09 ** $CondorVersion: 6.6.10 Jun 22 2005 $
2/13 10:54:09 ** $CondorPlatform: INTEL-WINNT50 $
2/13 10:54:09 ** PID = 1268
2/13 10:54:09 ******************************************************
2/13 10:54:09 Using config file: c:\condor\condor_config
2/13 10:54:09 Using local config files: C:\Condor/condor_config.local
2/13 10:54:09 DaemonCore: Command Socket at <130.155.67.83:9091>
2/13 10:54:32 Initializing a VANILLA shadow
2/13 10:54:32 (7.0) (1268): Request to run on <130.116.147.52:9590> was
ACCEPTED
2/13 10:54:45 (7.0) (1268): ReliSock: put_file: Failed to open file
C:\Documents and Settings\odw010\.condorqueue\D78aUAA.egs, errno = 2.
2/13 10:54:45 (7.0) (1268): ERROR "DoUpload: Failed to send file
C:\Documents and Settings\odw010\.condorqueue\D78aUAA.egs, exiting at
1398
" at line 1397 in file ..\src\condor_c++_util\file_transfer.C
2/13 10:54:46 ******************************************************
2/13 10:54:46 ** condor_shadow (CONDOR_SHADOW) STARTING UP
2/13 10:54:46 ** C:\Condor\bin\condor_shadow.exe
2/13 10:54:46 ** $CondorVersion: 6.6.10 Jun 22 2005 $
2/13 10:54:46 ** $CondorPlatform: INTEL-WINNT50 $
2/13 10:54:46 ** PID = 2676
2/13 10:54:46 ******************************************************
2/13 10:54:47 Using config file: c:\condor\condor_config
2/13 10:54:47 Using local config files: C:\Condor/condor_config.local
2/13 10:54:47 DaemonCore: Command Socket at <130.155.67.83:9741>
2/13 10:55:09 Initializing a VANILLA shadow
2/13 10:55:09 (7.0) (2676): Request to run on <130.116.147.52:9590> was
ACCEPTED
2/13 10:55:14 (7.0) (2676): ReliSock: put_file: Failed to open file
C:\Documents and Settings\odw010\.condorqueue\D78aUAA.egs, errno = 2.
2/13 10:55:14 (7.0) (2676): ERROR "DoUpload: Failed to send file
C:\Documents and Settings\odw010\.condorqueue\D78aUAA.egs, exiting at
1398
" at line 1397 in file ..\src\condor_c++_util\file_transfer.C
2/13 11:07:43 (5.0) (1076): Job 5.0 is being evicted
2/13 11:07:43 (5.0) (1076): **** condor_shadow (condor_SHADOW) EXITING
WITH STATUS 107

STARTER LOG OF EXECUTE MACHINE

2/13 06:40:56 ******************************************************
2/13 06:40:56 ** condor_starter (CONDOR_STARTER) STARTING UP
2/13 06:40:56 ** C:\Condor\bin\condor_starter.exe
2/13 06:40:56 ** $CondorVersion: 6.6.10 Jun 22 2005 $
2/13 06:40:56 ** $CondorPlatform: INTEL-WINNT50 $
2/13 06:40:56 ** PID = 4048
2/13 06:40:56 ******************************************************
2/13 06:40:56 Using config file: c:\condor\condor_config
2/13 06:40:56 Using local config files: C:\Condor/condor_config.local
2/13 06:40:56 DaemonCore: Command Socket at <130.116.147.52:9448>
2/13 06:40:56 Setting resource limits not implemented!
2/13 06:41:15 Starter communicating with condor_shadow
<130.155.67.83:9691>
2/13 06:41:15 Submitting machine is "student3-lu.minerals.csiro.au"
2/13 06:41:33 File transfer completed successfully.
2/13 06:41:33 Starting a VANILLA universe job with ID: 3.0
2/13 06:41:33 IWD: C:\Condor/execute\dir_4048
2/13 06:41:33 Output file: C:\Condor/execute\dir_4048\D7EG9AB.log
2/13 06:41:34 Renice expr "10" evaluated to 10
2/13 06:41:34 About to exec C:\Condor\execute\dir_4048\condor_exec.exe
D7EG9AB.egs
2/13 06:41:34 Create_Process succeeded, pid=2932
2/13 07:10:28 Got SIGQUIT.  Performing fast shutdown.
2/13 07:10:28 ShutdownFast all jobs.
2/13 07:10:28 Process exited, pid=2932, status=0
2/13 07:10:28 Last process exited, now Starter is exiting
2/13 07:10:28 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
2/13 07:38:11 ******************************************************
2/13 07:38:11 ** condor_starter (CONDOR_STARTER) STARTING UP
2/13 07:38:11 ** C:\Condor\bin\condor_starter.exe
2/13 07:38:11 ** $CondorVersion: 6.6.10 Jun 22 2005 $
2/13 07:38:11 ** $CondorPlatform: INTEL-WINNT50 $
2/13 07:38:11 ** PID = 3688
2/13 07:38:11 ******************************************************
2/13 07:38:11 Using config file: c:\condor\condor_config
2/13 07:38:11 Using local config files: C:\Condor/condor_config.local
2/13 07:38:11 DaemonCore: Command Socket at <130.116.147.52:9413>
2/13 07:38:11 Setting resource limits not implemented!
2/13 07:38:11 Starter communicating with condor_shadow
<130.155.67.83:9541>
2/13 07:38:11 Submitting machine is "student3-lu.minerals.csiro.au"
2/13 07:38:29 File transfer completed successfully.
2/13 07:38:29 Starting a VANILLA universe job with ID: 7.0
2/13 07:38:29 IWD: C:\Condor/execute\dir_3688
2/13 07:38:29 Output file: C:\Condor/execute\dir_3688\D78aUAA.log
2/13 07:38:29 Renice expr "10" evaluated to 10
2/13 07:38:29 About to exec C:\Condor\execute\dir_3688\condor_exec.exe
D78aUAA.egs
2/13 07:38:29 Create_Process succeeded, pid=2716
2/13 07:44:09 Process exited, pid=2716, status=0
2/13 07:44:10 ReliSock: put_file: Failed to open file
C:\Condor/execute\dir_3688\D78aUAA.condorlog, errno = 2.
2/13 07:44:10 ERROR "DoUpload: Failed to send file
C:\Condor/execute\dir_3688\D78aUAA.condorlog, exiting at 1398
" at line 1397 in file ..\src\condor_c++_util\file_transfer.C
2/13 07:44:10 ShutdownFast all jobs.
2/13 07:44:10 Error disabling account condor-reuse-vm1 (ACCESS DENIED)


SHADOW LOG OF SUBMITTING MACHINE

2/12 16:55:49 ******************************************************
2/12 16:55:49 ** condor_shadow (CONDOR_SHADOW) STARTING UP
2/12 16:55:49 ** C:\Condor\bin\condor_shadow.exe
2/12 16:55:49 ** $CondorVersion: 6.6.10 Jun 22 2005 $
2/12 16:55:49 ** $CondorPlatform: INTEL-WINNT50 $
2/12 16:55:49 ** PID = 1068
2/12 16:55:49 ******************************************************
2/12 16:55:49 Using config file: c:\condor\condor_config
2/12 16:55:49 Using local config files: C:\Condor/condor_config.local
2/12 16:55:50 DaemonCore: Command Socket at <130.155.67.83:9698>
2/12 16:56:12 Initializing a VANILLA shadow
2/12 16:56:12 (5.0) (1068): Request to run on <138.194.10.81:9018> was
ACCEPTED
2/12 16:56:40 (5.0) (1068): condor_read(): recv() returned -1, errno =
10054, assuming failure.
2/12 16:56:40 (5.0) (1068): condor_read(): recv() returned -1, errno =
10054, assuming failure.
2/12 16:56:41 (5.0) (1068): ERROR "Can no longer talk to condor_starter
on execute machine (138.194.10.81)" at line 63 in file
..\src\condor_shadow.V6.1\NTreceivers.C
2/12 16:56:42 ******************************************************
2/12 16:56:42 ** condor_shadow (CONDOR_SHADOW) STARTING UP
2/12 16:56:42 ** C:\Condor\bin\condor_shadow.exe
2/12 16:56:42 ** $CondorVersion: 6.6.10 Jun 22 2005 $
2/12 16:56:42 ** $CondorPlatform: INTEL-WINNT50 $
2/12 16:56:42 ** PID = 492
2/12 16:56:42 ******************************************************
2/12 16:56:42 Using config file: c:\condor\condor_config
2/12 16:56:42 Using local config files: C:\Condor/condor_config.local
2/12 16:56:42 DaemonCore: Command Socket at <130.155.67.83:9289>
2/12 16:57:04 Initializing a VANILLA shadow
2/12 16:57:04 (5.0) (492): Request to run on <138.194.10.81:9018> was
ACCEPTED
2/12 16:57:12 (5.0) (492): condor_read(): recv() returned -1, errno =
10054, assuming failure.
2/12 16:57:12 (5.0) (492): condor_read(): recv() returned -1, errno =
10054, assuming failure.
2/12 16:57:12 (5.0) (492): ERROR "Can no longer talk to condor_starter
on execute machine (138.194.10.81)" at line 63 in file
..\src\condor_shadow.V6.1\NTreceivers.C

STARTER LOG OF EXECUTING MACHINE

2/10 23:44:22 ******************************************************
2/10 23:44:22 ** condor_starter (CONDOR_STARTER) STARTING UP
2/10 23:44:22 ** C:\Condor\bin\condor_starter.exe
2/10 23:44:22 ** $CondorVersion: 6.6.10 Jun 22 2005 $
2/10 23:44:22 ** $CondorPlatform: INTEL-WINNT50 $
2/10 23:44:22 ** PID = 3508
2/10 23:44:22 ******************************************************
2/10 23:44:22 Using config file: C:\Condor\condor_config
2/10 23:44:22 Using local config files: C:\Condor/condor_config.local
2/10 23:44:22 DaemonCore: Command Socket at <138.194.10.81:9790>
2/10 23:44:22 Setting resource limits not implemented!
2/10 23:44:41 Starter communicating with condor_shadow
<130.155.67.83:9344>
2/10 23:44:41 Submitting machine is "student3-lu.minerals.CSIRO.AU"
2/10 23:44:47 File transfer completed successfully.
2/10 23:44:47 Starting a VANILLA universe job with ID: 4.0
2/10 23:44:47 IWD: C:\Condor/execute\dir_3508
2/10 23:44:47 Output file: C:\Condor/execute\dir_3508\D7EG9AC.log
2/10 23:44:47 Renice expr "10" evaluated to 10
2/10 23:44:47 About to exec C:\Condor\execute\dir_3508\condor_exec.exe
D7EG9AC.egs
2/10 23:44:47 Create_Process succeeded, pid=3860
2/10 23:45:08 Process exited, pid=3860, status=-1
2/10 23:45:09 ReliSock: put_file: Failed to open file
C:\Condor/execute\dir_3508\D7EG9AC.condorlog, errno = 2.
2/10 23:45:09 ERROR "DoUpload: Failed to send file
C:\Condor/execute\dir_3508\D7EG9AC.condorlog, exiting at 1398
" at line 1397 in file ..\src\condor_c++_util\file_transfer.C
2/10 23:45:09 ShutdownFast all jobs.
2/10 23:45:09 Error disabling account condor-reuse-vm1 (ACCESS DENIED)
2/10 23:45:32 ******************************************************
2/10 23:45:32 ** condor_starter (CONDOR_STARTER) STARTING UP
2/10 23:45:32 ** C:\Condor\bin\condor_starter.exe
2/10 23:45:32 ** $CondorVersion: 6.6.10 Jun 22 2005 $
2/10 23:45:32 ** $CondorPlatform: INTEL-WINNT50 $
2/10 23:45:32 ** PID = 3624
2/10 23:45:32 ******************************************************
2/10 23:45:32 Using config file: C:\Condor\condor_config
2/10 23:45:32 Using local config files: C:\Condor/condor_config.local
2/10 23:45:32 DaemonCore: Command Socket at <138.194.10.81:9438>
2/10 23:45:32 Setting resource limits not implemented!
2/10 23:45:33 Starter communicating with condor_shadow
<130.155.67.83:9216>
2/10 23:45:33 Submitting machine is "student3-lu.minerals.CSIRO.AU"
2/10 23:45:39 File transfer completed successfully.
2/10 23:45:39 Starting a VANILLA universe job with ID: 4.0
2/10 23:45:39 IWD: C:\Condor/execute\dir_3624
2/10 23:45:39 Output file: C:\Condor/execute\dir_3624\D7EG9AC.log
2/10 23:45:39 Renice expr "10" evaluated to 10
2/10 23:45:39 About to exec C:\Condor\execute\dir_3624\condor_exec.exe
D7EG9AC.egs
2/10 23:45:39 Create_Process succeeded, pid=4092
2/10 23:45:39 Process exited, pid=4092, status=-1
2/10 23:45:40 ReliSock: put_file: Failed to open file
C:\Condor/execute\dir_3624\D7EG9AC.condorlog, errno = 2.
2/10 23:45:40 ERROR "DoUpload: Failed to send file
C:\Condor/execute\dir_3624\D7EG9AC.condorlog, exiting at 1398
" at line 1397 in file ..\src\condor_c++_util\file_transfer.C
2/10 23:45:40 ShutdownFast all jobs.
2/10 23:45:40 Error disabling account condor-reuse-vm1 (ACCESS DENIED)
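
When reading logs like these, it can help to list the distinct errno
values and what they mean. The sketch below is illustrative only: the
function name is made up, and the table covers just the two codes that
appear above (2 is the C runtime's ENOENT, 10054 is the Winsock code
WSAECONNRESET). The code says what the call reported, not why.

```python
import re

# Meanings of the two error codes seen in the logs above.
# 2 is ENOENT in the C runtime; 10054 is the Winsock code WSAECONNRESET.
KNOWN_ERRNOS = {
    2: "ENOENT: no such file or directory",
    10054: "WSAECONNRESET: connection reset by peer",
}

# \s also matches newlines, so "errno =" wrapped onto the next line still matches.
ERRNO_RE = re.compile(r"errno\s*=\s*(\d+)")


def explain_errnos(text):
    """Return a sorted list of (code, meaning) for each distinct errno in text."""
    codes = {int(m) for m in ERRNO_RE.findall(text)}
    return [(c, KNOWN_ERRNOS.get(c, "unknown")) for c in sorted(codes)]


log_text = (
    "2/12 16:56:40 (5.0) (1068): condor_read(): recv() returned -1, errno =\n"
    "10054, assuming failure.\n"
    "2/10 23:45:40 ReliSock: put_file: Failed to open file\n"
    "C:\\Condor/execute\\dir_3624\\D7EG9AC.condorlog, errno = 2.\n"
)
for code, meaning in explain_errnos(log_text):
    print(code, meaning)
```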

-----------------------------------------------------------------------
Greg Hitchen
greg.hitchen@xxxxxxxx
CSIRO Exploration and Mining
Australian Resources Research Centre (ARRC)
phone: +61 8 6436 8663
fax:   +61 8 6436 8555
mob:   0407 952 748
Postal address:
PO Box 1130, Bentley WA 6102, Australia
Street Address:
26 Dick Perry Avenue, Kensington WA 6151
-----------------------------------------------------------------------