
Re: [HTCondor-users] Shadow Exception, why?



I think these are the examples I should give, from the ShadowLog on the submit machine:

12/16/13 18:11:08 ******************************************************
12/16/13 18:11:08 ** condor_shadow (CONDOR_SHADOW) STARTING UP
12/16/13 18:11:08 ** C:\Condor\bin\condor_shadow.exe
12/16/13 18:11:08 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
12/16/13 18:11:08 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
12/16/13 18:11:08 ** $CondorVersion: 8.0.4 Oct 19 2013 BuildID: 189770 $
12/16/13 18:11:08 ** $CondorPlatform: x86_64_Windows7 $
12/16/13 18:11:08 ** PID = 5980
12/16/13 18:11:08 ** Log last touched 12/16 18:11:02
12/16/13 18:11:08 ******************************************************
12/16/13 18:11:08 Using config source: C:\condor\condor_config
12/16/13 18:11:08 Using local config sources:
12/16/13 18:11:08    C:\Condor/condor_config.local
12/16/13 18:11:08 DaemonCore: command socket at <x.y.z.189:9760>
12/16/13 18:11:08 DaemonCore: private command socket at <x.y.z.189:9760>
12/16/13 18:11:08 Initializing a VANILLA shadow for job 118.5
12/16/13 18:11:08 (118.5) (5980): Request to run on slot2@xxxxxxxxxxxxxxxxxxx <x.y.z.158:9653> was ACCEPTED
12/16/13 18:11:08 (118.5) (5980): my_popen: CreateProcess failed
12/16/13 18:11:08 (118.5) (5980): FILETRANSFER: Failed to execute C:\Condor/bin/curl_plugin, ignoring
12/16/13 18:11:08 (118.5) (5980): FILETRANSFER: failed to add plugin "C:\Condor/bin/curl_plugin" because: FILETRANSFER:1:Failed to execute C:\Condor/bin/curl_plugin, ignoring
12/16/13 18:11:45 (118.8) (6256): ReliSock: put_file: TransmitFile() failed, errno=10054
12/16/13 18:11:45 (118.11) (3172): ReliSock: put_file: TransmitFile() failed, errno=10054
12/16/13 18:11:45 (118.10) (6252): ReliSock: put_file: TransmitFile() failed, errno=10054
12/16/13 18:11:45 (118.13) (6020): ReliSock: put_file: TransmitFile() failed, errno=10054
12/16/13 18:11:45 (118.14) (7624): ReliSock: put_file: TransmitFile() failed, errno=10054
12/16/13 18:11:45 (118.12) (6456): ReliSock: put_file: TransmitFile() failed, errno=10054
12/16/13 18:11:45 (118.9) (1192): ReliSock: put_file: TransmitFile() failed, errno=10054
12/16/13 18:11:45 (118.7) (5136): ReliSock: put_file: TransmitFile() failed, errno=10054
12/16/13 18:11:45 (118.11) (3172): DoUpload: SHADOW at x.y.z.189 failed to send file(s) to <x.y.z.201:9716>: error sending D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp; STARTER at x.y.z.201 failed to receive file C:\Condor\execute\dir_5160\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.8) (6256): DoUpload: SHADOW at x.y.z.189 failed to send file(s) to <x.y.z.138:9635>: error sending D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp; STARTER at x.y.z.138 failed to receive file C:\Condor\execute\dir_10020\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.10) (6252): DoUpload: SHADOW at x.y.z.189 failed to send file(s) to <x.y.z.201:9666>: error sending D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp; STARTER at x.y.z.201 failed to receive file C:\Condor\execute\dir_784\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.13) (6020): DoUpload: SHADOW at x.y.z.189 failed to send file(s) to <x.y.z.189:9768>: error sending D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp; STARTER at x.y.z.189 failed to receive file C:\Condor\execute\dir_7364\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.14) (7624): DoUpload: SHADOW at x.y.z.189 failed to send file(s) to <x.y.z.189:9635>: error sending D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp; STARTER at x.y.z.189 failed to receive file C:\Condor\execute\dir_7344\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.12) (6456): DoUpload: SHADOW at x.y.z.189 failed to send file(s) to <x.y.z.201:9782>: error sending D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp; STARTER at x.y.z.201 failed to receive file C:\Condor\execute\dir_5604\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.9) (1192): DoUpload: SHADOW at x.y.z.189 failed to send file(s) to <x.y.z.138:9650>: error sending D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp; STARTER at x.y.z.138 failed to receive file C:\Condor\execute\dir_13532\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.7) (5136): DoUpload: SHADOW at x.y.z.189 failed to send file(s) to <x.y.z.158:9800>: error sending D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp; STARTER at x.y.z.158 failed to receive file C:\Condor\execute\dir_4964\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.11) (3172): ERROR "Error from slot3@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.10) (6252): ERROR "Error from slot1@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.8) (6256): ERROR "Error from slot2@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.14) (7624): ERROR "Error from slot4@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.13) (6020): ERROR "Error from slot1@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.12) (6456): ERROR "Error from slot4@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.9) (1192): ERROR "Error from slot3@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.7) (5136): ERROR "Error from slot4@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:50 ******************************************************
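With this many interleaved shadows, it can help to tally the failures by destination host and by error code. The following is a minimal sketch, assuming the line layout seen in the excerpt above (date, time, job id, pid, message); the function name `summarize` and the regular expressions are my own, not part of HTCondor. On Windows, errno 10054 is the Winsock code WSAECONNRESET (connection reset by the peer).

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt above:
#   12/16/13 18:11:45 (118.11) (3172): <message>
LINE_RE = re.compile(
    r"^\d\d/\d\d/\d\d \d\d:\d\d:\d\d "
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): (?P<msg>.*)$"
)
# Destination starter address in a DoUpload failure message.
DEST_RE = re.compile(r"failed to send file\(s\) to <(?P<host>[^:>]+):\d+>")
# Socket error code reported by ReliSock.
ERRNO_RE = re.compile(r"TransmitFile\(\) failed, errno=(?P<errno>\d+)")


def summarize(lines):
    """Count upload failures per destination host and per errno.

    Returns (by_host, by_errno, jobs), where jobs is the set of job ids
    that logged a DoUpload failure.
    """
    by_host = Counter()
    by_errno = Counter()
    jobs = set()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # banner lines and lines without a job id
        msg = m.group("msg")
        dest = DEST_RE.search(msg)
        if dest:
            by_host[dest.group("host")] += 1
            jobs.add(m.group("job"))
        err = ERRNO_RE.search(msg)
        if err:
            by_errno[int(err.group("errno"))] += 1
    return by_host, by_errno, jobs


if __name__ == "__main__":
    # Hypothetical path; point this at the real ShadowLog.
    with open("ShadowLog", encoding="utf-8", errors="replace") as fh:
        hosts, errnos, jobs = summarize(fh)
    print("failures by destination host:", dict(hosts))
    print("failures by errno:", dict(errnos))
    print("jobs affected:", sorted(jobs))
```

If the failures cluster on one execute host, that host is worth a closer look; if they are spread across all hosts at the same timestamp, as they appear to be here, the submit side or the network is the more likely common factor.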



On Mon, Dec 16, 2013 at 6:30 PM, Ralph Finch <ralphmariafinch@xxxxxxxxx> wrote:
The pool is all Windows 7 x64:
$CondorVersion: 8.0.4 Oct 19 2013 BuildID: 189770 $
$CondorPlatform: x86_64_Windows7 $


I've been getting lots of shadow exceptions. Here's a typical one, from the job log file:

000 (117.019.000) 12/16 18:02:12 Job submitted from host: <x.y.z.189:9728>
...
007 (117.019.000) 12/16 18:08:08 Shadow exception!
    Error from slot4@xxxxxxxxxxxxxxxxx: Failed to transfer files
    0  -  Run Bytes Sent By Job
    13252  -  Run Bytes Received By Job
...

The ShadowLog on the submit machine (.189) (bdomo-002):

12/16/13 18:18:22 (117.1) (6616): Job 117.1 is being evicted from slot2@xxxxxxxxxxxxxxxxxxx
12/16/13 18:18:22 (117.1) (6616): **** condor_shadow (condor_SHADOW) pid 6616 EXITING WITH STATUS 102
12/16/13 18:19:38 (117.5) (8068): Job 117.5 is being evicted from slot2@xxxxxxxxxxxxxxxxxxx
12/16/13 18:19:38 (117.5) (8068): **** condor_shadow (condor_SHADOW) pid 8068 EXITING WITH STATUS 102
12/16/13 18:19:40 (117.11) (7936): Job 117.11 is being evicted from slot4@xxxxxxxxxxxxxxxxxxx
12/16/13 18:19:40 (117.11) (7936): **** condor_shadow (condor_SHADOW) pid 7936 EXITING WITH STATUS 102
12/16/13 18:23:01 (117.2) (6880): Job 117.2 is being evicted from slot3@xxxxxxxxxxxxxxxxxxx
12/16/13 18:23:01 (117.2) (6880): **** condor_shadow (condor_SHADOW) pid 6880 EXITING WITH STATUS 102
12/16/13 18:23:12 (117.3) (6196): Job 117.3 is being evicted from slot4@xxxxxxxxxxxxxxxxxxx
12/16/13 18:23:12 (117.3) (6196): **** condor_shadow (condor_SHADOW) pid 6196 EXITING WITH STATUS 102

Our LAN runs through a typical switch, nominally 1 Gb/s. Each submitted job transfers a couple of dozen files, at most 200 MB in total. I submit 20 jobs to the queue at one time.
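As a rough check on those figures, here is a back-of-the-envelope sketch of the minimum time needed to push every job's input through the submit machine's single link. It assumes 1 MB = 10^6 bytes, that all 20 transfers share one 1 Gb/s link, and no protocol or disk overhead; the function name is only illustrative. It gives a lower bound on transfer time and says nothing by itself about why a transfer would fail.

```python
def min_transfer_seconds(jobs, mb_per_job, link_gbps):
    """Lower bound on the time to send all input files over one shared link.

    Assumes 1 MB = 10**6 bytes and ignores protocol and disk overhead.
    """
    total_bits = jobs * mb_per_job * 1e6 * 8
    return total_bits / (link_gbps * 1e9)


if __name__ == "__main__":
    # 20 jobs, at most 200 MB each, nominal 1 Gb/s link.
    print(min_transfer_seconds(20, 200, 1.0), "seconds at best")
```

With the numbers above this comes to about 32 seconds at best, so a burst of 20 simultaneous submissions can keep the link saturated for half a minute or more.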

Should this really cause a problem? Is there a way to find out whether a failure to transfer files REALLY is the problem? I suspect it is not. Even though Condor starts new execute jobs, the master program (run interactively from a command prompt window) usually doesn't see them. So I submit another 20, kill the old set, and everything is good: no shadow exceptions, and the master program finds its condorized slaves. Maybe the shadows on my submit machine are giving up too quickly because of some delay?

Ralph Finch
Calif. Dept. of Water Resources
Sacramento, Calif. USA