
Re: [HTCondor-users] Shadow Exception, why?



From the log below, it looks like the shadow was sending a file to the condor_starter on the execute machine, and then the condor_starter went away (errno=10054 on Windows is a connection reset). Hopefully the StarterLog.slotX files on the execute machines will shed more light. One reason for the condor_starter "going away" could be the policy expressions on the execute nodes; maybe the PREEMPT expression became true...
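
If you want to rule preemption in or out quickly, you can query the policy expressions directly on an execute node; a minimal sketch, assuming condor_config_val is on the PATH there:

    REM Show the current policy expressions on this execute machine:
    condor_config_val PREEMPT SUSPEND KILL WANT_VACATE

    REM Or search the whole effective config for eviction-related knobs:
    condor_config_val -dump | findstr /I "PREEMPT SUSPEND KILL VACATE"

(condor_config_val LOG will also tell you the directory holding the StarterLog.slotX files.)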

Todd


On 12/16/2013 8:37 PM, Ralph Finch wrote:
I think these are the relevant examples from the ShadowLog on the
submitting machine:

12/16/13 18:11:08 ******************************************************
12/16/13 18:11:08 ** condor_shadow (CONDOR_SHADOW) STARTING UP
12/16/13 18:11:08 ** C:\Condor\bin\condor_shadow.exe
12/16/13 18:11:08 ** SubsystemInfo: name=SHADOW type=SHADOW(6)
class=DAEMON(1)
12/16/13 18:11:08 ** Configuration: subsystem:SHADOW local:<NONE>
class:DAEMON
12/16/13 18:11:08 ** $CondorVersion: 8.0.4 Oct 19 2013 BuildID: 189770 $
12/16/13 18:11:08 ** $CondorPlatform: x86_64_Windows7 $
12/16/13 18:11:08 ** PID = 5980
12/16/13 18:11:08 ** Log last touched 12/16 18:11:02
12/16/13 18:11:08 ******************************************************
12/16/13 18:11:08 Using config source: C:\condor\condor_config
12/16/13 18:11:08 Using local config sources:
12/16/13 18:11:08    C:\Condor/condor_config.local
12/16/13 18:11:08 DaemonCore: command socket at <x.y.z.189:9760>
12/16/13 18:11:08 DaemonCore: private command socket at <x.y.z.189:9760>
12/16/13 18:11:08 Initializing a VANILLA shadow for job 118.5
12/16/13 18:11:08 (118.5) (5980): Request to run on
slot2@xxxxxxxxxxxxxxxxxxx <x.y.z.158:9653> was ACCEPTED
12/16/13 18:11:08 (118.5) (5980): my_popen: CreateProcess failed
12/16/13 18:11:08 (118.5) (5980): FILETRANSFER: Failed to execute
C:\Condor/bin/curl_plugin, ignoring
12/16/13 18:11:08 (118.5) (5980): FILETRANSFER: failed to add plugin
"C:\Condor/bin/curl_plugin" because: FILETRANSFER:1:Failed to execute
C:\Condor/bin/curl_plugin, ignoring
12/16/13 18:11:45 (118.8) (6256): ReliSock: put_file: TransmitFile()
failed, errno=10054
12/16/13 18:11:45 (118.11) (3172): ReliSock: put_file: TransmitFile()
failed, errno=10054
12/16/13 18:11:45 (118.10) (6252): ReliSock: put_file: TransmitFile()
failed, errno=10054
12/16/13 18:11:45 (118.13) (6020): ReliSock: put_file: TransmitFile()
failed, errno=10054
12/16/13 18:11:45 (118.14) (7624): ReliSock: put_file: TransmitFile()
failed, errno=10054
12/16/13 18:11:45 (118.12) (6456): ReliSock: put_file: TransmitFile()
failed, errno=10054
12/16/13 18:11:45 (118.9) (1192): ReliSock: put_file: TransmitFile()
failed, errno=10054
12/16/13 18:11:45 (118.7) (5136): ReliSock: put_file: TransmitFile()
failed, errno=10054
12/16/13 18:11:45 (118.11) (3172): DoUpload: SHADOW at x.y.z.189 failed to
send file(s) to <x.y.z.201:9716>: error sending
D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp;
STARTER at x.y.z.201 failed to receive file
C:\Condor\execute\dir_5160\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.8) (6256): DoUpload: SHADOW at x.y.z.189 failed to
send file(s) to <x.y.z.138:9635>: error sending
D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp;
STARTER at x.y.z.138 failed to receive file
C:\Condor\execute\dir_10020\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.10) (6252): DoUpload: SHADOW at x.y.z.189 failed to
send file(s) to <x.y.z.201:9666>: error sending
D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp;
STARTER at x.y.z.201 failed to receive file
C:\Condor\execute\dir_784\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.13) (6020): DoUpload: SHADOW at x.y.z.189 failed to
send file(s) to <x.y.z.189:9768>: error sending
D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp;
STARTER at x.y.z.189 failed to receive file
C:\Condor\execute\dir_7364\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.14) (7624): DoUpload: SHADOW at x.y.z.189 failed to
send file(s) to <x.y.z.189:9635>: error sending
D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp;
STARTER at x.y.z.189 failed to receive file
C:\Condor\execute\dir_7344\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.12) (6456): DoUpload: SHADOW at x.y.z.189 failed to
send file(s) to <x.y.z.201:9782>: error sending
D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp;
STARTER at x.y.z.201 failed to receive file
C:\Condor\execute\dir_5604\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.9) (1192): DoUpload: SHADOW at x.y.z.189 failed to
send file(s) to <x.y.z.138:9650>: error sending
D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp;
STARTER at x.y.z.138 failed to receive file
C:\Condor\execute\dir_13532\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.7) (5136): DoUpload: SHADOW at x.y.z.189 failed to
send file(s) to <x.y.z.158:9800>: error sending
D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp;
STARTER at x.y.z.158 failed to receive file
C:\Condor\execute\dir_4964\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.11) (3172): ERROR "Error from
slot3@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file
c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.10) (6252): ERROR "Error from
slot1@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file
c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.8) (6256): ERROR "Error from
slot2@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file
c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.14) (7624): ERROR "Error from
slot4@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file
c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.13) (6020): ERROR "Error from
slot1@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file
c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.12) (6456): ERROR "Error from
slot4@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file
c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.9) (1192): ERROR "Error from
slot3@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file
c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.7) (5136): ERROR "Error from
slot4@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file
c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:50 ******************************************************



On Mon, Dec 16, 2013 at 6:30 PM, Ralph Finch <ralphmariafinch@xxxxxxxxx> wrote:

All Windows 7x64 pool
$CondorVersion: 8.0.4 Oct 19 2013 BuildID: 189770 $
$CondorPlatform: x86_64_Windows7 $


I've been getting lots of shadow exceptions; here's a typical one (from the
job log file):

000 (117.019.000) 12/16 18:02:12 Job submitted from host: <x.y.z.189:9728>
...
007 (117.019.000) 12/16 18:08:08 Shadow exception!
     Error from slot4@xxxxxxxxxxxxxxxxx: Failed to transfer files
     0  -  Run Bytes Sent By Job
     13252  -  Run Bytes Received By Job
...

The ShadowLog on the submit machine (.189) (bdomo-002):

12/16/13 18:18:22 (117.1) (6616): Job 117.1 is being evicted from
slot2@xxxxxxxxxxxxxxxxxxx
12/16/13 18:18:22 (117.1) (6616): **** condor_shadow (condor_SHADOW) pid
6616 EXITING WITH STATUS 102
12/16/13 18:19:38 (117.5) (8068): Job 117.5 is being evicted from
slot2@xxxxxxxxxxxxxxxxxxx
12/16/13 18:19:38 (117.5) (8068): **** condor_shadow (condor_SHADOW) pid
8068 EXITING WITH STATUS 102
12/16/13 18:19:40 (117.11) (7936): Job 117.11 is being evicted from
slot4@xxxxxxxxxxxxxxxxxxx
12/16/13 18:19:40 (117.11) (7936): **** condor_shadow (condor_SHADOW) pid
7936 EXITING WITH STATUS 102
12/16/13 18:23:01 (117.2) (6880): Job 117.2 is being evicted from
slot3@xxxxxxxxxxxxxxxxxxx
12/16/13 18:23:01 (117.2) (6880): **** condor_shadow (condor_SHADOW) pid
6880 EXITING WITH STATUS 102
12/16/13 18:23:12 (117.3) (6196): Job 117.3 is being evicted from
slot4@xxxxxxxxxxxxxxxxxxx
12/16/13 18:23:12 (117.3) (6196): **** condor_shadow (condor_SHADOW) pid
6196 EXITING WITH STATUS 102

We have a typical nominal 1 Gb/s switch for our LAN. Each submitted job
transfers a couple dozen files, at most 200 MB total. Twenty jobs are
submitted to the queue at one time.
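
For the shape of it, the submit description looks roughly like the sketch
below; this is illustrative only (the executable name and the second input
name are placeholders, not our real files):

    universe                = vanilla
    executable              = run_pest.bat
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    # PEST_Qual_Out.inp plus roughly two dozen more inputs, <= 200 MB total:
    transfer_input_files    = PEST_Qual_Out.inp, other_inputs.dat
    log                     = pest_$(Cluster).log
    queue 20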

Should this really cause a problem? Is there a way to confirm whether a
failure to transfer files really is the problem? I suspect it isn't. Even
though Condor starts new execute jobs, the master program (run
interactively from a command prompt window) usually doesn't see them. So I
submit another 20 jobs, kill the old set, and everything is fine: no shadow
exceptions, and the master program finds its condorized slaves. Maybe the
shadows on my submit machine are giving up too quickly because of some delay?
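
If it would help, I suppose I could turn up the debug levels on both sides
to get more detail on the next failure; my guess at the relevant knobs:

    # condor_config.local on the submit machine:
    SHADOW_DEBUG = D_FULLDEBUG

    # condor_config.local on the execute machines:
    STARTER_DEBUG = D_FULLDEBUG

Then a condor_reconfig on each machine should pick up the change without a
restart.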

Ralph Finch
Calif. Dept. of Water Resources
Sacramento, Calif. USA







--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685