Re: [HTCondor-users] Shadow Exception, why?


OK, I looked at another one.

First, the condor log file of the job:

000 (129.010.000) 12/17 14:35:35 Job submitted from host: <1.2.3.189:9728>
...
007 (129.010.000) 12/17 14:39:52 Shadow exception!
    Error from slot1@xxxxxxxxxxxxxxxxxxx: Failed to transfer files
    0 - Run Bytes Sent By Job
    12057 - Run Bytes Received By Job
...

Then, the StarterLog.1 of the bdomo-005 machine:

12/17/13 14:38:57 ******************************************************
12/17/13 14:38:57 ** condor_starter (CONDOR_STARTER) STARTING UP
12/17/13 14:38:57 ** C:\Condor\bin\condor_starter.exe
12/17/13 14:38:57 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
12/17/13 14:38:57 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
12/17/13 14:38:57 ** $CondorVersion: 8.0.2 Aug 15 2013 BuildID: 162062 $
12/17/13 14:38:57 ** $CondorPlatform: x86_64_Windows7 $
12/17/13 14:38:57 ** PID = 4212
12/17/13 14:38:57 ** Log last touched 12/17 14:35:08
12/17/13 14:38:57 ******************************************************
12/17/13 14:38:57 Using config source: C:\condor\condor_config
12/17/13 14:38:57 Using local config sources:
12/17/13 14:38:57 C:\Condor/condor_config.local
12/17/13 14:38:57 DaemonCore: command socket at <1.2.3.199:9771>
12/17/13 14:38:57 DaemonCore: private command socket at <1.2.3.199:9771>
12/17/13 14:38:57 GLEXEC_JOB not supported on this platform; ignoring
12/17/13 14:38:57 Communicating with shadow <1.2.3.189:9611>
12/17/13 14:38:57 Submitting machine is "1.2.3.189"
12/17/13 14:38:57 setting the orig job name in starter
12/17/13 14:38:57 setting the orig job iwd in starter
12/17/13 14:38:57 Setting resource limits not implemented!
12/17/13 14:38:57 my_popen: CreateProcess failed
12/17/13 14:38:57 FILETRANSFER: Failed to execute C:\Condor/bin/curl_plugin, ignoring
12/17/13 14:38:57 FILETRANSFER: failed to add plugin "C:\Condor/bin/curl_plugin" because: FILETRANSFER:1:Failed to execute C:\Condor/bin/curl_plugin, ignoring
12/17/13 14:39:47 condor_read(): timeout reading 1195 bytes from <1.2.3.189:9611>.
12/17/13 14:39:47 ReliSock::get_bytes_nobuffer: Failed to receive file.
12/17/13 14:39:47 get_file(): ERROR: received 0 bytes, expected 1195!
12/17/13 14:39:47 DoDownload: STARTER at 1.2.3.199 failed to receive file C:\Condor\execute\dir_4212\PEST_Hydro_Out.inp
12/17/13 14:39:47 File transfer failed (status=0).
12/17/13 14:39:47 ERROR "Failed to transfer files" at line 2050 in file c:\condor\execute\dir_20420\userdir\src\condor_starter.v6.1\jic_shadow.cpp
12/17/13 14:39:47 ShutdownFast all jobs.
12/17/13 14:39:52 condor_read() failed: recv(fd=744) returned -1, errno = 10054 , reading 5 bytes from <1.2.3.189:9691>.
12/17/13 14:39:52 IO: Failed to read packet header
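
The stall in the log above (nothing logged between 14:38:57 and 14:39:47) is easy to miss by eye. Below is a minimal sketch, in Python, of a script that flags long pauses between consecutive log lines. It is not part of HTCondor; the function name `find_gaps` and the 10-second threshold are assumptions chosen for illustration, and the timestamp format is taken from the excerpt above.

```python
from datetime import datetime

# Timestamp layout seen in the StarterLog excerpt: MM/DD/YY HH:MM:SS
STAMP_FORMAT = "%m/%d/%y %H:%M:%S"
STAMP_LENGTH = 17  # length of a stamp such as "12/17/13 14:38:57"


def find_gaps(lines, threshold_seconds=10):
    """Return (seconds, previous_line, next_line) for each pause above the threshold.

    find_gaps is a hypothetical helper, not an HTCondor tool.
    """
    gaps = []
    previous = None  # (timestamp, text) of the last line that parsed
    for line in lines:
        line = line.rstrip("\n")
        try:
            stamp = datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None:
            delta = (stamp - previous[0]).total_seconds()
            if delta > threshold_seconds:
                gaps.append((delta, previous[1], line))
        previous = (stamp, line)
    return gaps


# Three lines copied from the StarterLog excerpt above.
sample = [
    "12/17/13 14:38:57 my_popen: CreateProcess failed",
    "12/17/13 14:39:47 condor_read(): timeout reading 1195 bytes from <1.2.3.189:9611>.",
    "12/17/13 14:39:47 ReliSock::get_bytes_nobuffer: Failed to receive file.",
]

for seconds, before, after in find_gaps(sample):
    print(f"{seconds:.0f}s pause")
    print(f"  last line before: {before}")
    print(f"  first line after: {after}")
```

To run it against a real log, pass an open file instead of `sample`, for example `find_gaps(open(path))`, where `path` points at the StarterLog on the execute machine.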

Note: I have now set the following in condor_config:

SHADOW_RENICE_INCREMENT    = 0
POLLING_INTERVAL    = 5
JOB_START_DELAY        = 20
MAX_SHADOW_EXCEPTIONS    = 0
SHADOW_SIZE_ESTIMATE    = 150000

The starter clearly failed to receive one of many files, but I don't see why, and I'm not sure anything in HTCondor could reveal more details. I set JOB_START_DELAY to 20 seconds specifically to give the shadows plenty of time to spin up, but it didn't help. All our machines have 12 GB of RAM, 4 cores, and a nominal 1 Gb/s switch to the LAN.

Is there a setting that could tell HTCondor to resend a file that failed to transfer?

On Tue, Dec 17, 2013 at 10:24 AM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:

From the below, it looks like the shadow was sending the file to the condor_starter on the execute machine, and then the condor_starter went away (errno=10054 on Windows is connection reset). Hopefully the StarterLog.slotX on the execute machines would shed more light. One reason for the condor_starter "going away" is perhaps due to the policy expressions on the execute nodes; maybe the PREEMPT expression became true...

Todd
