
[Condor-users] Output files not returned on vanilla pool



$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
$CondorPlatform: INTEL-WINNT50 $

I am submitting 10 jobs from a loop in a Windows command prompt. Only 1 or 2 return their output; the rest fail, with a typical error message in the job's Condor log file as follows:

log file
=====
007 (775.000.000) 01/27 10:43:04 Shadow exception!
    Error from starter on slot2@xxxxxxxxxxxxxxxxxxxxxxxxx: STARTER at 136.200.32.102 failed to send file(s) to <136.200.32.179:2187>: error reading from Z:\Condor\execute\dir_5600\HIST-CLB2K-ManN_Ch001.dss: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <136.200.32.102:2517>
    0  -  Run Bytes Sent By Job
    476290016  -  Run Bytes Received By Job
...
012 (775.000.000) 01/27 10:43:04 Job was held.
    Error from starter on slot2@xxxxxxxxxxxxxxxxxxxxxxxxx: STARTER at 136.200.32.102 failed to send file(s) to <136.200.32.179:2187>: error reading from Z:\Condor\execute\dir_5600\HIST-CLB2K-ManN_Ch001.dss: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <136.200.32.102:2517>
    Code 13 Subcode 2


starter log for execute machine, slot2
===========================

1/27 10:39:34 ******************************************************
1/27 10:39:34 ** condor_starter (CONDOR_STARTER) STARTING UP
1/27 10:39:34 ** Z:\Condor\bin\condor_starter.exe
1/27 10:39:34 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
1/27 10:39:34 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
1/27 10:39:34 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
1/27 10:39:34 ** $CondorPlatform: INTEL-WINNT50 $
1/27 10:39:34 ** PID = 5600
1/27 10:39:34 ** Log last touched 1/27 10:19:46
1/27 10:39:34 ******************************************************
1/27 10:39:34 Using config source: Z:\condor\condor_config
1/27 10:39:34 Using local config sources:
1/27 10:39:34    Z:/Condor/condor_config.local
1/27 10:39:34 DaemonCore: Command Socket at <136.200.32.102:2466>
1/27 10:39:34 GLEXEC_JOB not supported on this platform; ignoring
1/27 10:39:34 Setting resource limits not implemented!
1/27 10:39:34 Communicating with shadow <136.200.32.179:2187>
1/27 10:39:34 Submitting machine is "abbey.ad.water.ca.gov"
1/27 10:39:34 setting the orig job name in starter
1/27 10:39:34 setting the orig job iwd in starter
1/27 10:40:01 File transfer completed successfully.
1/27 10:40:01 Job 775.0 set to execute immediately
1/27 10:40:01 Starting a VANILLA universe job with ID: 775.0
1/27 10:40:01 Tracking process family by login "condor-reuse-slot2"
1/27 10:40:01 IWD: Z:\Condor\execute\dir_5600
1/27 10:40:01 Output file: Z:\Condor\execute\dir_5600\dsm2.out
1/27 10:40:01 Error file: Z:\Condor\execute\dir_5600\dsm2.err
1/27 10:40:01 Renice expr "10" evaluated to 10
1/27 10:40:01 About to exec C:\WINDOWS\system32\cmd.exe /Q /C condor_exec.bat
1/27 10:40:01 Create_Process succeeded, pid=2792
1/27 10:43:04 Process exited, pid=2792, status=0
1/27 10:43:04 ReliSock: put_file: Failed to open file Z:\Condor\execute\dir_5600\HIST-CLB2K-ManN_Ch001.dss, errno = 2.
1/27 10:43:04 DoUpload: (Condor error code 13, subcode 2) STARTER at 136.200.32.102 failed to send file(s) to <136.200.32.179:2187>: error reading from Z:\Condor\execute\dir_5600\HIST-CLB2K-ManN_Ch001.dss: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <136.200.32.102:2517>
1/27 10:43:04 JIC::transferOutput() failed, waiting for job lease to expire or for a reconnect attempt
1/27 10:43:04 Got SIGQUIT.  Performing fast shutdown.
1/27 10:43:04 ShutdownFast all jobs.
1/27 10:43:04 condor_read(): recv() returned -1, errno = 10054, assuming failure reading 5 bytes from <136.200.32.179:2191>.
1/27 10:43:04 IO: Failed to read packet header
1/27 10:43:04 Failed to send job exit status to shadow
1/27 10:43:04 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
1/27 10:43:04 **** condor_starter (condor_STARTER) pid 5600 EXITING WITH STATUS 0

The problem seems obvious from the above: the starter failed to open Z:\Condor\execute\dir_5600\HIST-CLB2K-ManN_Ch001.dss,
so of course it could not transfer it.  But a later run, submitting just one job to the same machine and slot, works fine:

1/27 11:22:20 ******************************************************
1/27 11:22:20 ** condor_starter (CONDOR_STARTER) STARTING UP
1/27 11:22:20 ** Z:\Condor\bin\condor_starter.exe
1/27 11:22:20 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
1/27 11:22:20 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
1/27 11:22:20 ** $CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $
1/27 11:22:20 ** $CondorPlatform: INTEL-WINNT50 $
1/27 11:22:20 ** PID = 3972
1/27 11:22:20 ** Log last touched 1/27 10:43:04
1/27 11:22:20 ******************************************************
1/27 11:22:20 Using config source: Z:\condor\condor_config
1/27 11:22:20 Using local config sources:
1/27 11:22:20    Z:/Condor/condor_config.local
1/27 11:22:20 DaemonCore: Command Socket at <136.200.32.102:2802>
1/27 11:22:20 GLEXEC_JOB not supported on this platform; ignoring
1/27 11:22:20 Setting resource limits not implemented!
1/27 11:22:20 Communicating with shadow <136.200.32.179:2938>
1/27 11:22:20 Submitting machine is "abbey.ad.water.ca.gov"
1/27 11:22:20 setting the orig job name in starter
1/27 11:22:20 setting the orig job iwd in starter
1/27 11:22:46 File transfer completed successfully.
1/27 11:22:46 Job 784.0 set to execute immediately
1/27 11:22:46 Starting a VANILLA universe job with ID: 784.0
1/27 11:22:46 Tracking process family by login "condor-reuse-slot2"
1/27 11:22:46 IWD: Z:\Condor\execute\dir_3972
1/27 11:22:46 Output file: Z:\Condor\execute\dir_3972\dsm2.out
1/27 11:22:46 Error file: Z:\Condor\execute\dir_3972\dsm2.err
1/27 11:22:46 Renice expr "10" evaluated to 10
1/27 11:22:46 About to exec C:\WINDOWS\system32\cmd.exe /Q /C condor_exec.bat
1/27 11:22:46 Create_Process succeeded, pid=2848
1/27 11:24:35 Process exited, pid=2848, status=0
1/27 11:24:36 Got SIGQUIT.  Performing fast shutdown.
1/27 11:24:36 ShutdownFast all jobs.
1/27 11:24:36 **** condor_starter (condor_STARTER) pid 3972 EXITING WITH STATUS 0

The desired output file is 8 MB in size, and there is room on each execute machine's hard drive.  Why would
opening that output file fail most of the time, but not every time?

The .sub file:
=========

universe = vanilla
getenv = true
SHOULD_TRANSFER_FILES  =  ALWAYS
WHEN_TO_TRANSFER_OUTPUT  =  ON_EXIT_OR_EVICT
Rank = kflops
#Requirements = (Machine != "DELTA-MOD.ad.water.ca.gov")
executable = condor_dsm2.bat
error = dsm2$(ClusterID).err
log = dsm2$(ClusterID).log
output = dsm2$(ClusterID).out
transfer_input_files = d:\delta\dsm2_v8\bin\hydro.exe, d:\delta\dsm2_v8\bin\qual.exe, config.inp, hydro.inp, qual_ec.inp <snip>

transfer_output_files = HIST-CLB2K-ManN_Ch009.dss

queue
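
To cross-check which file each failed job's starter actually tried to send back against what the submit file asks for, a small script along the following lines could be run over the starter logs. This is only a minimal sketch in Python, which is not part of the setup above. The function names failed_uploads and unexpected_files are made up for illustration, and the pattern is based solely on the "put_file: Failed to open file" line shown in the 7.2.4 starter log above, so other versions may word it differently. (In the logs above the starter was looking for HIST-CLB2K-ManN_Ch001.dss, while the .sub file shown lists HIST-CLB2K-ManN_Ch009.dss; whether that difference matters is exactly what such a check is meant to show.)

```python
import ntpath
import re

# Pattern taken from the starter-log line quoted above (Condor 7.2.4 on Windows).
# Assumption: other Condor versions may format this message differently.
FAILED_OPEN = re.compile(
    r"put_file: Failed to open file (?P<path>.+?), errno = (?P<errno>\d+)"
)


def failed_uploads(log_text):
    """Return (file name, errno) for each output file the starter could not open."""
    results = []
    for line in log_text.splitlines():
        match = FAILED_OPEN.search(line)
        if match:
            # ntpath splits on Windows separators even if this runs on another OS.
            name = ntpath.basename(match.group("path"))
            results.append((name, int(match.group("errno"))))
    return results


def unexpected_files(log_text, expected):
    """Return failed file names that are not in the expected output list.

    The comparison ignores case because Windows file names are case-insensitive.
    """
    expected_lower = {name.lower() for name in expected}
    return [
        name
        for name, _errno in failed_uploads(log_text)
        if name.lower() not in expected_lower
    ]


if __name__ == "__main__":
    # The sample line is copied from the starter log above.
    sample = (
        r"1/27 10:43:04 ReliSock: put_file: Failed to open file "
        r"Z:\Condor\execute\dir_5600\HIST-CLB2K-ManN_Ch001.dss, errno = 2."
    )
    print(failed_uploads(sample))
    print(unexpected_files(sample, ["HIST-CLB2K-ManN_Ch009.dss"]))
```

Passing the contents of a starter log and the names listed in transfer_output_files to unexpected_files shows any file the starter looked for that the submit file for that job did not name.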