Re: [HTCondor-users] Downloading big files is interrupted

On 1/9/2014 2:42 AM, Leon Thielen wrote:
We are running HTCondor version 8.1.2.
The Condor master and client are Windows 7 machines.
The submit host is the master.
All the files reside on a Linux machine. Linux and Windows are connected via Samba.

It may help if you post the corresponding snippet from the ShadowLog on the submit machine (from the same timeframe as the starter log snippet below). The file transfer happens between the condor_shadow (on the submit machine) and the condor_starter (on the execute machine), and the starter log snippet below implies that the shadow stopped sending...

Also, if there is a shared filesystem between the submit and execute nodes (via Samba or whatever), I am kinda wondering if/why you want HTCondor to transfer the files in the first place. I.e., you could just leverage the shared filesystem...
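
A minimal sketch of what that could look like in the submit description file, assuming the job can open the big file directly from the share. The executable name and the UNC path are hypothetical placeholders, not taken from your setup, and the account the job runs under on the execute machine would need read access to the share:

```
# Hypothetical example: names and paths are placeholders
universe                = vanilla
executable              = analyze.exe

# Pass the location of the big file on the share as an argument
# instead of listing it in transfer_input_files
arguments               = \\linuxhost\data\reference-big.zip

# Small files can still be transferred by HTCondor as before
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

queue
```

With this layout HTCondor only moves the executable and any small inputs, and the job reads the large file over Samba itself.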


Transfers of big files via transfer_input_files are interrupted after only part of the file has been received. Running a job with a small input file (15,294,016 bytes) works.
Running with a bigger file (9,195,290,624) we get
get_file(): ERROR: received 605356032 bytes, expected 9195290624!
Running job with an even bigger file we get
get_file(): ERROR: received 902758400 bytes, expected 30967531520!

StarterLog.slot1_1 :

01/08/14 09:59:26 setting the orig job name in starter
01/08/14 09:59:26 setting the orig job iwd in starter
01/08/14 09:59:26 Chirp config summary: IO false, Updates false, Delayed updates true.
01/08/14 09:59:26 Initialized IO Proxy.
01/08/14 09:59:26 Setting resource limits not implemented!
01/08/14 10:00:13 condor_read(): timeout reading 65536 bytes from <>.
01/08/14 10:00:13 ReliSock::get_bytes_nobuffer: Failed to receive file.
01/08/14 10:00:13 get_file(): ERROR: received 902758400 bytes, expected 30967531520!
01/08/14 10:00:14 DoDownload: STARTER at failed to receive file C:\condor\execute\dir_2636\reference-big.zip
01/08/14 10:00:14 File transfer failed (status=0).
01/08/14 10:00:14 ERROR "Failed to transfer files" at line 2120 in file c:\condor\execute\dir_27920\userdir\src\condor_starter.v6.1\jic_shadow.cpp
01/08/14 10:00:14 ShutdownFast all jobs.
01/08/14 10:00:14 condor_read() failed: recv(fd=1064) returned -1, errno = 10054 , reading 5 bytes from <>.
01/08/14 10:00:14 IO: Failed to read packet header

Can somebody help me to solve this issue?
Thanks Leon

