
[HTCondor-users] 'Failed to transfer files' issue



 

Hello,

 

 

We are running Condor on Windows and having some issues.

 

We currently have Condor 8.6.3 installed, with the central manager, submit, and execute nodes on separate machines. All of these machines run Windows 7, and Condor was installed from the 64-bit MSI package. Adding our user credentials to CREDD and submitting jobs works, but jobs run slowly compared to running the applications directly on the execute nodes outside of Condor. Scanning through the Condor logs, we see errors related to file transfers.

 

 

In the StarterLog files on the execute nodes, we see many sections like the following:

 

05/18/17 14:07:17 (pid:6932) condor_read(): timeout reading 5 bytes from <10.85.1.216:9618>.

05/18/17 14:07:17 (pid:6932) IO: Failed to read packet header

05/18/17 14:07:17 (pid:6932) Failed to receive filesize in ReliSock::get_file

05/18/17 14:07:17 (pid:6932) DoDownload: STARTER at 10.85.1.224 failed to receive file C:\condor\execute\dir_6932\MC3ADV.DLL

05/18/17 14:07:17 (pid:6932) File transfer failed (status=0).

05/18/17 14:07:17 (pid:6932) ERROR "Failed to transfer files" at line 2364 in file C:\condor\execute\dir_13584\sources\src\condor_starter.V6.1\jic_shadow.cpp

05/18/17 14:07:17 (pid:6932) ShutdownFast all jobs.

05/18/17 14:07:17 (pid:6932) condor_read() failed: recv(fd=772) returned -1, errno = 10054 , reading 5 bytes from <10.85.1.216:55287>.

05/18/17 14:07:17 (pid:6932) IO: Failed to read packet header

05/18/17 14:07:17 (pid:6932) Lost connection to shadow, waiting 2400 secs for reconnect

 

05/18/17 14:39:02 (pid:6624) condor_read(): timeout reading 5 bytes from <10.85.1.216:9618>.

05/18/17 14:39:02 (pid:6624) IO: Failed to read packet header

05/18/17 14:39:02 (pid:6624) Failed to receive filesize in ReliSock::get_file

05/18/17 14:39:02 (pid:6624) DoDownload: STARTER at 10.85.1.224 failed to receive file C:\condor\execute\dir_6624\PICN20.DLL

05/18/17 14:39:02 (pid:6624) File transfer failed (status=0).

05/18/17 14:39:02 (pid:6624) ERROR "Failed to transfer files" at line 2364 in file C:\condor\execute\dir_13584\sources\src\condor_starter.V6.1\jic_shadow.cpp

05/18/17 14:39:02 (pid:6624) ShutdownFast all jobs.

05/18/17 14:39:03 (pid:6624) condor_read() failed: recv(fd=1156) returned -1, errno = 10054 , reading 5 bytes from <10.85.1.216:57504>.

05/18/17 14:39:03 (pid:6624) IO: Failed to read packet header

05/18/17 14:39:03 (pid:6624) Lost connection to shadow, waiting 2400 secs for reconnect

 

05/18/17 14:44:21 (pid:2760) condor_read(): timeout reading 5 bytes from <10.85.1.216:9618>.

05/18/17 14:44:21 (pid:2760) Failed to receive filesize in ReliSock::get_file

05/18/17 14:44:21 (pid:2760) DoDownload: STARTER at 10.85.1.224 failed to receive file C:\condor\execute\dir_2760\iconv.dll

05/18/17 14:44:21 (pid:2760) File transfer failed (status=0).

05/18/17 14:44:21 (pid:2760) ERROR "Failed to transfer files" at line 2364 in file C:\condor\execute\dir_13584\sources\src\condor_starter.V6.1\jic_shadow.cpp

05/18/17 14:44:21 (pid:2760) ShutdownFast all jobs.

05/18/17 14:44:21 (pid:2760) condor_read() failed: recv(fd=1200) returned -1, errno = 10054 , reading 5 bytes from <10.85.1.216:58028>.

05/18/17 14:44:21 (pid:2760) IO: Failed to read packet header

05/18/17 14:44:21 (pid:2760) Lost connection to shadow, waiting 2400 secs for reconnect

05/18/17 14:44:21 (pid:2760) IO: Failed to read packet header

 

 

Meanwhile, the ShadowLog on the submit node shows the same file transfer failures throughout:

 

05/18/17 14:07:17 (4.34) (4588): ReliSock: put_file: TransmitFile() failed, errno=10022

05/18/17 14:07:17 (4.35) (2912): ReliSock: put_file: TransmitFile() failed, errno=10022

05/18/17 14:07:17 (4.35) (2912): condor_read() failed: recv(fd=552) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.224:51523>.

05/18/17 14:07:17 (4.35) (2912): IO: Failed to read packet header

05/18/17 14:07:17 (4.35) (2912): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.224:51523>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\MC3ADV.DLL

05/18/17 14:07:17 (4.34) (4588): condor_read() failed: recv(fd=420) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.224:51534>.

05/18/17 14:07:17 (4.34) (4588): IO: Failed to read packet header

05/18/17 14:07:17 (4.34) (4588): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.224:51534>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\MC3ADV.DLL

05/18/17 14:07:17 (4.35) (2912): ERROR "Error from slot4@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp

05/18/17 14:07:17 (4.34) (4588): ERROR "Error from slot1@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp

05/18/17 14:07:17 (4.36) (2292): ReliSock: put_file: TransmitFile() failed, errno=10022

05/18/17 14:07:17 (4.36) (2292): condor_read() failed: recv(fd=552) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.237:50293>.

05/18/17 14:07:17 (4.36) (2292): IO: Failed to read packet header

05/18/17 14:07:17 (4.36) (2292): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.237:50293>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\MC3ADV.DLL

05/18/17 14:07:17 (4.36) (2292): ERROR "Error from slot2@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo

 

05/18/17 14:08:29 (4.27) (3312): Request to run on slot3@xxxxxxxxxxxxxxxxxxxxxxx <10.85.1.237:49312?addrs=10.85.1.237-49312> was ACCEPTED

05/18/17 14:08:32 (4.38) (5560): Request to run on slot2@xxxxxxxxxxxxxxxxxxxxxxx <10.85.1.224:9618?addrs=10.85.1.224-9618&noUDP&sock=4204_6b8e_3> was ACCEPTED

05/18/17 14:08:56 (4.39) (5804): ReliSock: put_file: TransmitFile() failed, errno=10054

05/18/17 14:08:56 (4.33) (3684): ReliSock: put_file: TransmitFile() failed, errno=10054

05/18/17 14:08:56 (4.33) (3684): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.237:50376>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\libxml2.dll; STARTER at 10.85.1.237 failed to receive file C:\condor\execute\dir_2480\libxml2.dll

05/18/17 14:08:56 (4.35) (4132): ReliSock: put_file: TransmitFile() failed, errno=10022

05/18/17 14:08:56 (4.35) (4132): condor_read() failed: recv(fd=576) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.237:50389>.

05/18/17 14:08:56 (4.35) (4132): IO: Failed to read packet header

05/18/17 14:08:56 (4.35) (4132): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.237:50389>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\libxml2.dll

05/18/17 14:08:56 (4.39) (5804): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.237:50402>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\libxml2.dll; STARTER at 10.85.1.237 failed to receive file C:\condor\execute\dir_1296\libxml2.dll

05/18/17 14:08:56 (4.33) (3684): ERROR "Error from slot2@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp

05/18/17 14:08:56 (4.36) (2088): ReliSock: put_file: TransmitFile() failed, errno=10054

05/18/17 14:08:56 (4.36) (2088): condor_read() failed: recv(fd=556) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.224:51625>.

05/18/17 14:08:56 (4.36) (2088): IO: Failed to read packet header

05/18/17 14:08:56 (4.36) (2088): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.224:51625>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\libxml2.dll

05/18/17 14:08:56 (4.35) (4132): ERROR "Error from slot1@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp

05/18/17 14:08:56 (4.36) (2088): ERROR "Error from slot4@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp

05/18/17 14:08:56 (4.39) (5804): ERROR "Error from slot4@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp

05/18/17 14:08:57 ******************************************************

 

05/18/17 14:09:17 Initializing a VANILLA shadow for job 4.41

05/18/17 14:09:27 (4.41) (6140): Request to run on slot1@xxxxxxxxxxxxxxxxxxxxxxx <10.85.1.224:9618?addrs=10.85.1.224-9618&noUDP&sock=4204_6b8e_3> was ACCEPTED

05/18/17 14:09:28 (4.35) (5432): ReliSock: put_file: TransmitFile() failed, errno=10022

05/18/17 14:09:29 (4.35) (5432): condor_read() failed: recv(fd=552) returned -1, errno = 10053 , reading 5 bytes from <10.85.1.237:50431>.

05/18/17 14:09:29 (4.35) (5432): IO: Failed to read packet header

05/18/17 14:09:29 (4.35) (5432): DoUpload: SHADOW at 10.85.1.216 failed to send file(s) to <10.85.1.237:50431>: error sending \\lyta\tomodev\USER\Run_IC10.0.2_base\AlgoMammo.dll

05/18/17 14:09:29 (4.35) (5432): ERROR "Error from slot4@xxxxxxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 570 in file C:\condor\execute\dir_13584\sources\src\condor_shadow.V6.1\pseudo_ops.cpp

05/18/17 14:09:29 ******************************************************
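
For anyone reading these excerpts: the errno values are standard Winsock codes (10022 is WSAEINVAL, 10053 is WSAECONNABORTED, 10054 is WSAECONNRESET). Below is a minimal sketch, not part of Condor, of how the ShadowLog could be tallied by error code and by the file that failed to send. The function names `summarize` and `report` are hypothetical, and the regular expressions assume the log lines look like the ones quoted above.

```python
import re
from collections import Counter

# Standard Winsock error codes that appear in the log excerpts.
WINSOCK_NAMES = {
    10022: "WSAEINVAL (invalid argument)",
    10053: "WSAECONNABORTED (software caused connection abort)",
    10054: "WSAECONNRESET (connection reset by peer)",
}

# Matches both "errno=10022" and "errno = 10053 ," as seen in the logs.
ERRNO_RE = re.compile(r"errno\s*=\s*(\d+)")
# Matches the path that follows "error sending" in DoUpload lines.
SEND_RE = re.compile(r"error sending (\S+)")


def summarize(lines):
    """Count errno values and failed source files in an iterable of log lines."""
    errnos = Counter()
    files = Counter()
    for line in lines:
        m = ERRNO_RE.search(line)
        if m:
            errnos[int(m.group(1))] += 1
        m = SEND_RE.search(line)
        if m:
            # Some lines end the path with ';' before the starter's message.
            files[m.group(1).rstrip(";")] += 1
    return errnos, files


def report(lines):
    """Print the tallies, most frequent first."""
    errnos, files = summarize(lines)
    for code, count in errnos.most_common():
        print(f"{count:4d}  errno {code}  {WINSOCK_NAMES.get(code, 'unknown')}")
    for path, count in files.most_common():
        print(f"{count:4d}  {path}")
```

Assuming the log is saved locally as `ShadowLog`, calling `report(open("ShadowLog", errors="replace"))` would print the counts. This only summarizes what is already in the log and does not by itself identify the cause.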

 

 

The submitted jobs do eventually complete, but because of all the failures they finish much later than they would if everything had run without issue. This is happening to all of our users, who run similar but different versions of their application.

 

 

Any thoughts on what might be causing this, or on what we might do to troubleshoot?

 

 

 

Thank you,

 

Robert