
Re: [Condor-users] spool to execute directory file transfers fail because of sparse and thin vm disks



Johnson koil Raj wrote:
> Hi,
>     When the Starter tries to transfer a file from the spool directory, it
> is not able to transfer the sparse file correctly. It exits with a socket
> error, although the network seems to be fine. Is this because Condor is not
> able to differentiate sparse, thin, and normal files correctly?
> 
> The Starter tried many times, but each time it exited with the following
> error; it was never able to transfer the file successfully.
> 
> STARTERLog
> 5/25 19:11:27 get_file(): going to write to
> filename /vmfs/volumes/1107ffff-0ea6c919/execute/cloudesx2/dir_25659/vmBIbP33_condor-2bd4e3bf.vmss
> 5/25 19:11:27 get_file: Receiving 4296162111 bytes
> 5/25 19:18:01 DaemonCore: in SendAliveToParent()
> 5/25 19:18:01 DaemonCore: Leaving SendAliveToParent() - success
> 5/25 19:19:32 condor_read(): Socket closed when trying to read 65536
> bytes from <192.168.10.7:9621>
> 5/25 19:19:32 ReliSock::get_bytes_nobuffer: Failed to receive file.
> 5/25 19:19:32 get_file: wrote 2494562304 bytes to file
> 5/25 19:19:32 get_file(): ERROR: received 2494562304 bytes, expected
> 4296162111!
> 5/25 19:19:32 DoDownload: STARTER at 192.168.10.254 failed to receive
> file /vmfs/volumes/1107ffff-0ea6c919/execute/cloudesx2/dir_25659/vmBIbP33_condor-2bd4e3bf.vmss
> 5/25 19:19:32 condor_write(): Socket closed when trying to write 249
> bytes to <192.168.10.7:9621>, fd is 8
> 5/25 19:19:32 Buf::write(): condor_write() failed
> 5/25 19:19:32 Failed to send download failure report to
> <192.168.10.7:9621>.
> 5/25 19:19:32 DoDownload: exiting at 1743
> 5/25 19:19:32 DaemonCore: No more children processes to reap.
> 5/25 19:19:32 File transfer failed (status=0).
> 5/25 19:19:32 Calling client FileTransfer handler function.
> 5/25 19:19:32 ERROR "Failed to transfer files" at line 1780 in file
> jic_shadow.cpp
> 5/25 19:19:32 condor_write(): Socket closed when trying to write 165
> bytes to <192.168.10.7:9731>, fd is 10
> 5/25 19:19:32 Buf::write(): condor_write() failed
> 5/25 19:19:32 ERROR "Assertion ERROR on (result)" at line 875 in file
> NTsenders.cpp
> 5/25 19:19:32 Deleting the StarterHookMgr
> 
> Some more analysis: the file size reported by ls, shown below, is the
> size Condor uses when the transfer starts.
> [root@cloudesx2 cloudesx2]# ls
> -l /vmfs/volumes/nfs2/spool/cluster124.proc0.subproc0/vmBIbP33_condor-2bd4e3bf.vmss
> -rwxrwxrwx    1 root     root     4296162111 May 25
> 17:40 /vmfs/volumes/nfs2/spool/cluster124.proc0.subproc0/vmBIbP33_condor-2bd4e3bf.vmss
> 
> The file size reported by du is shown below. Condor seems to take this
> size after the transfer is done.
> [root@cloudesx2 root]#
> du /vmfs/volumes/nfs2/spool/cluster124.proc0.subproc0/vmBIbP33_condor-2bd4e3bf.vmss
> 245363	/vmfs/volumes/nfs2/spool/cluster124.proc0.subproc0/vmBIbP33_condor-2bd4e3bf.vmss
> 
> [root@cloudesx2 cloudesx2]# du
> -h /vmfs/volumes/nfs2/spool/cluster124.proc0.subproc0/vmBIbP33_condor-2bd4e3bf.vmss
> 240M	/vmfs/volumes/nfs2/spool/cluster124.proc0.subproc0/vmBIbP33_condor-2bd4e3bf.vmss
> 
> by
> Johnson

So you have a 4GB file and transfer is failing after 2GB, but only 200MB
are making it to disk?
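
For reference, the ls and du figures quoted above measure different things: ls -l shows the file's length, while du shows the blocks actually allocated, which is much smaller for a sparse file. Here is a minimal Python sketch of that comparison. The helper names size_report and is_sparse are made up for illustration, and the arithmetic assumes du printed 1 KiB blocks, which is consistent with the 240M from du -h:

```python
import os


def size_report(path):
    """Return (apparent_bytes, allocated_bytes) for path.

    apparent_bytes is the length that `ls -l` prints;
    allocated_bytes is the space that `du` counts.
    """
    st = os.stat(path)
    # On Linux, st_blocks counts 512-byte units regardless of the
    # filesystem's own block size. The field does not exist on Windows.
    return st.st_size, st.st_blocks * 512


def is_sparse(apparent_bytes, allocated_bytes):
    """Fewer allocated bytes than the file length usually means holes."""
    return allocated_bytes < apparent_bytes


# Figures from the post: the ls -l length, and the du output taken to be
# in 1 KiB blocks (an assumption, see above).
apparent = 4296162111
allocated = 245363 * 1024

print("sparse:", is_sparse(apparent, allocated))
print("allocated share: %.1f%%" % (100.0 * allocated / apparent))
```

Running size_report on both the spool copy and the partially written copy in the execute directory would show whether the destination is being written out in full rather than sparsely. This only inspects file metadata; it says nothing about how Condor itself handles holes during transfer.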

The other place to be looking for errors is the log for the other side
of the transfer.

You've posted this a few times. Is this a transient issue or is it
easily reproduced?

Best,


matt