
[HTCondor-users] condor glidein ccb condor_write() broken pipe



Hi all,

I am using Condor glideins with CCB and am running into a problem where a file transfer starts but fails partway through, with the following on the central-manager side:

07/10/13 09:04:11 DaemonCore: command socket at <152.19.197.180:40872?noUDP>
07/10/13 09:04:11 DaemonCore: private command socket at <152.19.197.180:40872>
07/10/13 09:04:11 Setting maximum accepts per cycle 4.
07/10/13 09:04:11 Initializing a VANILLA shadow for job 598057.0
07/10/13 09:04:11 (598042.0) (14010): condor_write() failed: send() 65536 bytes to <152.54.2.30:40808> returned -1, timeout=0, errno=32 Broken pipe.
07/10/13 09:04:11 (598042.0) (14010): ReliSock::put_bytes_nobuffer: Send failed.
07/10/13 09:04:11 (598042.0) (14010): ReliSock::put_file: failed to put 65536 bytes (put_bytes_nobuffer() returned -1)
07/10/13 09:04:11 (598042.0) (14010): DoUpload: SHADOW at 152.19.197.180 failed to send file(s) to <152.54.2.30:40808>: error sending /proj/seq/mapseq/RENCI/130508_UNC16-SN851_0242_BC241KACXX/NIDAUCSF/061210Sm/130508_UNC16-SN851_0242_BC\
241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam; STARTER at 10.9.15.247 failed to receive file /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_29411/130508_UNC16-SN851_0242_BC241KACXX_\
GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 09:04:11 (598042.0) (14010): ERROR "Error from slot1@xxxxxxxxxxxxxxxxx: Failed to transfer files" at line 676 in file /home/condor/execute/dir_15857/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp


Here is what I see on the compute node side:

07/10/13 08:48:12 entering FileTransfer::DoDownload sync=0
07/10/13 08:48:13 REMAP: begin with rules:
07/10/13 08:48:13 REMAP: 0: 130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 08:48:13 REMAP: res is 0 ->  !
07/10/13 08:48:13 Sending GoAhead for 152.19.197.180 to send /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr\
.bam and all further files.
07/10/13 08:48:13 Received GoAhead from peer to receive /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam.
07/10/13 08:48:13 get_file(): going to write to filename /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 08:48:13 get_file: Receiving 3267697542 bytes
07/10/13 08:49:13 condor_read(): timeout reading 65536 bytes from daemon at <152.19.197.180:40093>.
07/10/13 08:49:13 ReliSock::get_bytes_nobuffer: Failed to receive file.
07/10/13 08:49:13 get_file: wrote 58589184 bytes to file
07/10/13 08:49:13 get_file(): ERROR: received 58589184 bytes, expected 3267697542!
07/10/13 08:49:13 DoDownload: STARTER at 10.9.15.247 failed to receive file /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped\
.realign.fix.pr.bam
07/10/13 08:49:13 DoDownload: exiting at 2213
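
In case it helps with reading the compute-node log above, here is a rough Python sketch of the arithmetic on the numbers it reports. The constants are copied from the log lines; the function name `summarize` and the dictionary keys are just illustrative names, not anything from HTCondor.

```python
from datetime import datetime

CHUNK = 65536            # bytes per read, from the condor_read() timeout line
EXPECTED = 3267697542    # from "get_file: Receiving 3267697542 bytes"
RECEIVED = 58589184      # from "get_file: wrote 58589184 bytes to file"

def summarize(start: str, end: str) -> dict:
    """Summarize the partial transfer between two log timestamps."""
    fmt = "%m/%d/%y %H:%M:%S"  # timestamp layout used in the logs above
    elapsed = (datetime.strptime(end, fmt)
               - datetime.strptime(start, fmt)).total_seconds()
    return {
        "elapsed_s": elapsed,
        "full_chunks": RECEIVED // CHUNK,
        "leftover_bytes": RECEIVED % CHUNK,
        "percent_received": round(100 * RECEIVED / EXPECTED, 2),
    }

# Timestamps of "get_file: Receiving ..." and the condor_read() timeout.
summary = summarize("07/10/13 08:48:13", "07/10/13 08:49:13")
print(summary)
```

By this arithmetic the starter got a whole number of 65536-byte chunks, under 2% of the file, and gave up exactly 60 seconds after the transfer began. That looks to me like the data stopped arriving rather than the link just being slow, but that is only my reading of the log.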


On the compute node side, I have the following in the condor_config.local:

HIGHPORT=41000
LOWPORT=40000

WANT_UDP_COMMAND_SOCKET=False
UPDATE_COLLECTOR_WITH_TCP=True

USE_CCB="True"
CCB_ADDRESS=$(COLLECTOR_HOST)
PRIVATE_NETWORK_NAME = $(FULL_HOSTNAME)


I assume the configuration is basically correct, since part of the file does get downloaded, but something is causing the socket connection to hang, time out, or fail. Any suggestions on how I can find out what is causing the "Broken pipe"?

Thanks,
Jason