[HTCondor-users] condor glidein ccb condor_write() broken pipe
- Date: Wed, 10 Jul 2013 09:27:37 -0400
- From: Jason <jdr0887@xxxxxxxxx>
- Subject: [HTCondor-users] condor glidein ccb condor_write() broken pipe
Hi all,
I am using Condor glideins with CCB and am experiencing a problem where
a partial file transfer occurs but then fails, with the following on
the central-manager side:
07/10/13 09:04:11 DaemonCore: command socket at <152.19.197.180:40872?noUDP>
07/10/13 09:04:11 DaemonCore: private command socket at <152.19.197.180:40872>
07/10/13 09:04:11 Setting maximum accepts per cycle 4.
07/10/13 09:04:11 Initializing a VANILLA shadow for job 598057.0
07/10/13 09:04:11 (598042.0) (14010): condor_write() failed: send() 65536 bytes to <152.54.2.30:40808> returned -1, timeout=0, errno=32 Broken pipe.
07/10/13 09:04:11 (598042.0) (14010): ReliSock::put_bytes_nobuffer: Send failed.
07/10/13 09:04:11 (598042.0) (14010): ReliSock::put_file: failed to put 65536 bytes (put_bytes_nobuffer() returned -1)
07/10/13 09:04:11 (598042.0) (14010): DoUpload: SHADOW at 152.19.197.180 failed to send file(s) to <152.54.2.30:40808>: error sending /proj/seq/mapseq/RENCI/130508_UNC16-SN851_0242_BC241KACXX/NIDAUCSF/061210Sm/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam; STARTER at 10.9.15.247 failed to receive file /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_29411/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 09:04:11 (598042.0) (14010): ERROR "Error from slot1@xxxxxxxxxxxxxxxxx: Failed to transfer files" at line 676 in file /home/condor/execute/dir_15857/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp
Here is what I see on the compute node side:
07/10/13 08:48:12 entering FileTransfer::DoDownload sync=0
07/10/13 08:48:13 REMAP: begin with rules:
07/10/13 08:48:13 REMAP: 0: 130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 08:48:13 REMAP: res is 0 -> !
07/10/13 08:48:13 Sending GoAhead for 152.19.197.180 to send /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam and all further files.
07/10/13 08:48:13 Received GoAhead from peer to receive /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam.
07/10/13 08:48:13 get_file(): going to write to filename /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 08:48:13 get_file: Receiving 3267697542 bytes
07/10/13 08:49:13 condor_read(): timeout reading 65536 bytes from daemon at <152.19.197.180:40093>.
07/10/13 08:49:13 ReliSock::get_bytes_nobuffer: Failed to receive file.
07/10/13 08:49:13 get_file: wrote 58589184 bytes to file
07/10/13 08:49:13 get_file(): ERROR: received 58589184 bytes, expected 3267697542!
07/10/13 08:49:13 DoDownload: STARTER at 10.9.15.247 failed to receive file /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 08:49:13 DoDownload: exiting at 2213
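As a rough sanity check on the numbers in the compute node log above, here is a small Python sketch of how much of the file arrived before the read timed out. The byte counts are copied from the log. The 60-second window is an assumption read off the timestamps (08:48:13 to 08:49:13); the data may have stopped arriving earlier, so the rate is only a lower bound on the average.

```python
# Figures copied from the starter log above.
expected_bytes = 3267697542  # "get_file: Receiving 3267697542 bytes"
received_bytes = 58589184    # "get_file: wrote 58589184 bytes to file"
chunk_bytes = 65536          # size of each read/send reported in both logs

# Assumption: the transfer ran for the whole window between the
# "Receiving" line (08:48:13) and the timeout line (08:49:13).
elapsed_seconds = 60

# Share of the file that made it across before the timeout.
fraction = received_bytes / expected_bytes

# Whole 64 KiB chunks received, and any partial chunk left over.
chunks, remainder = divmod(received_bytes, chunk_bytes)

# Average rate over the assumed window, in MiB per second.
rate_mib_s = received_bytes / elapsed_seconds / 2**20

print(f"received {fraction:.2%} of the file")
print(f"{chunks} full chunks, {remainder} bytes left over")
print(f"average rate of at least {rate_mib_s:.2f} MiB/s")
```

The received byte count is a whole number of 65536-byte chunks, which suggests the stream stopped on a chunk boundary rather than in the middle of a read.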
On the compute node side, I have the following in the condor_config.local:
HIGHPORT=41000
LOWPORT=40000
WANT_UDP_COMMAND_SOCKET=False
UPDATE_COLLECTOR_WITH_TCP=True
USE_CCB="True"
CCB_ADDRESS=$(COLLECTOR_HOST)
PRIVATE_NETWORK_NAME = $(FULL_HOSTNAME)
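To double-check the port range against the logs, here is a minimal Python sketch that tests whether the ports appearing in the log lines above fall between LOWPORT and HIGHPORT. The port numbers are copied from the logs and the labels are my own descriptions. Whether the same range is also configured on the 152.19.197.180 side is an assumption; only the compute node configuration is shown here.

```python
# Port range from the compute node's condor_config.local.
LOWPORT = 40000
HIGHPORT = 41000

# Ports copied from the log lines above; the labels are descriptive only.
observed_ports = {
    "shadow command socket <152.19.197.180:40872>": 40872,
    "address the shadow was sending to <152.54.2.30:40808>": 40808,
    "daemon the starter was reading from <152.19.197.180:40093>": 40093,
}

# Collect any port that falls outside the inclusive LOWPORT..HIGHPORT range.
out_of_range = {
    name: port
    for name, port in observed_ports.items()
    if not LOWPORT <= port <= HIGHPORT
}

# Number of ports available in the inclusive range.
range_size = HIGHPORT - LOWPORT + 1

print(f"{range_size} ports in range")
print("out of range:", out_of_range or "none")
```

All three ports fall inside the configured range, so a firewall rule would need to allow that whole range for the transfer connection to survive.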
I am assuming that I have the configuration set up correctly as I am
getting a partial download, but something is causing the socket
connection to hang/timeout/fail. Any suggestions as to how I can find
what is causing the "Broken pipe"?
Thanks,
Jason