
Re: [HTCondor-users] condor glidein ccb condor_write() broken pipe

Hi Jason,

The root of the problem is in this message from your starter log:

07/10/13 08:49:13 condor_read(): timeout reading 65536 bytes from daemon at <>.

Judging from the timestamps in the log file, the timeout was 60 seconds.  This doesn't have anything to do with CCB.
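For reference, the 60-second figure can be checked by subtracting the timestamps of two lines quoted from the starter log further down. This is only a small sketch: the helper names `log_time` and `seconds_between` are made up for illustration, and it assumes every log line starts with a `MM/DD/YY HH:MM:SS` stamp as in the excerpts here.

```python
from datetime import datetime

# Timestamp prefix as it appears in the quoted logs (MM/DD/YY HH:MM:SS).
STAMP_FORMAT = "%m/%d/%y %H:%M:%S"
STAMP_LENGTH = len("07/10/13 08:49:13")


def log_time(line):
    """Parse the leading timestamp of one daemon log line."""
    return datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)


def seconds_between(first_line, second_line):
    """Return the number of seconds from first_line to second_line."""
    return (log_time(second_line) - log_time(first_line)).total_seconds()


# Two lines copied from the starter log quoted in this thread.
start = "07/10/13 08:48:13 get_file: Receiving 3267697542 bytes"
timeout = "07/10/13 08:49:13 condor_read(): timeout reading 65536 bytes from daemon at <>."

print(seconds_between(start, timeout))  # 60.0
```

The starter log carries no lines between those two, so one minute passed between the start of the transfer and the read timeout.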

You could increase the timeout by setting something like


However, 60 seconds is a long time to transmit 65536 bytes.  Is your submit node maxing out its network or disk bandwidth?  In HTCondor 8.0, there are some attributes in the schedd ClassAd that monitor bandwidth usage by file transfer:

condor_status -schedd -l | grep BytesPerSecond
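If you want to track those numbers over time rather than eyeball them, the same filtering can be done in a script. The sketch below is only an illustration: `bandwidth_attributes` is a made-up helper, it assumes the long-format output is one `Name = value` pair per line, and the attribute names in the sample text are placeholders, not real schedd attributes.

```python
def bandwidth_attributes(classad_text):
    """Return {attribute: value} for numeric attributes whose name contains 'BytesPerSecond'."""
    found = {}
    for line in classad_text.splitlines():
        name, sep, value = line.partition("=")
        if not sep or "BytesPerSecond" not in name:
            continue
        try:
            found[name.strip()] = float(value.strip())
        except ValueError:
            # Skip attributes whose value is not a plain number.
            pass
    return found


# Placeholder text standing in for the output of `condor_status -schedd -l`.
# The attribute names below are hypothetical.
sample = """Name = "submit.example.org"
ExampleUploadBytesPerSecond_1m = 1048576.0
ExampleDownloadBytesPerSecond_1m = 0.0
"""

for attribute, rate in bandwidth_attributes(sample).items():
    print(attribute, rate)
```

In practice you would feed it the real command output, for example by piping it to the script and reading standard input.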

If things other than HTCondor file transfer are using bandwidth on the submit machine, you will need to look at general system statistics to see the effect of those.

Of course, the submit node isn't the only place where a bottleneck might appear.  The site where the glideins are running could also be maxed out.


On 7/10/13 8:27 AM, Jason wrote:
Hi all,

I am using Condor Glideins with CCB and am experiencing a problem where a partial file transfer occurs, but then fails with the following on the central-manager side:

07/10/13 09:04:11 DaemonCore: command socket at <>
07/10/13 09:04:11 DaemonCore: private command socket at <>
07/10/13 09:04:11 Setting maximum accepts per cycle 4.
07/10/13 09:04:11 Initializing a VANILLA shadow for job 598057.0
07/10/13 09:04:11 (598042.0) (14010): condor_write() failed: send() 65536 bytes to <> returned -1, timeout=0, errno=32 Broken pipe.
07/10/13 09:04:11 (598042.0) (14010): ReliSock::put_bytes_nobuffer: Send failed.
07/10/13 09:04:11 (598042.0) (14010): ReliSock::put_file: failed to put 65536 bytes (put_bytes_nobuffer() returned -1)
07/10/13 09:04:11 (598042.0) (14010): DoUpload: SHADOW at failed to send file(s) to <>: error sending /proj/seq/mapseq/RENCI/130508_UNC16-SN851_0242_BC241KACXX/NIDAUCSF/061210Sm/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam; STARTER at failed to receive file /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_29411/130508_UNC16-SN851_0242_BC241KACXX_\
07/10/13 09:04:11 (598042.0) (14010): ERROR "Error from slot1@xxxxxxxxxxxxxxxxx: Failed to transfer files" at line 676 in file /home/condor/execute/dir_15857/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp

Here is what I see on the compute node side:

07/10/13 08:48:12 entering FileTransfer::DoDownload sync=0
07/10/13 08:48:13 REMAP: begin with rules:
07/10/13 08:48:13 REMAP: 0: 130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 08:48:13 REMAP: res is 0 ->  !
07/10/13 08:48:13 Sending GoAhead for to send /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam and all further files.
07/10/13 08:48:13 Received GoAhead from peer to receive /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam.
07/10/13 08:48:13 get_file(): going to write to filename /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 08:48:13 get_file: Receiving 3267697542 bytes
07/10/13 08:49:13 condor_read(): timeout reading 65536 bytes from daemon at <>.
07/10/13 08:49:13 ReliSock::get_bytes_nobuffer: Failed to receive file.
07/10/13 08:49:13 get_file: wrote 58589184 bytes to file
07/10/13 08:49:13 get_file(): ERROR: received 58589184 bytes, expected 3267697542!
07/10/13 08:49:13 DoDownload: STARTER at failed to receive file /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped\
07/10/13 08:49:13 DoDownload: exiting at 2213

On the compute node side, I have the following in the condor_config.local:




I am assuming that I have the configuration set up correctly as I am getting a partial download, but something is causing the socket connection to hang/timeout/fail.  Any suggestions as to how I can find what is causing the "Broken pipe"?

