The root of the problem is in this message from your starter log:
07/10/13 08:49:13 condor_read(): timeout
reading 65536 bytes from daemon at <126.96.36.199:40093>.
Judging from the timestamps in the log file, the timeout was 60
seconds. This doesn't have anything to do with CCB.
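For reference, here is a rough sketch of how that 60 second figure can be read off the starter log. It only uses the two timestamped lines and the byte count quoted in this thread. The helper name elapsed_seconds is made up for illustration, and the slice assumes the "MM/DD/YY HH:MM:SS" prefix seen in these logs.

```python
from datetime import datetime

# Timestamp layout at the start of each log line, e.g. "07/10/13 08:49:13"
# (assumed from the lines quoted in this thread).
FMT = "%m/%d/%y %H:%M:%S"

def elapsed_seconds(start_line, end_line):
    """Seconds between the timestamps that open two log lines (hypothetical helper)."""
    start = datetime.strptime(start_line[:17], FMT)
    end = datetime.strptime(end_line[:17], FMT)
    return (end - start).total_seconds()

# Lines copied from the starter log quoted in this thread.
start = "07/10/13 08:48:13 get_file: Receiving 3267697542 bytes"
end = "07/10/13 08:49:13 condor_read(): timeout reading 65536 bytes from"

gap = elapsed_seconds(start, end)

# Bytes the starter reported writing before the failure, averaged over that gap.
avg_rate = 58589184 / gap

print(f"elapsed: {gap:.0f} s, average rate: {avg_rate:.0f} bytes/s")
```

The average rate is only a crude figure, since the log does not show when within that minute the data stopped arriving.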
You could increase the timeout by setting something like
However, 60 seconds is a long time to transmit 65536 bytes. Is
your submit node maxing out its network or disk bandwidth? In
HTCondor 8.0, there are some attributes in the schedd ClassAd that
monitor bandwidth usage by file transfer:
condor_status -schedd -l | grep
If things other than HTCondor file transfer are using bandwidth on
the submit machine, you will need to look at general system
statistics to see the effect of those.
Of course, the submit node isn't the only place where a bottleneck
might appear. The site where the glideins are running could also be
On 7/10/13 8:27 AM, Jason wrote:
I am using Condor Glideins with CCB & am experiencing a
problem where a partial file transfer occurs, but then fails
with the following on the central-manager side:
07/10/13 09:04:11 DaemonCore: command socket at
07/10/13 09:04:11 DaemonCore: private command socket at
07/10/13 09:04:11 Setting maximum accepts per cycle 4.
07/10/13 09:04:11 Initializing a VANILLA shadow for job 598057.0
07/10/13 09:04:11 (598042.0) (14010): condor_write() failed:
send() 65536 bytes to <188.8.131.52:40808> returned -1,
timeout=0, errno=32 Broken pipe.
07/10/13 09:04:11 (598042.0) (14010):
ReliSock::put_bytes_nobuffer: Send failed.
07/10/13 09:04:11 (598042.0) (14010): ReliSock::put_file: failed
to put 65536 bytes (put_bytes_nobuffer() returned -1)
07/10/13 09:04:11 (598042.0) (14010): DoUpload: SHADOW at
184.108.40.206 failed to send file(s) to
<220.127.116.11:40808>: error sending
at 10.9.15.247 failed to receive file
07/10/13 09:04:11 (598042.0) (14010): ERROR "Error from
slot1@xxxxxxxxxxxxxxxxx: Failed to transfer files" at line 676 in
Here is what I see on the compute node side:
07/10/13 08:48:12 entering FileTransfer::DoDownload sync=0
07/10/13 08:48:13 REMAP: begin with rules:
07/10/13 08:48:13 REMAP: 0:
07/10/13 08:48:13 REMAP: res is 0 -> !
07/10/13 08:48:13 Sending GoAhead for 18.104.22.168 to send
.bam and all further files.
07/10/13 08:48:13 Received GoAhead from peer to receive
07/10/13 08:48:13 get_file(): going to write to filename
07/10/13 08:48:13 get_file: Receiving 3267697542 bytes
07/10/13 08:49:13 condor_read(): timeout reading 65536 bytes from
daemon at <22.214.171.124:40093>.
07/10/13 08:49:13 ReliSock::get_bytes_nobuffer: Failed to receive
07/10/13 08:49:13 get_file: wrote 58589184 bytes to file
07/10/13 08:49:13 get_file(): ERROR: received 58589184 bytes,
07/10/13 08:49:13 DoDownload: STARTER at 10.9.15.247 failed to
07/10/13 08:49:13 DoDownload: exiting at 2213
On the compute node side, I have the following in the
PRIVATE_NETWORK_NAME = $(FULL_HOSTNAME)
I am assuming that I have the configuration set up correctly as I
am getting a partial download, but something is causing the socket
connection to hang/timeout/fail. Any suggestions as to how I can
find what is causing the "Broken pipe"?