Re: [HTCondor-users] condor glidein ccb condor_write() broken pipe



Hi Jason,

The root of the problem is in this message from your starter log:

07/10/13 08:49:13 condor_read(): timeout reading 65536 bytes from daemon at <152.19.197.180:40093>.

Judging from the timestamps in the log file, the timeout was 60 seconds.  This doesn't have anything to do with CCB.
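If you want to check that reading of the timestamps yourself, here is a minimal Python sketch. It is not part of HTCondor, and the helper name log_time is just something I made up for illustration. It assumes the log lines start with the default MM/DD/YY HH:MM:SS timestamp, and it compares the last progress line from the starter log with the timeout line:

```python
from datetime import datetime

# Timestamp layout at the start of each log line, e.g. "07/10/13 08:48:13".
LOG_TIME_FORMAT = "%m/%d/%y %H:%M:%S"


def log_time(line):
    """Parse the leading 17-character timestamp of a daemon log line (hypothetical helper)."""
    return datetime.strptime(line[:17], LOG_TIME_FORMAT)


# Two lines copied from the starter log quoted in this thread.
start = "07/10/13 08:48:13 get_file: Receiving 3267697542 bytes"
end = ("07/10/13 08:49:13 condor_read(): timeout reading 65536 bytes "
       "from daemon at <152.19.197.180:40093>.")

elapsed = (log_time(end) - log_time(start)).total_seconds()
print(f"{elapsed:.0f} seconds between the last progress line and the timeout")
```

The same comparison works on any pair of lines from the shadow or starter logs, as long as they use this timestamp format.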

You could increase the timeout by setting something like

STARTER_TIMEOUT_MULTIPLIER=5

However, 60 seconds is a long time to transmit 65536 bytes.  Is your submit node maxing out its network or disk bandwidth?  In HTCondor 8.0, there are some attributes in the schedd ClassAd that monitor bandwidth usage by file transfer:

condor_status -schedd -l | grep BytesPerSecond

If things other than HTCondor file transfer are using bandwidth on the submit machine, you will need to look at general system statistics to see the effect of those.

Of course, the submit node isn't the only place where a bottleneck might appear.  The site where the glideins are running could also be maxed out.

--Dan

On 7/10/13 8:27 AM, Jason wrote:
Hi all,

I am using Condor glideins with CCB and am experiencing a problem where a partial file transfer occurs but then fails with the following on the central-manager side:

07/10/13 09:04:11 DaemonCore: command socket at <152.19.197.180:40872?noUDP>
07/10/13 09:04:11 DaemonCore: private command socket at <152.19.197.180:40872>
07/10/13 09:04:11 Setting maximum accepts per cycle 4.
07/10/13 09:04:11 Initializing a VANILLA shadow for job 598057.0
07/10/13 09:04:11 (598042.0) (14010): condor_write() failed: send() 65536 bytes to <152.54.2.30:40808> returned -1, timeout=0, errno=32 Broken pipe.
07/10/13 09:04:11 (598042.0) (14010): ReliSock::put_bytes_nobuffer: Send failed.
07/10/13 09:04:11 (598042.0) (14010): ReliSock::put_file: failed to put 65536 bytes (put_bytes_nobuffer() returned -1)
07/10/13 09:04:11 (598042.0) (14010): DoUpload: SHADOW at 152.19.197.180 failed to send file(s) to <152.54.2.30:40808>: error sending /proj/seq/mapseq/RENCI/130508_UNC16-SN851_0242_BC241KACXX/NIDAUCSF/061210Sm/130508_UNC16-SN851_0242_BC\
241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam; STARTER at 10.9.15.247 failed to receive file /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_29411/130508_UNC16-SN851_0242_BC241KACXX_\
GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 09:04:11 (598042.0) (14010): ERROR "Error from slot1@xxxxxxxxxxxxxxxxx: Failed to transfer files" at line 676 in file /home/condor/execute/dir_15857/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp


Here is what I see on the compute node side:

07/10/13 08:48:12 entering FileTransfer::DoDownload sync=0
07/10/13 08:48:13 REMAP: begin with rules:
07/10/13 08:48:13 REMAP: 0: 130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 08:48:13 REMAP: res is 0 ->  !
07/10/13 08:48:13 Sending GoAhead for 152.19.197.180 to send /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr\
.bam and all further files.
07/10/13 08:48:13 Received GoAhead from peer to receive /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam.
07/10/13 08:48:13 get_file(): going to write to filename /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped.realign.fix.pr.bam
07/10/13 08:48:13 get_file: Receiving 3267697542 bytes
07/10/13 08:49:13 condor_read(): timeout reading 65536 bytes from daemon at <152.19.197.180:40093>.
07/10/13 08:49:13 ReliSock::get_bytes_nobuffer: Failed to receive file.
07/10/13 08:49:13 get_file: wrote 58589184 bytes to file
07/10/13 08:49:13 get_file(): ERROR: received 58589184 bytes, expected 3267697542!
07/10/13 08:49:13 DoDownload: STARTER at 10.9.15.247 failed to receive file /projects/mapseq/jlrm/glideins/2013-07-09/6e0fc74f-9ba2-4a95-80fc-8c0276cad418/execute/dir_28735/130508_UNC16-SN851_0242_BC241KACXX_GTGGCC_L004.fixed-rg.deduped\
.realign.fix.pr.bam
07/10/13 08:49:13 DoDownload: exiting at 2213


On the compute node side, I have the following in the condor_config.local:

HIGHPORT=41000
LOWPORT=40000

WANT_UDP_COMMAND_SOCKET=False
UPDATE_COLLECTOR_WITH_TCP=True

USE_CCB="True"
CCB_ADDRESS=$(COLLECTOR_HOST)
PRIVATE_NETWORK_NAME = $(FULL_HOSTNAME)


I am assuming that I have the configuration set up correctly as I am getting a partial download, but something is causing the socket connection to hang/timeout/fail.  Any suggestions as to how I can find what is causing the "Broken pipe"?

Thanks,
Jason



