
Re: [HTCondor-users] Globus error 129 with large files



On 11.3.2016. 21:39, Brian Bockelman wrote:
> Is it possible that the threshold between working and not working is either 2.1GB (about 2^31) or 4.2GB (about 2^32)?  That would help narrow down the potential sources of error.

I performed the following tests:

The 2^31 test worked:
#!/bin/sh
dd if=/dev/zero of=./testmonkey bs=1M count=2048

The 2^32 test also worked:
#!/bin/sh
dd if=/dev/zero of=./testmonkey bs=1M count=4096

In both cases the generated file was successfully transferred back to the Condor-G submit machine.

It seems to me that the problems start at 2^33:
#!/bin/sh
dd if=/dev/zero of=./testmonkey bs=1M count=8192

The job ended successfully, but only 719 MB was transferred back.
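For anyone trying to reproduce this, a quick way to confirm how much of the file actually came back is to compare size and checksum on both ends (a minimal sketch; it assumes the testmonkey file name from my tests and has to be run on the worker node and on the submit machine separately):

#!/bin/sh
# Print the file size in bytes and an MD5 checksum.
# Run on both the worker node and the submit machine and compare the output.
stat -c %s ./testmonkey
md5sum ./testmonkey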


With 2^34 things get more complicated. The job ended and the transfer back started, but then gahp_server on the UI side started devouring memory until the OOM killer killed it:

Mar 11 22:31:25 ui2 kernel: Out of memory: Kill process 1150031 (gahp_server) score 911 or sacrifice child
Mar 11 22:31:25 ui2 kernel: Killed process 1150031, UID 500, (gahp_server) total-vm:9569828kB, anon-rss:7495604kB, file-rss:544kB

The interesting bit is that the job did not end up in the H state; instead, Condor restarted gahp_server and the OOM killer killed it again. This continued until I removed the job.
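In case it helps with debugging, this is roughly how I could watch the memory growth on the submit side during the transfer-back phase (a rough sketch, assuming the process is named gahp_server as in the OOM log above):

#!/bin/sh
# Print the PID, resident set size (KiB) and command line of every
# gahp_server process once per second, so the memory growth is visible
# before the OOM killer steps in.
while true; do
    ps -C gahp_server -o pid=,rss=,cmd=
    sleep 1
done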

I forgot to mention that we're running CentOS 6 on both the CE and the submit machine.

Hope this helps
--
Emir Imamagic
SRCE - University of Zagreb University Computing Centre, www.srce.unizg.hr
Emir.Imamagic@xxxxxxx, tel: +385 1 616 5809, fax: +385 1 616 5559