
Re: [HTCondor-users] Globus error 129 with large files



On Mar 11, 2016, at 3:33 PM, Emir Imamagic <eimamagi@xxxxxxx> wrote:
> 
> On 11.3.2016. 21:39, Brian Bockelman wrote:
>> Is it possible that the threshold between working and not working is either 2.1GB (about 2^31) or 4.2GB (about 2^32)?  That would help narrow down the potential sources of error.
> 
> I performed the following tests:
> 
> The 2^31 test worked:
> #!/bin/sh
> dd if=/dev/zero of=./testmonkey bs=1M count=2048
> 
> The 2^32 test also worked:
> #!/bin/sh
> dd if=/dev/zero of=./testmonkey bs=1M count=4096
> 
> In both cases the generated file was successfully transferred back to the Condor-G submit machine.
> 
> Seems to me that problems start with 2^33:
> #!/bin/sh
> dd if=/dev/zero of=./testmonkey bs=1M count=8192
> 
> Job ended up successful, but only 719M was transferred back.
> 
> 
> With 2^34 things get more complicated. The job ended, the transfer back started, and then gahp_server on the UI side started devouring memory until the OOM killer killed it:
> Mar 11 22:31:25 ui2 kernel: Out of memory: Kill process 1150031 (gahp_server) score 911 or sacrifice child
> Mar 11 22:31:25 ui2 kernel: Killed process 1150031, UID 500, (gahp_server) total-vm:9569828kB, anon-rss:7495604kB, file-rss:544kB
> 
> The interesting bit is that the job did not end up in the H state; instead, condor revived gahp_server and the OOM killer killed it again. This continued until I deleted the job.
> 
> I failed to mention we're running CentOS 6 on both the CE and the submit machine.
> 
> Hope this helps

The HTCondor gridmanager checks the size of stdout and stderr as 32-bit values. If the GRAM jobmanager does the same, then the checks may not handle file sizes larger than 2GB or 4GB correctly.
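
As a rough illustration of why a 32-bit size field matters here, the Python sketch below shows what the byte counts from the dd tests above would look like if they were stored in a signed or unsigned 32-bit integer. This is not the gridmanager or jobmanager code; the helper names as_int32 and as_uint32 are made up for the example, and it only demonstrates the wraparound arithmetic, not the 719M result seen in the 2^33 test.

```python
MIB = 1024 * 1024  # dd's bs=1M block size in bytes


def as_uint32(size_bytes):
    """Value left over if the size is stored in an unsigned 32-bit field."""
    return size_bytes % 2**32


def as_int32(size_bytes):
    """Value left over if the size is stored in a signed 32-bit field."""
    return (size_bytes + 2**31) % 2**32 - 2**31


if __name__ == "__main__":
    # The dd "count" values from the 2^31, 2^32, 2^33 and 2^34 tests.
    for count in (2048, 4096, 8192, 16384):
        size = count * MIB
        print(count, size, as_int32(size), as_uint32(size))
```

A 2^31-byte file already becomes negative in a signed 32-bit field, and any size that is an exact multiple of 2^32 bytes (the 4096, 8192 and 16384 MiB cases) wraps to 0 in either representation.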

I suspect the Globus GASS server library inside the gahp_server is reading the entire file into memory before or as it writes it out to disk. The GASS file transfer protocol that we use with GRAM isn't suited for multi-gigabyte files.
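
To make the memory point concrete, here is a minimal Python sketch contrasting the two copying styles. It is not the GASS library's code, and the function names copy_whole and copy_in_chunks and the 1 MiB chunk size are assumptions chosen for illustration. The first function holds the whole file in memory at once, which is the kind of behavior that would match the OOM kills above if the suspicion is right; the second keeps memory use bounded by the chunk size regardless of file size.

```python
CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice for this sketch


def copy_whole(src_path, dst_path):
    """Read the entire source into memory, then write it out.

    Peak memory grows with the file size.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(src.read())


def copy_in_chunks(src_path, dst_path, chunk_size=CHUNK_SIZE):
    """Copy one fixed-size chunk at a time; returns the number of bytes copied.

    Peak memory stays near chunk_size no matter how large the file is.
    """
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            dst.write(chunk)
            copied += len(chunk)
    return copied
```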

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project