
Re: [HTCondor-users] Corrupt files on HTCondor transfer to node



Hello,

Well, I ran the set of jobs twice, using the same procedure. The first time it worked; the second time I got some errors.

On the submission node, the file looks like this:

-rw-rw-r-- 1 myuser mygroup 110193 Abr  5 07:45 pack_17.tar.gz

In the script that runs on the execution node, I inserted an "ls -al", and it shows a smaller file:

-rw-rw-r-- 1 nobody nogroup 49152 Apr  5 07:44 pack_17.tar.gz

Note: the first time I ran the procedure, it worked.

Maybe it is an issue with the file transfer? Any ideas? Is there a way for the job to signal that it should be retried?
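
One possible way to approach the "retry" idea, as a minimal sketch rather than an HTCondor feature: the job could check the archive before unpacking it and fail loudly if it is incomplete. The function name `archive_is_complete` and the `expected_size` parameter are hypothetical, chosen only for illustration; the check relies on Python's standard `gzip` module, which raises an error when a compressed stream ends early.

```python
import gzip
import os
import zlib


def archive_is_complete(path, expected_size=None, chunk_size=1 << 20):
    """Return True if `path` matches `expected_size` (when given) and its
    gzip stream can be decompressed all the way to the end."""
    # Cheap check first: compare against the size seen on the submission node.
    if expected_size is not None and os.path.getsize(path) != expected_size:
        return False
    try:
        with gzip.open(path, "rb") as stream:
            # Decompress in chunks and discard the data; a truncated or
            # damaged stream raises before the loop finishes.
            while stream.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

A wrapper script could call `archive_is_complete("pack_17.tar.gz", expected_size=110193)` and exit with a non-zero status when it returns False, so that a truncated transfer shows up as a failed job in the log instead of "return value 0". Whether and how the job is then re-run depends on the retry policy set in the submit description, which is not shown here.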

Thank you!!!

PS: The log seems ok...

000 (224459.000.000) 04/05 07:43:30 Job submitted from host: <192.168.0.2:45849?addrs=192.168.0.2-45849>
...
001 (224459.000.000) 04/05 07:44:08 Job executing on host: <192.168.0.4:30669?addrs=192.168.0.4-30669>
...
006 (224459.000.000) 04/05 07:44:10 Image size of job updated: 53988
0  -  MemoryUsage of job (MB)
0  -  ResidentSetSize of job (KB)
...
005 (224459.000.000) 04/05 07:44:10 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
28617  -  Run Bytes Sent By Job
3882368  -  Run Bytes Received By Job
28617  -  Total Bytes Sent By Job
3882368  -  Total Bytes Received By Job
Partitionable Resources :     Usage   Request Allocated
   Cpus                 :                   1         1
   Disk (KB)            :      4000      4000 110960793
   Memory (MB)          :         0         4      1991
...




------------------------------------------------------------------------------------------------------------------------
Prof. Dr. Roberto Fernandes Tavares Neto
Departamento de Engenharia de Produção / Industrial Engineering Department
Universidade Federal de São Carlos
tavares@xxxxxxxxxxxxx   tel +55 16 3351-9532
http://www.dep.ufscar.br/tavares
------------------------------------------------------------------------------------------------------------------------

On Wed, Apr 4, 2018 at 6:07 PM, Zach Miller <zmiller@xxxxxxxxxxx> wrote:
To narrow down the source of the problem, can you have your job print the md5sum of the file before unpacking it? And perhaps the file size? (To see if it's being corrupted versus truncated somehow)
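
A minimal sketch of such a check, assuming the job can run a short Python step before unpacking: the helper below returns the size and MD5 digest of a file so both can be printed into the job's output. The name `describe_file` is illustrative only and is not part of HTCondor; running `md5sum` and `ls -l` from a shell wrapper would give the same information.

```python
import hashlib
import os


def describe_file(path, chunk_size=1 << 20):
    """Return (size_in_bytes, md5_hex_digest) for the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so a large archive does not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()
```

For example, `print(describe_file("pack_17.tar.gz"))` on both the submission node and the execution node gives two pairs to compare: a smaller size on the execution side points to truncation, while an equal size with a different digest points to corruption.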


Cheers,
-zach


> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@cs.wisc.edu> On Behalf Of
> Roberto Tavares
> Sent: Wednesday, April 04, 2018 3:20 PM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: [HTCondor-users] Corrupt files on HTCondor transfer to node
>
> Hello,
>
> I'm having some trouble when running multiple jobs on HTCondor. My only
> guess is that at some random moment a transferred file is corrupted during
> the transmission procedure.
>
> What I have:
>
> Several .tar.gz files (data files). Let's say one of those files is
> pack.tar.gz.
>
> Several tests (1, 2, 3, ..., 12) that use pack.tar.gz.
>
> pack.tar.gz is a valid file (it can be uncompressed on the submission node).
>
> Of the 12 tests, 11 work. In one test (a random one), I get the following error:
>
> gzip: stdin: unexpected end of file
> tar: Child returned status 1
> tar: Error is not recoverable: exiting now
>
> The testing procedure is the same for every test (only some parameters
> change in the following steps).
>
> The only thing that I can imagine is that the file transfer at some point
> fails (maybe a network issue?).
>
> Is there a way to solve this problem?
>
> Thanks
>
> Roberto
