
Re: [HTCondor-users] Corrupt files on HTCondor transfer to node



Hi Roberto,


I'm assuming there's more to your script (I don't see tarId or groupSize defined...)


This line here:

  command = ("tar -zcf ./sandbox/pack_%d.tar.gz commandLine.txt ./data/%s/"%(tarId,d)) + (" ./data/%s/"%d).join(selectedFiles)

It seems you are creating the file "pack_<tarId>.tar.gz", but tarId never changes in the code you sent me, so you are overwriting that same file on every iteration of the loop.
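
To make the collision concrete, here is a minimal sketch (not your actual script). The function name archive_names and its parameters are hypothetical, and it assumes tarId was meant to be a counter that goes up once per group of files:

```python
def archive_names(file_groups, start_id=0, increment=True):
    """Return the archive name that would be used for each group of files.

    With increment=False the ID never changes, which mimics the posted loop.
    (Hypothetical helper, for illustration only.)
    """
    names = []
    tar_id = start_id
    for _group in file_groups:
        names.append("./sandbox/pack_%d.tar.gz" % tar_id)
        if increment:
            tar_id += 1  # give the next group its own archive
    return names


groups = [["a.dat", "b.dat"], ["c.dat", "d.dat"], ["e.dat"]]
fixed = archive_names(groups, start_id=17, increment=False)
unique = archive_names(groups, start_id=17, increment=True)

# Number of distinct archive names in each case.
print(len(set(fixed)), len(set(unique)))  # prints: 1 3
```

With a constant ID all three groups map to the same pack file; with a counter each group gets its own.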

Presumably your submit file then references that pack file as an input file.  If so, you are overwriting it after submitting a job that uses it, and when the job starts HTCondor may be transferring a file that is only partially written (and hence truncated).
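
One way to rule out partially written archives, whatever the root cause turns out to be, is to build each archive under a temporary name and rename it into place only once it is complete. The following is a rough sketch under my own assumptions: build_archive_atomically is a hypothetical helper, it uses Python's tarfile module instead of shelling out to tar, and it relies on the temporary file living in the same directory (and so the same filesystem) as the final archive.

```python
import os
import tarfile
import tempfile


def build_archive_atomically(archive_path, members):
    """Write a .tar.gz to a temporary file, then rename it into place.

    Hypothetical helper: a reader of archive_path sees either the previous
    complete file or the new complete file, never a half-written one.
    """
    target_dir = os.path.dirname(os.path.abspath(archive_path))
    # Create the temporary file next to the target so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(suffix=".tmp", dir=target_dir)
    os.close(fd)
    try:
        with tarfile.open(tmp_path, "w:gz") as tar:
            for member in members:
                tar.add(member)
        # Swap the finished archive into place in a single step.
        os.replace(tmp_path, archive_path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return archive_path
```

Note that this only guards against truncated reads.  Each job still needs its own archive name, otherwise a later group can replace the contents an earlier job expected.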

Maybe you meant (d,tarId) instead of (tarId,d)?

If I'm mistaken, can you please send me your entire script so I can see the context?  (Off-list is fine if you'd prefer.)


Cheers,
-zach




> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of
> Roberto Tavares
> Sent: Thursday, April 05, 2018 11:54 AM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] Corrupt files on HTCondor transfer to node
> 
> Hi,
> 
> 
> 
> 	Is it possible pack_17.tar.gz on the submit side is still being
> created at the time when your HTCondor job starts?
> 
> 
> 
> I can't see how... I used the Python script below:
> 
> for d in os.listdir("./data"):
>     files = os.listdir("./data/%s"%d)
>     for i in range(0, len(files), groupSize):
>         if len(sys.argv) == 1:
>             selectedFiles = files[i:i+groupSize]
>             os.system("echo '%s' > commandLine.txt"%configuration)
>             command = ("tar -zcf ./sandbox/pack_%d.tar.gz commandLine.txt ./data/%s/"%(tarId,d)) + (" ./data/%s/"%d).join(selectedFiles)
>             os.system(command)
>             os.system("rm commandLine.txt")
>             for teste in testes:
>                 configuration = "--method=%s --stage2=%s --stage3=%s --n=%s"%(teste["init"], teste["stage2"], teste["stage3"], teste["n"])
>                 command = "condor_submit condorexecfile n=%d alg=%s"%(tarId, teste["alg"])
>                 os.system(command)
> 
> 
> 
> 	As Greg said, the timestamps are odd.  Perhaps the HTCondor job was
> launched before pack_17.tar.gz was actually ready?  Or something else
> outside of HTCondor is modifying it?  Are these pack_ files static or being
> created dynamically as part of a workflow?
> 
> 
> 
> I create the pack files just once, and I synced the clocks with a time
> server. Still the same issue...
> 
> Ideas?
> 
> Thank you!!!
> 
> Roberto
> 
> 
> 
> 
> 
> 
> 	Cheers,
> 	-zach
> 
> 
> 
> 	> -----Original Message-----
> 	> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx
> <mailto:htcondor-users-bounces@xxxxxxxxxxx> > On Behalf Of Greg
> 	> Thain
> 	> Sent: Thursday, April 05, 2018 11:00 AM
> 	> To: htcondor-users@xxxxxxxxxxx <mailto:htcondor-users@xxxxxxxxxxx>
> 	> Subject: Re: [HTCondor-users] Corrupt files on HTCondor transfer
> to node
> 	>
> 	> On 04/05/2018 08:17 AM, Roberto Tavares wrote:
> 	>
> 	>
> 	>       Hello,
> 	>
> 	>       Well, I ran the set of jobs twice (same procedure). The first
> 	> time, it worked. The second time, I got some errors.
> 	>
> 	>       On the submission node, I got
> 	>
> 	>       -rw-rw-r-- 1 myuser mygroup 110193 Abr  5 07:45 pack_17.tar.gz
> 	>
> 	>
> 	>       On the execute side, I inserted an "ls -al" and got a
> 	> smaller file:
> 	>
> 	>       -rw-rw-r-- 1 nobody nogroup 49152 Apr  5 07:44 pack_17.tar.gz
> 	>
> 	>
> 	>
> 	> Assuming your clocks are synchronized across submit and execute
> 	> machines, these timestamps seem suspicious.
> 	>
> 	> -greg
> 	>
> 	>
> 
> 
> 