
Re: [Condor-users] problems using transfer_output_remaps



Hi Dan,

After using these suggestions, as well as working out some additional configuration issues on my end, I think I've gotten this all working as I'd like to. Much appreciated.



From: Dan Bradley <dan@xxxxxxxxxxxx>
Date: January 20, 2006 7:22:43 AM PST
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Subject: Re: [Condor-users] problems using transfer_output_remaps
Reply-To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>

Adam,

Two things:

1. transfer_output_remaps only applies when you turn on Condor's
file-transfer mode. I will add a warning so that condor_submit lets you
know when you use remaps but have not turned on file-transfers. Here is
how you turn on transfers:

ShouldTransferFiles = True
WhenToTransferOutput = ON_EXIT


2. The output and error files are handled specially for you, so you
should never need to explicitly "remap" them.  For these files, just
specify the final path where you want the files.  (And make sure you
have turned on file-transfers.)  If you look at the resulting ClassAd
(with condor_q -l), you will see that your error file will be
automatically modified to a temporary filename that will be used in the
execute directory, and on download, it will be remapped to the final
path that you specified.
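
A minimal sketch of a submit file that follows both points above, not
a tested configuration. It uses the documented condor_submit spellings
should_transfer_files and when_to_transfer_output for the two settings
in point 1. The file name result.dat and the logs/ destinations are
assumptions made for illustration; they do not come from the original
job.

# Turn on file transfer so that output handling and remaps take effect.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

Universe                = vanilla
Executable              = /home/alathers/condor_matlab/condor_test/matlab.sh
Arguments               = A

# Point Error straight at its final destination; no remap is needed.
Error                   = /home/alathers/condor_matlab/logs/A.err
Log                     = /home/alathers/condor_matlab/logs/A.log

# Hypothetical file written by the job in its execute directory.
# The left side of a remap is the name the job writes, and the right
# side is where the file should land on the submit machine.
transfer_output_files   = result.dat
transfer_output_remaps  = "result.dat = /home/alathers/condor_matlab/logs/A.result.dat"

Queue

Once the job is queued, something like condor_q -l <cluster>.<proc>
(with the real job id filled in) should show the rewritten Err
attribute and the TransferOutputRemaps entry described in point 2.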

--Dan

Adam Lathers wrote:

Hi all,

	I'm having some issues using the transfer_output_remaps option in a
submit file.  Specifically, I'm submitting a DAG as a proof of
concept to work out the bugs before implementing a similar solution
for our big data processing codes.  Essentially, the layout of our
architecture looks something like this.  Our pool manager host
(schedd, collector, negotiator), exists "outside" our trusted realm,
so it has no access to our shared filesystem.  All the worker nodes
exist inside the trusted realm, and all share a filesystem.  (Yes, I
know there are some security paradigm issues there, but I can't solve
those presently).  What I do need to deal with is that the data we
will be working with is "big": total in and out data is on the order
of 100GB at present, and it's not segmented into "small" pieces, so
each worker node, were it to ship the input data, would have to grab
a 20-50GB dataset before processing started.

	My goal in the short term is basically this.  I'd like to rely on
the shared file system, and just "mimic" what I need to on the submit
node.  Thus far, this works, but to make it happen, I need to
duplicate a directory structure on the submit node to look just like
the worker nodes.  What I'd "prefer" to do is leverage the
transfer_output_remaps option, so that when logs and output and such
get shipped back to the submit machine, they just go into a single
large log directory, with some sort of intelligent naming mechanism.


An example submit file that I've tried looks something like this.
(Note: for the transfer_output_remaps, I've also tried just naming
A.err and so on.  Maybe I just missed the proper permutation?)

Universe        = vanilla
Executable      = /home/alathers/condor_matlab/condor_test/matlab.sh
InitialDir      = /home/alathers/condor_matlab/condor_test
Error           = /home/alathers/condor_matlab/condor_test_submitdir/A.err
Log             = /home/alathers/condor_matlab/condor_test_submitdir/A.log
transfer_output_remaps = "/home/alathers/condor_matlab/condor_test_submitdir/A.err = /home/alathers/condor_matlab/logs/A.err"
GetEnv          = true
Arguments	= A
Requirements 	= FileSystemDomain == "ncmir.ucsd.edu"
Notification    = Error
Notify_user     = alathers@xxxxxxxxxxxxxx
Queue


	In the end, when the job finishes, the .log and .err files are sent
back to the submit node, and put in
/home/alathers/condor_matlab/condor_test_submitdir/

	I'm sure I'm forgetting some vital piece of info, so please feel
free to let me know.  Any thoughts, or insight would be REALLY
appreciated.  As noted, I know there are a LOT of problems with the
present approach, but for various reasons my role is to solve this
step first, before redesigning the process.  Thanx everyone.


_______________________________________________________
Adam Lathers
NCMIR: National Center for Microscopy and Imaging Research
Distributed Systems Engineer
phone: (858) 822-0735
fax:   (858) 822-0828
web:   http://ncmir.ucsd.edu


_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users


