
[Condor-users] problems using transfer_output_remaps



Hi all,

I'm having some issues using the transfer_output_remaps option in a submit file. Specifically, I'm submitting a DAG as a proof of concept to work out the bugs before implementing a similar solution for our big data processing codes. The layout of our architecture looks something like this. Our pool manager host (schedd, collector, negotiator) exists "outside" our trusted realm, so it has no access to our shared filesystem. All the worker nodes exist inside the trusted realm, and all share a filesystem. (Yes, I know there are some security paradigm issues there, but I can't solve those presently.) What I do need to deal with is that the data we will be working with is "big": total in and out data is on the order of 100GB at the moment, and it's not segmented into "small" pieces, so each worker node, if the input data were shipped to it, would have to grab a 20-50GB dataset before processing started.

My goal in the short term is basically this. I'd like to rely on the shared file system, and just "mimic" what I need to on the submit node. Thus far, this works, but to make it happen, I need to duplicate a directory structure on the submit node to look just like the worker nodes. What I'd "prefer" to do is leverage the transfer_output_remaps option, so that when logs, output, and so on get shipped back to the submit machine, they all go into a single large log directory, with some sort of intelligent naming mechanism.
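
A minimal sketch of the "intelligent naming" idea, in Python. It only builds the quoted, semicolon-separated "name = newname" string that transfer_output_remaps takes. The helper name build_output_remaps, the job_tag prefix "run1", and the A.out file are hypothetical and used for illustration only. The sketch assumes the left-hand side of each pair is the bare file name as the job writes it. It does not show whether Condor applies the remap to the Error or Log files in this setup.

```python
from pathlib import PurePosixPath


def build_output_remaps(filenames, log_dir, job_tag):
    """Build a transfer_output_remaps value that sends each output file
    to a single log directory, prefixed with a per-job tag.

    filenames: output file names as the job writes them (assumption)
    log_dir:   destination directory on the submit machine
    job_tag:   hypothetical prefix used to keep names unique
    """
    pairs = []
    for name in filenames:
        base = PurePosixPath(name).name          # strip any directory part
        target = PurePosixPath(log_dir) / f"{job_tag}.{base}"
        pairs.append(f"{base} = {target}")
    # Pairs are separated by semicolons and the whole value is double-quoted.
    return '"' + " ; ".join(pairs) + '"'


if __name__ == "__main__":
    remaps = build_output_remaps(
        ["A.err", "A.out"],                      # A.out is a made-up example
        "/home/alathers/condor_matlab/logs",
        "run1",
    )
    print("transfer_output_remaps = " + remaps)
```

The printed line could be pasted into a submit file, or written there by whatever script generates the DAG's submit files.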


An example submit file that I've tried looks something like this.
(Note: for the transfer_output_remaps, I've also tried naming just A.err and so on, without the path. Maybe I just missed the proper permutation?)

Universe        = vanilla
Executable      = /home/alathers/condor_matlab/condor_test/matlab.sh
InitialDir      = /home/alathers/condor_matlab/condor_test
Error           = /home/alathers/condor_matlab/condor_test_submitdir/A.err
Log             = /home/alathers/condor_matlab/condor_test_submitdir/A.log
transfer_output_remaps = "/home/alathers/condor_matlab/condor_test_submitdir/A.err = /home/alathers/condor_matlab/logs/A.err"
GetEnv          = true
Arguments	= A
Requirements 	= FileSystemDomain == "ncmir.ucsd.edu"
Notification    = Error
Notify_user     = alathers@xxxxxxxxxxxxxx
Queue


In the end, when the job finishes, the .log and .err files are sent back to the submit node and put in /home/alathers/condor_matlab/condor_test_submitdir/, not in the logs directory named in the remap.

I'm sure I'm forgetting some vital piece of info, so please feel free to let me know. Any thoughts, or insight would be REALLY appreciated. As noted, I know there are a LOT of problems with the present approach, but for various reasons my role is to solve this step first, before redesigning the process. Thanx everyone.


_______________________________________________________
Adam Lathers
NCMIR: National Center for Microscopy and Imaging Research
Distributed Systems Engineer
phone: (858) 822-0735
fax:   (858) 822-0828
web:   http://ncmir.ucsd.edu