
[Condor-users] Passing condor_dagman args with condor_submit_dag?



Hi,

I am attempting to use Condor for a large distributed batch processing project. I'm using condor_dagman as a meta-scheduler to limit the number of jobs that run at the same time. I've organized each job to be an iteration of a loop, and I have two levels of nested loops. Let me throw some numbers out there: my outer loop iterates 100 times, and my inner loop iterates 1000 times (each of these loops contains a DAG). I am implementing the looping by unrolling each logical loop into a dynamically generated DAG file.

While my solution might prevent Condor's scheduler from getting overloaded with jobs, I am faced with another problem: organizing the files on disk so that one directory doesn't contain on the order of 100 * 1000 = 100,000 submit files (plus a multiple of that for output and log files). I'm starting with the obvious: make a directory and subdirectory for each iteration of the inner loop. However, I am running into a problem.
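A minimal sketch of that layout, for illustration only: it creates one subdirectory per iteration and writes an outer DAG file whose JOB lines point into those subdirectories, in the same form as the example further down. The function names, the file name outer.dag, and the choice of Python are assumptions made for this sketch; they are not part of Condor.

```python
import os

def outer_dag_lines(iterations):
    """Return one JOB line per iteration, pointing into that iteration's subdirectory."""
    lines = []
    for i in iterations:
        name = f"maindag_{i}"
        # The submit file path is relative to the directory holding the outer DAG.
        lines.append(f"JOB {name.upper()} {i}/{name}.dag.condor.sub")
    return lines

def write_layout(root, iterations):
    """Create one subdirectory per iteration and write the outer DAG file.

    'outer.dag' is a hypothetical file name chosen for this sketch.
    """
    for i in iterations:
        os.makedirs(os.path.join(root, str(i)), exist_ok=True)
    dag_path = os.path.join(root, "outer.dag")
    with open(dag_path, "w") as f:
        f.write("\n".join(outer_dag_lines(iterations)) + "\n")
    return dag_path
```

Calling write_layout("/some/dir", [111, 222]) would create the directories 111 and 222 and an outer.dag containing the two JOB lines; the inner *.dag and *.dag.condor.sub files would still have to be generated in each subdirectory.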

condor_submit_dag -no_submit accepts my *.dag file and produces a *.dag.condor.sub file, but I am having trouble properly referencing this *.dag.condor.sub file from a *.dag file in the parent directory. I think this is because condor_submit_dag does not let me configure some of condor_dagman's arguments in the submit file it generates. For example:

outer *.dag file:

JOB MAINDAG_111 111/maindag_111.dag.condor.sub
JOB MAINDAG_222 222/maindag_222.dag.condor.sub

111/maindag_111.dag.condor.sub:

# Filename: maindag_111.dag.condor.sub
# Generated by condor_submit_dag maindag_111.dag
universe        = scheduler
executable      = /opt/condor/bin/condor_dagman
getenv          = True
output          = maindag_111.dag.lib.out
error           = maindag_111.dag.lib.out
log             = maindag_111.dag.dagman.log
remove_kill_sig = SIGUSR1
on_exit_remove  = (ExitBySignal == false || ExitSignal =!= 9)
arguments = -f -l . -Debug 3 -Lockfile maindag_111.dag.lock -Condorlog /tmp/exp6/111/process_a_111.log -Dag maindag_111.dag -Rescue maindag_111.dag.rescue -MaxIdle 5 -MaxJobs 1 -UseDagDir
environment = _CONDOR_DAGMAN_LOG=maindag_111.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
queue

When the outer condor_dagman reads and tries to execute the inner loop's condor_dagman, it fails, because it looks in the outer directory for maindag_111.dag rather than in the directory 111 (where the above submit file, and anything related to maindag_111*, is).
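The mismatch can be illustrated with plain path resolution. This is only a sketch of the symptom, not of what condor_dagman does internally; the outer directory /tmp/exp6 is an assumption inferred from the -Condorlog path above, and resolve() is a hypothetical helper.

```python
import posixpath

def resolve(cwd, filename):
    """Resolve a relative filename against a given working directory."""
    return posixpath.normpath(posixpath.join(cwd, filename))

outer_dir = "/tmp/exp6"  # assumed directory of the outer *.dag file
inner_dir = posixpath.join(outer_dir, "111")

# Where the relative -Dag argument ends up when resolved from the outer directory.
looked_up = resolve(outer_dir, "maindag_111.dag")
# Where the file actually lives.
actual = resolve(inner_dir, "maindag_111.dag")

print(looked_up)
print(actual)
```

The relative file names in the generated submit file only make sense if the inner condor_dagman starts in directory 111.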

Is there a way I can tell condor_submit_dag to pass particular arguments (e.g. -Dag, -Rescue, output files) to the submit file it generates?

It would be cool if there were a way to get condor_dagman to chdir() into a directory before executing. I looked at -UseDagDir, but that would put the output/log files in the parent directory, which is something I am trying to avoid.

I guess I could write my own condor_submit_dag too, but I'd rather not go to that extreme. :-)

Any insight would be great.  Thanks!

 - Armen

--
Armen Babikyan
MIT Lincoln Laboratory
armenb@xxxxxxxxxx . 781-981-1796