
[HTCondor-users] A few questions about DAGMan



$CondorVersion: 8.2.1 Jun 27 2014 BuildID: 256063 $
$CondorPlatform: x86_64_Windows8 $  [<=== Windows8?? No, it's Windows 7]

I've got DAGMan running but find I have a few questions.

1. I am running 1000 to 2000 independent jobs and, when they all finish, I want to run a script at my console. The only way I could see to do this was to create a dummy script (dummy.bat, which simply returns after doing nothing) and put the following in the .dag file:

JOB RUN-1 dsm2.sub
VARS RUN-1 JOBNO="$(JOB)"
RETRY RUN-1 2
JOB RUN-2 dsm2.sub
VARS RUN-2 JOBNO="$(JOB)"
RETRY RUN-2 2
. . . . .
JOB RUN-2102 dsm2.sub
VARS RUN-2102 JOBNO="$(JOB)"
RETRY RUN-2102 2
JOB lastone dummy.bat
SCRIPT POST lastone done.bat ../dsm2.ctl
PARENT RUN-1 RUN-2 RUN-3. . . RUN-2102 CHILD lastone

Well, this works, but the last submitted job is pointless. I'd rather do the following, which didn't work:

JOB RUN-1 dsm2.sub
...blah blah..
SCRIPT POST done.bat ../dsm2.ctl

Why can't DAGMan simply run a script at the end, on the submitting machine? Then I wouldn't need the dummy submit job, nor the very long PARENT line (which Windows won't even create in a batch file as it's too long).
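For what it's worth, the long PARENT line does not have to come from a batch file; the whole .dag file can be written by a small script instead. Below is a minimal sketch in Python, assuming Python is available on the submit machine. The function name build_dag and its parameter names are hypothetical; the node names, submit file, and POST script are the ones shown in the .dag excerpt above.

```python
def build_dag(n_jobs, submit_file="dsm2.sub", retries=2,
              last_job_file="dummy.bat", post_script="done.bat",
              post_args="../dsm2.ctl"):
    """Return the text of a .dag file laid out like the excerpt above.

    build_dag and its parameters are illustrative names, not part of
    HTCondor; only the emitted DAG lines mirror the original file.
    """
    names = [f"RUN-{i}" for i in range(1, n_jobs + 1)]
    lines = []
    for name in names:
        # Three lines per independent job, as in the original file.
        lines.append(f"JOB {name} {submit_file}")
        lines.append(f'VARS {name} JOBNO="$(JOB)"')
        lines.append(f"RETRY {name} {retries}")
    # Dummy last node whose POST script runs on the submit machine.
    lines.append(f"JOB lastone {last_job_file}")
    lines.append(f"SCRIPT POST lastone {post_script} {post_args}")
    # A single PARENT line naming every job; its length is not
    # constrained by the batch-file line limit when written this way.
    lines.append("PARENT " + " ".join(names) + " CHILD lastone")
    return "\n".join(lines) + "\n"
```

Writing the result of build_dag(2102) to a file (the file name is up to you) would reproduce the structure above without typing or batch-generating the PARENT line by hand.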

2. I limit the number of jobs initially with
DAGMAN_MAX_JOBS_SUBMITTED = 60
in the dagman.config file.
However, I'd like to change the limit while the DAG is running. Is there a way to do this?

3. It seems the config variable DAGMAN_MAX_JOBS_IDLE in fact counts jobs in either the IDLE or HELD state, not just IDLE. Is that correct?

4. In the .dag file I use the RETRY keyword in the hope that it will retry jobs that failed and are HELD. Will it indeed do this? Right now I have to issue condor_release -all occasionally, but I'd rather automate the retrying of failed jobs (they almost always work on retry).

Many Thanks-
Ralph Finch
Calif. Dept. of Water Resources
Sacramento, CA USA