
Re: [HTCondor-users] A few questions about DAGMan



On Tue, 9 Sep 2014, Ralph Finch wrote:

> 1. Basically I am running 1000 to 2000 independent jobs, and when they all
> finish, run a script at my console. The only way I could see how to do this
> was to create a dummy script (dummy.bat which simply returns after doing
> nothing), and in the .dag file:
>
>   JOB lastone dummy.bat
>   SCRIPT POST lastone done.bat ../dsm2.ctl
>   PARENT RUN-1 RUN-2 RUN-3 ... RUN-2102 CHILD lastone
>
> Well this works, but the last submitted job is pointless. I'd rather do the
> following which didn't work:
>
>   JOB RUN-1 dsm2.sub
>   ...blah blah..
>   SCRIPT POST done.bat ../dsm2.ctl
>
> Why can't DAGMan simply run a script at the end, on the submitting machine?
> Then I wouldn't need the dummy submit job, nor the very long PARENT line
> (which Windows won't even create in a batch file as it's too long).

Note that you can split the parent-child relationships up into multiple lines, e.g.:

  parent RUN-1 child lastone
  parent RUN-2 child lastone
  parent RUN-3 child lastone
  ...
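
If writing out a couple thousand of those lines by hand isn't practical, a short script can generate them. Here is a minimal sketch as a Windows batch file (it assumes your DAG file is named my.dag and your nodes really are RUN-1 through RUN-2102; adjust to taste):

  @echo off
  rem Append one parent/child line per node job to the DAG file.
  for /L %%i in (1,1,2102) do echo parent RUN-%%i child lastone>> my.dag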

As Zach suggested, though, you could also use a FINAL node, which would mean you wouldn't have to worry about the parent-child relationships at all.

Also note that you can use the NOOP node functionality:

  JOB|FINAL lastone dummy.bat NOOP
  SCRIPT POST lastone ...

I can't remember whether dummy.bat even has to exist for this to work, but at any rate no HTCondor job will actually be submitted for that node.
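
Putting those two together, something like this sketch should run done.bat on the submit machine once everything else in the DAG has finished, without a real HTCondor job and without any PARENT lines (untested, so check it against the manual; also note that a FINAL node runs whether the rest of the DAG succeeded or not):

  FINAL lastone dummy.bat NOOP
  SCRIPT POST lastone done.bat ../dsm2.ctl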

> 2. I limit the number of jobs initially with
>   DAGMAN_MAX_JOBS_SUBMITTED = 60
> in the dagman.config file.
> However I'd like to change the limit while the DAG is running. Is there a
> way to do this?

Right now there is not a clean way to do this. The "dirty" way is as follows:
1) Do condor_hold on the DAGMan job.
2) Edit your dagman.config to change DAGMAN_MAX_JOBS_SUBMITTED.
3) Do condor_release on the DAGMan job.

We want to implement a clean method, but that's some time in the future...
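
Concretely, that looks something like this, where 123456 stands in for whatever job id condor_q shows for your DAGMan job:

  condor_hold 123456
  rem ...now edit dagman.config and change DAGMAN_MAX_JOBS_SUBMITTED...
  condor_release 123456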

> 3. It seems the config variable DAGMAN_MAX_JOBS_IDLE in fact is for jobs in
> either the IDLE or HELD state, not just IDLE, is that correct?

Yes, that's right.  Basically any job that's in the queue but not running.
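
So, for example, with something like this in your dagman.config (60 is just an illustrative value), DAGMan will stop submitting new node jobs whenever the number of its jobs sitting idle or held in the queue reaches 60:

  DAGMAN_MAX_JOBS_IDLE = 60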

> 4. In the .dag file I use the RETRY keyword with the hopes it will retry
> jobs that failed and are HELD. Will it indeed do this? Right now I'm having
> to occasionally issue condor_release -all, but I'd rather automate the
> re-trying of failed jobs (they almost always work on retry).

Jobs that fail and go into the HELD state are not considered failed by DAGMan; a node job is only considered failed at the DAG level once it actually leaves the queue. You might want to include a periodic_remove expression in your submit files that removes a job if it goes on hold; that makes the node fail and triggers the DAGMan-level RETRY. Alternatively, you could set a periodic_release expression if the jobs will work after they're released. Either way, you need to handle this at the job level, not the DAG level.
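
For example, something along these lines in the submit files (a sketch only; the 600-second threshold and the limit of 5 starts are arbitrary, and you'll want to tailor the expressions to why your jobs go on hold) would either release a held job automatically or remove it so that RETRY resubmits it:

  # Option 1: automatically release a held job, up to a few restarts.
  periodic_release = (JobStatus == 5) && (NumJobStarts < 5) && ((time() - EnteredCurrentStatus) > 600)

  # Option 2: remove a job once it has been held for 10 minutes,
  # so the node fails and the DAGMan-level RETRY takes over.
  periodic_remove = (JobStatus == 5) && ((time() - EnteredCurrentStatus) > 600)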

Kent Wenger
CHTC Team