
Re: [Condor-users] Dagman Newbie Questions



On Thu, 9 Oct 2008, Jeremy Yabrow wrote:

It seems that condor_q only shows the dag jobs that are running or that CAN run. Jobs that are blocked because prerequisite jobs haven't finished yet are not shown. Also, if I keep the jobs in the system and attempt to re-run them (using condor_hold & condor_release), then the dependencies are not obeyed during this subsequent "run".

Yes, that's right. Once a job is submitted, DAGMan doesn't do anything to it (besides removing it if you remove the condor_dagman job itself).

If I understand what's going on, the dagman job appears to be simply a job that submits other jobs, and the downstream jobs are not even submitted until their prerequisite jobs have run. Subsequent runs can only be started using the .rescue file. Is this correct?

Well, the first part of this is basically correct.  DAGMan does a little
more than just submitting jobs, but yes, jobs whose prerequisites are not
satisfied are not submitted to Condor.
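
For reference, here's a minimal sketch of how those dependencies are
declared in a DAG file (the node names and submit file names here are
just placeholders). DAGMan won't submit B or C until A finishes
successfully, and won't submit D until both B and C have finished:

    # diamond.dag -- a minimal example DAG
    JOB A a.sub
    JOB B b.sub
    JOB C c.sub
    JOB D d.sub
    PARENT A CHILD B C
    PARENT B C CHILD D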

As far as the rescue DAG goes, though, you only get a rescue DAG if the workflow fails or if you condor_rm the condor_dagman job. If you do run
a rescue DAG, you don't re-run all of the jobs, only the ones that didn't
finish (or were not run at all) the first time around.
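
To run a rescue DAG, you submit the rescue file itself; the exact file
name depends on your Condor version, but it's typically the original DAG
file name with a .rescue suffix, so something like:

    condor_submit_dag <whatever>.dag.rescue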

If you want to re-run a DAG from scratch, you need to do

    condor_submit_dag -f <whatever>.dag

This will re-run the whole DAG regardless of whether it succeeded the first time.

This has consequences for us because in our business, deadlines are critical and resource utilization must be maximized, so progressive estimates of completion and remaining work are necessary. We need all nodes of the entire DAG to be present in the system to estimate resource use, even though many of the nodes may be blocked waiting for prerequisite nodes. Jobs that submit other jobs are a nasty surprise for our resource managers. Also, re-running a node and its dependent nodes is fairly common and is often done many times during pipeline troubleshooting; we don't want to have to re-submit the entire DAG several times in separate runs, because there may be long-running nodes in the DAG we want to keep going in parallel while we work on other "broken" nodes.

As far as estimates of completion go, you can get some idea by looking at the dagman.out file, where you'll find periodic updates like this:

7/10 17:20:40  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
7/10 17:20:40   ===     ===      ===     ===     ===        ===      ===
7/10 17:20:40     1       0        0       0       2          1        0

Of course, this doesn't account for differences in resource usage between different nodes of the DAG.
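
If you want to pull the most recent summary out of the file
programmatically, something like this should work (assuming GNU grep;
substitute your actual dagman.out file name):

    # print the last node-status summary in the dagman.out file
    grep -A 2 'Done *Pre *Queued' <whatever>.dag.dagman.out | tail -3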

You can have DAGMan automatically re-try failed nodes if you want (with the RETRY keyword in the DAG file), but that doesn't let you manually re-start failed nodes.
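
A retry is just one line per node in the DAG file; for example, to
re-try a (hypothetical) node B up to three times before declaring it
failed:

    RETRY B 3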

If you absolutely have to have all of the jobs in the queue, though, I don't see any way to do this with DAGMan.

It looks to me like we'd have a hard time getting Condor/Dagman to support these needs. I'd love any advice / comments you might have on this.

Well, having DAGMan submit all of the jobs and then put them on hold if they're not ready (or something along those lines) would be a really
fundamental change in DAGMan, and I don't see that happening.  (And for
many users, the fact that not all of the jobs go into the queue right
away is a benefit, because it decreases the load on the Condor central manager and the schedd.)
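
Along those lines, condor_submit_dag also has throttling options if you
need to limit the load further; for example, -maxjobs caps how many node
jobs DAGMan will have submitted to Condor at any one time:

    condor_submit_dag -maxjobs 50 <whatever>.dag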

Sorry I haven't been of more help...

Kent Wenger
Condor Team