
Re: [Condor-users] Dagman Newbie Questions



On Thu, 9 Oct 2008, Jeremy Yabrow wrote:

It seems that condor_q only shows the dag jobs that are running or that CAN run. Jobs that are blocked because prerequisite jobs haven't finished yet are not shown. Also, if I keep the jobs in the system and attempt to re-run them (using condor_hold & condor_release), then the dependencies are not obeyed during this subsequent "run".

Yes, that's right. Once a job is submitted, DAGMan doesn't do anything to it (besides removing it if you remove the condor_dagman job itself).

If I understand what's going on, the dagman job appears to be simply a job that submits other jobs, and the downstream jobs are not even submitted until their prerequisite jobs have run. Subsequent runs can only be started using the .rescue file. Is this correct?

Well, the first part of this is basically correct.  DAGMan does a little
more than just submitting jobs, but yes, jobs whose prerequisites are not
satisfied are not submitted to Condor.
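
For reference, here's a minimal sketch of how those dependencies are
declared in a DAG file (the node names and submit file names here are
just placeholders). DAGMan won't submit B or C until A finishes
successfully, and won't submit D until both B and C have finished:

    # diamond.dag -- a minimal example DAG
    JOB A a.sub
    JOB B b.sub
    JOB C c.sub
    JOB D d.sub
    PARENT A CHILD B C
    PARENT B C CHILD D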

As far as the rescue DAG goes, though, you only get a rescue DAG if the workflow fails or if you condor_rm the condor_dagman job. If you do run
a rescue DAG, you don't re-run all of the jobs, only the ones that didn't
finish (or were not run at all) the first time around.
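
To run a rescue DAG, you submit the rescue file itself; the exact file
name depends on your Condor version, but it's typically the original DAG
file name with a .rescue suffix, so something like:

    condor_submit_dag <whatever>.dag.rescue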

If you want to re-run a DAG from scratch, you need to do

    condor_submit_dag -f <whatever>.dag

This will re-run the whole DAG regardless of whether it succeeded the first time.

This has consequences for us because in our business, deadlines are critical and resource utilization must be maximized, so progressive estimates of completion and remaining work are necessary. We need all nodes of the entire DAG to be present in the system to estimate resource use, even though many of the nodes may be blocked waiting for prerequisite nodes. Jobs that submit other jobs are a nasty surprise for our resource managers. Also, re-running a node and its dependent nodes is fairly common and is often done many times during pipeline troubleshooting; we don't want to have to re-submit the entire DAG several times in separate runs, because there may be long-running nodes in the DAG we want to keep going in parallel while we work on other "broken" nodes.

As far as estimates of completion go, you can get some idea by looking at the dagman.out file, where you'll find periodic updates like this:

7/10 17:20:40  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
7/10 17:20:40   ===     ===      ===     ===     ===        ===      ===
7/10 17:20:40     1       0        0       0       2          1        0

Of course, this doesn't account for differences in resource usage between different nodes of the DAG.
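
If you want to pull the most recent summary out of the file
programmatically, something like this should work (assuming GNU grep;
substitute your actual dagman.out file name):

    # print the last node-status summary in the dagman.out file
    grep -A 2 'Done *Pre *Queued' <whatever>.dag.dagman.out | tail -3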

You can have DAGMan automatically re-try failed nodes if you want (with the RETRY keyword in the DAG file), but that doesn't let you manually re-start failed nodes.
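
A retry is just one line per node in the DAG file; for example, to
re-try a (hypothetical) node B up to three times before declaring it
failed:

    RETRY B 3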

If you absolutely have to have all of the jobs in the queue, though, I don't see any way to do this with DAGMan.

It looks to me like we'd have a hard time getting Condor/Dagman to support these needs. I'd love any advice / comments you might have on this.

Well, having DAGMan submit all of the jobs and then put them on hold if they're not ready (or something along those lines) would be a really
fundamental change in DAGMan, and I don't see that happening.  (And for
many users, the fact that not all of the jobs go into the queue right
away is a benefit, because it decreases the load on the Condor central manager and the schedd.)
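
Along those lines, condor_submit_dag also has throttling options if you
need to limit the load further; for example, -maxjobs caps how many node
jobs DAGMan will have submitted to Condor at any one time:

    condor_submit_dag -maxjobs 50 <whatever>.dag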

Sorry I haven't been of more help...

Kent Wenger
Condor Team