
Re: [Condor-users] Dagman Newbie Questions



Thanks Szabolcs,

That's very useful information.  I am familiar with Alfred and Rush, so your comments were exactly what I was looking for.  It is great to hear from someone who understands the nature of render queueing in our industry.

It had occurred to me that we might be able to satisfy our requirements in a layer of our own above Condor, though that seemed like even more work on top of the submission, GUI, and resource-management layers we would already have to write.  I just wanted to know whether I was assessing Condor's built-in features and capabilities properly and not missing anything.

Your comment about the scheduler being "choked" is a very good point.  Systems like Rush spend far too much time iterating and communicating over enormous lists of low-level pending tasks when deciding which ones to run in any given cycle.  As I understand it, Alfred only considers one candidate task per job in each scheduling loop iteration, which keeps the number of tasks considered per loop at a manageable level.  Since Condor's scheduler never has to consider blocked downstream nodes, it should keep that number down "naturally" as well.

Thanks again!

Jeremy

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Horvátth Szabolcs
Sent: Friday, October 10, 2008 7:15 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Dagman Newbie Questions

Hi Jeremy,

> Advance warning: I'm a Condor newbie. I've been tasked with evaluating
> Condor as a queueing platform for CG & Animation production at our
> facility.
>
Please accept my most sincere condolences. :) While render management
with Condor is certainly possible (and fun), it requires a different
mindset compared to Alfred, Deadline, Rush, etc.

> If I understand what's going on, the dagman job appears to be simply a
> job that submits other jobs, and the downstream jobs are not even
> submitted until their prerequisite jobs have run. Subsequent runs can
> only be run again with the .rescue file. Is this correct?
>
Yes. DAGMan is just a regular Condor job that submits child jobs when
their dependencies allow and runs any pre/post scripts. To re-run a
branch, or even a single job of the DAG, you have to resubmit it, either
by using the rescue DAG functionality or by scripting your own command.
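
To make that concrete, here is a minimal sketch of a two-node DAG and how a rescue run works. The file names, submit descriptions and the post script are hypothetical, and the exact rescue-file name (and whether it is picked up automatically) depends on the Condor version:

    # render.dag -- hypothetical two-stage pipeline: rib generation, then rendering
    JOB  RibGen   ribgen.sub
    JOB  Render   render.sub
    PARENT RibGen CHILD Render
    SCRIPT POST Render check_frames.sh
    RETRY Render 2

    # submit the whole DAG as one DAGMan job
    condor_submit_dag render.dag

    # if a node fails, DAGMan writes a rescue DAG; resubmitting it re-runs
    # only the nodes that have not yet completed successfully
    condor_submit_dag render.dag.rescue
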
>
> This has consequences for us because in our business, deadlines are
> critical and resource utilization must be maximized. So progressive
> estimates of completion and remaining work are necessary. We need all
> nodes of the entire DAG to be present in the system to estimate
> resource use, even though many of the nodes may be blocked waiting for
> prerequisite nodes. Jobs that submit other jobs are a nasty surprise
> for our resource managers.
>
While I understand why you find it a limitation, it's actually a great
thing for queue load and schedule balancing. When shot TDs submit a
hundred render layers a night, each consisting of a few hundred jobs, it
could easily choke the scheduler.
If you need information about the whole DAG you still have multiple options:
- Add custom attributes to the DAGMan job that store how many tasks are
in the whole DAG, broken down by task type (ribgen / mi-file gen / render /
composite / whatever); see the sketch after this list.
- Make your job progress window parse the DAG files and build the GUI
from that data instead of the data queried from the queue or from the
Quill DB. It's a simple way to get the hierarchical information.
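
As a rough sketch of the first option (the attribute names, counts and file names below are made up, and the exact condor_submit_dag options may vary between Condor versions):

    # generate the DAGMan submit description without submitting it
    condor_submit_dag -no_submit render.dag

    # append custom ClassAd attributes; any "+Attr = value" line in the
    # submit description ends up as an attribute of the DAGMan job
    echo '+TotalRibGenTasks = 120' >> render.dag.condor.sub
    echo '+TotalRenderTasks = 480' >> render.dag.condor.sub
    condor_submit render.dag.condor.sub

    # a progress GUI can then read the totals back from the queue:
    condor_q -format "ribgen: %d  " TotalRibGenTasks -format "render: %d\n" TotalRenderTasks

    # the second option amounts to reading the DAG file itself, e.g. with a
    # (hypothetical) node-naming convention:
    grep -c '^JOB  *ribgen_' render.dag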

> Also, re-running any node and its dependent nodes is fairly common
> and is often done many times during pipeline troubleshooting; we don't
> want to have to re-submit the entire DAG several times in separate
> runs, because there may be long-running nodes in the DAG we want to
> continue in parallel while we're working on other "broken" nodes.
>
Again, users usually don't care what happens under the hood if the GUI
shows what they are interested in. You can easily script the submission
of a DAG fragment, and it's up to the GUI to display it as part of the
original job. By adding custom attributes to these partial jobs you can
keep track of what's happening (a sketch follows below). Or, if you want
absolute control, you can make your own DAGMan replacement that handles
job submission and resubmission exactly the way you want.
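
A rough sketch of that idea (the fragment contents, the attribute name and the id value are all hypothetical):

    # fragment.dag lists only the node to re-run and its downstream nodes:
    #   JOB  Render  render.sub
    #   JOB  Comp    comp.sub
    #   PARENT Render CHILD Comp

    # tag the fragment so the GUI can group it with the original submission
    condor_submit_dag -no_submit fragment.dag
    echo '+OriginalDagId = "render.dag#42"' >> fragment.dag.condor.sub
    condor_submit fragment.dag.condor.sub

    # the GUI can pull the pieces back together by that attribute:
    condor_q -constraint 'OriginalDagId == "render.dag#42"'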

> It looks to me like we'd have a hard time getting Condor/Dagman to
> support these needs. I'd love any advice / comments you might have on
> this.
>
Meeting your needs is really just a matter of scripting around job
submission and job monitoring. The puzzling complexity of job scheduling
is what makes Condor a tough bird to handle (compared to off-the-shelf
render management software).
It's all just a matter of personal opinion, of course.

Cheers,
Szabolcs