
Re: [Condor-users] Dagman Newbie Questions



Hi Jeremy,

Advance warning: I'm a Condor newbie. I’ve been tasked with evaluating Condor as a queueing platform for CG & Animation production at our facility.

Please accept my most sincere condolences. :) While render management with Condor is certainly possible (and fun), it requires a different mindset compared to using Alfred, Deadline, Rush, etc.

If I understand what’s going on, the dagman job appears to be simply a job that submits other jobs, and the downstream jobs are not even submitted until their prerequisite jobs have run. Subsequent runs can only be run again with the .rescue file. Is this correct?

Yes. DAGMan is just a simple Condor job that submits child jobs when the dependencies allow and executes pre/post scripts. To re-run a branch or even a single job of the DAG you have to resubmit it, either by using the rescue DAG functionality or by scripting your own submit command.

This has consequences for us because in our business, deadlines are critical and resource utilization must be maximized. So progressive estimates of completion and remaining work are necessary. We need all nodes of the entire DAG to be present in the system to estimate resource use, even though many of the nodes may be blocked waiting for prerequisite nodes. Jobs that submit other jobs are a nasty surprise for our resource managers.

While I understand why you find it a limitation, it's actually a great thing for queue load and schedule balancing. When shot TDs submit a hundred render layers a night, each consisting of a few hundred jobs, having all of them in the queue at once could easily choke the scheduler.
If you need information about the whole DAG you still have multiple options:
- Add custom attributes to the DAGMan job that store how many tasks are in the whole DAG, by task type (ribgen / mi file gen / render / composite / whatever).
- Make your job progress window parse the DAG files and build the GUI based on that data instead of the queried or Quill DB data. It's a simple way to get the hierarchical information.
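
A rough sketch of both options in Python, reading only the JOB and PARENT ... CHILD lines of a DAG file. The node naming convention ("<type>_<id>"), the sample node names and the DagTotal_ attribute prefix are made up for illustration; use whatever your pipeline already has. If I remember right, condor_submit_dag has an -append option that can put such "+Attr = value" lines into the generated DAGMan submit file, but check the manual for your version.

```python
from collections import Counter, defaultdict

def parse_dag(text):
    """Return (jobs, children) from the text of a DAG file.

    jobs maps node name -> submit file, children maps node name -> set of
    child node names. Only JOB and PARENT ... CHILD lines are handled;
    everything else (SCRIPT, RETRY, VARS, comments) is skipped.
    """
    jobs = {}
    children = defaultdict(set)
    for raw in text.splitlines():
        words = raw.split()
        if not words or words[0].startswith("#"):
            continue
        keyword = words[0].upper()
        if keyword == "JOB" and len(words) >= 3:
            jobs[words[1]] = words[2]
        elif keyword == "PARENT":
            upper = [w.upper() for w in words]
            if "CHILD" not in upper:
                continue
            split = upper.index("CHILD")
            for parent in words[1:split]:
                children[parent].update(words[split + 1:])
    return jobs, dict(children)

def count_by_type(jobs):
    # Assumed (hypothetical) convention: node names look like "<type>_<id>".
    return Counter(name.split("_", 1)[0] for name in jobs)

def attribute_lines(counts):
    # "+Name = value" is the submit-file syntax for a custom job attribute.
    # The "DagTotal_" prefix is an invented name, not something Condor defines.
    return ["+DagTotal_%s = %d" % (t, n) for t, n in sorted(counts.items())]

SAMPLE = """\
JOB ribgen_0001 ribgen.sub
JOB render_0001 render.sub
JOB render_0002 render.sub
JOB comp_0001 comp.sub
PARENT ribgen_0001 CHILD render_0001 render_0002
PARENT render_0001 render_0002 CHILD comp_0001
"""

if __name__ == "__main__":
    jobs, children = parse_dag(SAMPLE)
    for line in attribute_lines(count_by_type(jobs)):
        print(line)
```

The children mapping is the hierarchy a progress window would draw; the per-type totals are what the resource managers can compare against the number of jobs actually in the queue.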

Also re-running of any node and its dependent nodes is fairly common and is often done many times during pipeline troubleshooting—we don’t want to have to re-submit the entire DAG several times in separate runs because there may be long-running nodes in the DAG we want to continue in parallel while we’re working on other “broken” nodes.

Again, users usually don't care what happens under the hood as long as the GUI shows what they are interested in. You can easily script the submission of a DAG fragment, and it's up to the GUI to display it as part of the original job. By adding custom attributes to these partial jobs you can keep track of what's happening. Or, if you want absolute control, you can write your own DAGMan replacement that does the job submission and resubmission the way you want.
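
To give an idea of what scripting a DAG fragment could look like, here is a minimal Python sketch. It assumes you already have the node-to-submit-file and parent-to-children mappings in memory (for example from reading the original DAG file); the node names and function names are invented for the example.

```python
from collections import deque

def downstream(root, children):
    """Return the set containing root and every node that depends on it."""
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in sorted(children.get(node, ())):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def fragment_dag(root, jobs, children):
    """Build DAG file text holding only root and its dependent nodes."""
    keep = downstream(root, children)
    lines = ["JOB %s %s" % (name, jobs[name]) for name in sorted(keep)]
    for parent in sorted(keep):
        # Edges coming from nodes outside the fragment are dropped on purpose.
        kids = sorted(c for c in children.get(parent, ()) if c in keep)
        if kids:
            lines.append("PARENT %s CHILD %s" % (parent, " ".join(kids)))
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # Hypothetical four-node DAG: ribgen -> two renders -> comp.
    jobs = {
        "ribgen": "ribgen.sub",
        "render_a": "render.sub",
        "render_b": "render.sub",
        "comp": "comp.sub",
    }
    children = {
        "ribgen": {"render_a", "render_b"},
        "render_a": {"comp"},
        "render_b": {"comp"},
    }
    print(fragment_dag("render_a", jobs, children), end="")
```

Note that the fragment for render_a contains comp but not render_b, so it only makes sense if the output of the nodes left out is already on disk. The resulting text can be written to a file and handed to condor_submit_dag like any other DAG.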

It looks to me like we’d have a hard time getting Condor/Dagman to support these needs. I’d love any advice / comments you might have on this.

Meeting your needs is just a matter of scripting the job execution and job monitoring scripts. The puzzling complexity of job scheduling is what makes Condor a tough bird to handle (compared to off-the-shelf render management software).
It's all just a matter of personal opinion, of course.

Cheers,
Szabolcs