
[Condor-users] Dagman Newbie Questions

Advance warning: I'm a Condor newbie.  I’ve been tasked with evaluating Condor as a queueing platform for CG & Animation production at our facility.


We've set up one manager and two submit/execute machines. I've been submitting a DAGMan job similar to the diamond.dag example in the documentation.


My Question:


It seems that condor_q only shows the DAG jobs that are running or that CAN run.  Jobs that are blocked because their prerequisite jobs are not done yet are not shown.  Also, if I keep the jobs in the system and attempt to re-run them (using condor_hold & condor_release), then the dependencies are not obeyed during this subsequent “run”.


If I understand what’s going on, the DAGMan job appears to be simply a job that submits other jobs, and the downstream jobs are not even submitted until their prerequisite jobs have run.  The DAG can only be run again by using the .rescue file.  Is this correct?
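
To make that understanding concrete, here is a toy model in Python of "a job that submits other jobs". It is only a sketch of the behaviour as described above, not DAGMan's actual code: the node names follow the diamond example (A, then B and C, then D), the `simulate` function is hypothetical, and it assumes every node succeeds and that each round finishes before the next begins.

```python
# Toy model only: a node is "submitted" (visible in the queue) once all
# of its parents have finished. Assumes every node succeeds.
DIAMOND = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}  # node -> parents


def simulate(parents):
    """Return, per scheduling round, the nodes that would be visible in the queue."""
    done = set()
    rounds = []
    while len(done) < len(parents):
        # Only nodes whose parents are all done get submitted this round.
        ready = sorted(
            node for node, deps in parents.items()
            if node not in done and all(dep in done for dep in deps)
        )
        if not ready:
            raise ValueError("dependency cycle: no node can be submitted")
        rounds.append(ready)
        done.update(ready)
    return rounds


ROUNDS = simulate(DIAMOND)
for number, visible in enumerate(ROUNDS, 1):
    print(f"round {number}: queue shows {visible}")
```

In this model, D never appears in the queue while A, B, or C is still pending, which matches what I see in condor_q.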


This has consequences for us because in our business, deadlines are critical and resource utilization must be maximized.  So progressive estimates of completion and remaining work are necessary.  We need all nodes of the entire DAG to be present in the system to estimate resource use, even though many of the nodes may be blocked waiting for prerequisite nodes.  Jobs that submit other jobs are a nasty surprise for our resource managers.  Also, re-running a node and its dependent nodes is fairly common and is often done many times during pipeline troubleshooting.  We don’t want to have to re-submit the entire DAG several times in separate runs, because there may be long-running nodes in the DAG that we want to continue in parallel while we’re working on other “broken” nodes.
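
As a rough illustration of the two things we need, the sketch below reads a DAG description directly, so that every node is known up front (including blocked ones), and works out which nodes would have to be re-run when one node is redone. It is a workaround sketch in Python, not a Condor feature: the `parse_dag` and `rerun_set` helpers are hypothetical, the submit file names are placeholders, and only the JOB and PARENT ... CHILD lines of the diamond example are handled.

```python
# Sketch only: parse the JOB and PARENT/CHILD lines of a DAG file so the
# whole graph is visible without waiting for nodes to be submitted.
DIAMOND_DAG = """\
JOB A A.condor
JOB B B.condor
JOB C C.condor
JOB D D.condor
PARENT A CHILD B C
PARENT B C CHILD D
"""


def parse_dag(text):
    """Return (node names in file order, mapping of node -> set of children)."""
    nodes, children = [], {}
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        keyword = words[0].upper()
        if keyword == "JOB":
            nodes.append(words[1])
            children.setdefault(words[1], set())
        elif keyword == "PARENT":
            split = [w.upper() for w in words].index("CHILD")
            for parent in words[1:split]:
                children.setdefault(parent, set()).update(words[split + 1:])
    return nodes, children


def rerun_set(node, children):
    """Return the node plus every node downstream of it, sorted by name."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return sorted(seen)


NODES, CHILDREN = parse_dag(DIAMOND_DAG)
print("all nodes:", NODES)
print("re-run after fixing B:", rerun_set("B", CHILDREN))
```

For the diamond, fixing B would mean re-running only B and D, while C could keep running untouched. That is the kind of partial re-run we would like the queue itself to support.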


It looks to me like we’d have a hard time getting Condor/Dagman to support these needs.  I’d love any advice / comments you might have on this.



Thanks very much,



Jeremy Yabrow

Production Engineering