Advance warning: I'm a Condor newbie. I’ve
been tasked with evaluating Condor as a queueing platform for CG &
Animation production at our facility.
We've set up one manager and two submit/execute
machines. I've been submitting a DAGMan job similar to the diamond.dag example
in the documentation.
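For reference, my test DAG has the same diamond shape as the manual's example. A sketch of the DAG input file (node and submit-file names are placeholders, not our actual jobs):

```
# diamond.dag -- four nodes in a diamond-shaped dependency
JOB A a.sub
JOB B b.sub
JOB C c.sub
JOB D d.sub
# B and C depend on A; D depends on both B and C
PARENT A CHILD B C
PARENT B C CHILD D
```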
It seems that condor_q only shows the DAG's jobs that are running
or that CAN run. Jobs that are blocked because their prerequisite jobs haven't
finished yet are not shown. Also, if I keep the jobs in the system and attempt to
re-run them (using condor_hold & condor_release), the dependencies are
not obeyed during this subsequent "run".
If I understand what's going on, the DAGMan job
appears to be simply a job that submits other jobs, and the downstream jobs are
not even submitted until their prerequisite jobs have run. A partially
completed DAG can only be re-run via the .rescue file. Is this correct?
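In case it helps, here is the re-run workflow as I understand it from the docs (assuming the diamond.dag example and that a rescue file is written when nodes fail; the rescue filename is my guess from the documentation):

```shell
# initial submission of the whole DAG
condor_submit_dag diamond.dag

# if some nodes fail, DAGMan writes a rescue DAG (e.g. diamond.dag.rescue);
# re-running means submitting that rescue file, which skips completed nodes
condor_submit_dag diamond.dag.rescue
```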
This has consequences for us because in our business, deadlines
are critical and resource utilization must be maximized. So progressive
estimates of completion and remaining work are necessary. We need all nodes
of the entire DAG to be present in the system to estimate resource use, even
though many of the nodes may be blocked waiting for prerequisite nodes. Jobs
that submit other jobs are a nasty surprise for our resource managers. Also,
re-running a node and its dependent nodes is fairly common and is often done
many times during pipeline troubleshooting. We don't want to have to
re-submit the entire DAG in several separate runs, because there may be
long-running nodes in the DAG we want to keep going in parallel while we're
working on other "broken" nodes.
It looks to me like we'd have a hard time getting
Condor/DAGMan to support these needs. I'd love any advice or
comments you might have on this.
Thanks very much,