Re: [Condor-users] new DAGman problems in Condor 6.8.4
- Date: Mon, 23 Jul 2007 09:50:31 -0500 (CDT)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
- Subject: Re: [Condor-users] new DAGman problems in Condor 6.8.4
On Mon, 23 Jul 2007, Smith, Ian wrote:
> We've been using Condor very successfully to run recursive DAGMan jobs for some time,
> but recently, since moving to Condor 6.8.4, I've noticed that rogue condor_dagman processes
> are appearing which seem to carry on indefinitely. These can quickly swamp the server if
> not cleared up.
>
> The *.dag files have the form:
>
> Job A M1.sub
> Script POST A ./resubmit.pl
>
> where the resubmit.pl script resubmits the Condor job if it hasn't
> converged and is within the cycle limit.
Hmm -- if I'm correctly understanding what you're doing, I'm not surprised
that you are having problems in DAGMan (although I don't fully understand
yet how you'd get the specific problems you've reported).
At any rate, your resubmit script will confuse DAGMan, because DAGMan
won't know anything about the resubmitted job -- it will think the node
is finished as soon as the POST script returns. So if you have some node
that relies on the completion of node A, that node may get executed before
node A actually finishes. If node A is your only node, DAGMan will exit
while the resubmitted job is still running.
I'd recommend doing something like this in your DAG:
JOB A <whatever.sub>
SCRIPT POST A test.pl
ABORT-DAG-ON A 10 RETURN 0
JOB B <whatever.sub>
PARENT A CHILD B
Note that I have just picked the '10' here as a number unlikely to be
returned normally by a program or script.
The test.pl script should return 10 (or whatever value you pick) if the job *has*
converged and does not need to be re-run. It should return 0 if the
job does need to be re-run. And, of course, 1 would indicate an error.
If test.pl returns 10, the DAG will be terminated (reporting success)
at that point.
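To make the return-value convention concrete, here is a minimal sketch of the decision logic such a POST script might use. It is written in Python rather than Perl purely for illustration. The file name residual.txt, the tolerance value, and the function name exit_code are all assumptions and not part of your setup; substitute whatever convergence check your job actually uses.

```python
CONVERGED = 10  # must match the value used in "ABORT-DAG-ON A 10 RETURN 0"
RERUN = 0       # node A succeeds, so DAGMan goes on to run child node B
ERROR = 1       # any other non-zero value marks the node as failed


def exit_code(residual_path, tolerance=1e-6):
    """Pick the POST script exit status from a (hypothetical) residual file.

    Assumes the job writes a single floating-point residual to
    residual_path; both the file and the tolerance are illustrative.
    """
    try:
        with open(residual_path) as f:
            residual = float(f.read().strip())
    except (OSError, ValueError):
        # Missing or unparsable output: report an error to DAGMan.
        return ERROR
    return CONVERGED if abs(residual) <= tolerance else RERUN
```

A real POST script would end by passing this value to the operating system, for example with sys.exit(exit_code("residual.txt")), so that DAGMan sees it as the script's return value.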
If you set things up like this, you can conditionally re-run your job
under the normal control of DAGMan.
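If you need more than one possible re-run (you mentioned a cycle limit), the same pattern can be unrolled into a chain of nodes. The following is only a sketch, assuming that every cycle uses the same submit file and the same POST script; the function name unrolled_dag and the node names A1, A2, ... are made up for illustration.

```python
def unrolled_dag(submit_file, post_script, max_cycles, abort_code=10):
    """Return DAG file text that runs the same job up to max_cycles times.

    Each node gets a POST script and an ABORT-DAG-ON line, so the DAG
    stops (reporting success) as soon as the script returns abort_code.
    """
    if max_cycles < 1:
        raise ValueError("max_cycles must be at least 1")
    names = [f"A{i}" for i in range(1, max_cycles + 1)]
    lines = []
    for name in names:
        lines.append(f"JOB {name} {submit_file}")
        lines.append(f"SCRIPT POST {name} {post_script}")
        lines.append(f"ABORT-DAG-ON {name} {abort_code} RETURN 0")
    # Chain the nodes so each cycle runs only after the previous one.
    for parent, child in zip(names, names[1:]):
        lines.append(f"PARENT {parent} CHILD {child}")
    return "\n".join(lines) + "\n"
```

For example, unrolled_dag("M1.sub", "./test.pl", 3) produces three JOB nodes linked A1 -> A2 -> A3. Note that if the last node's POST script still returns 0, the DAG simply finishes without the job having converged, so the cycle limit is enforced by the length of the chain.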
> Has anyone else seen this? Any suggestions as to the cause/solution would be
As far as I know, you're the only people who've tried submitting a Condor
job from a POST script like that, so as far as I know no one else has
run into this problem.
Please try the changes I've outlined, and if you still have problems, get
back to us.