
Re: [Condor-users] new DAGman problems in Condor 6.8.4



Thanks for the speedy reply on this. I'm not quite sure you understand
what I'm doing. The resubmit.pl script forks another process. The parent
exits immediately so that DAGMan sees the POST script as completed. The child
sleeps for 30s (to allow Condor to clear up the various output/log files), then
submits the same .dag file (hence the recursion) and exits. This has worked
fine for about two years with Condor 6.6.5. There has been an increase in the
number of jobs recently, so perhaps the system load is such that the previous
DAG job hasn't gone away before the next one is submitted. I'll try increasing
the sleep time to see if this improves things. Can you think of any other
reason?
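
The fork-and-detach behaviour described above can be sketched as follows.
This is an illustration in Python, not the actual resubmit.pl (which is Perl
and is not shown in this thread). The function names, the max_cycles
parameter, and the assumption that the child resubmits by running
condor_submit_dag on the same .dag file are all hypothetical.

```python
import os
import subprocess
import time


def should_resubmit(converged, cycle, max_cycles):
    """Resubmit only if the job has not converged and the cycle limit is not reached."""
    return (not converged) and cycle < max_cycles


def build_submit_command(dag_file):
    """Argv for resubmitting the DAG; assumes condor_submit_dag is on PATH."""
    return ["condor_submit_dag", dag_file]


def resubmit_in_background(dag_file, delay_seconds=30):
    """Fork: the parent returns at once, the child waits and then resubmits."""
    pid = os.fork()
    if pid != 0:
        # Parent: return immediately so DAGMan sees the POST script as done.
        return pid
    # Child: give Condor time to clear up output/log files, then resubmit.
    try:
        time.sleep(delay_seconds)
        status = subprocess.call(build_submit_command(dag_file))
    except Exception:
        status = 1
    # Leave without running the parent's cleanup handlers.
    os._exit(0 if status == 0 else 1)
```

A POST script built this way would call resubmit_in_background() only when
should_resubmit() is true and then exit with status 0. A longer delay_seconds
corresponds to the increased sleep time mentioned above.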

regards,

-ian.

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of R. Kent Wenger
> Sent: 23 July 2007 15:51
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] new DAGman problems in Condor 6.8.4
> 
> On Mon, 23 Jul 2007, Smith, Ian wrote:
> 
> > We've been using Condor very successfully to run recursive DAGMan jobs
> > for some time, but recently, since moving to Condor 6.8.4, I've noticed
> > that rogue condor_dagman processes are appearing which seem to carry
> > on indefinitely. These can quickly swamp the server if not cleared up.
> >
> > The *.dag files have the form:
> >
> > Job A M1.sub
> > Script POST A  ./resubmit.pl
> >
> > where the resubmit.pl script resubmits the Condor job if it hasn't 
> > converged and is within the cycle limit.
> 
> Hmm -- if I'm correctly understanding what you're doing, I'm 
> not surprised that you are having problems in DAGMan 
> (although I don't fully understand yet how you'd get the 
> specific problems you've reported).
> 
> At any rate, your resubmit script will confuse DAGMan, 
> because DAGMan won't know anything about the resubmitted job 
> -- it will think the node is finished as soon as the POST 
> script returns.  So if you have some node that relies on the 
> completion of node A, that node may get executed before node 
> A actually finishes.  If node A is your only node, DAGMan 
> will exit while the resubmitted job is still running.
> 
> I'd recommend doing something like this in your DAG:
> 
>      JOB A <whatever.sub>
>      SCRIPT POST A test.pl
>      ABORT-DAG-ON A 10 RETURN 0
> 
>      JOB B <whatever.sub>
> 
>      PARENT A CHILD B
> 
> Note that I have just picked the '10' here as a number 
> unlikely to be returned normally by a program or script.
> 
> The test.pl script should return 10 (or whatever value you pick) 
> if the job *has* converged and does not need to be re-run.  It 
> should return 0 if the job does need to be re-run.  And, of 
> course, 1 would indicate an error.
> 
> If test.pl returns 10, the DAG will be terminated (reporting 
> success) at that point.
> 
> If you set things up like this, you can conditionally re-run 
> your job under the normal control of DAGMan.
> 
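
The exit-code convention for test.pl described above can be sketched as
follows, in Python for illustration only. The constant and function names are
hypothetical, and how convergence is actually detected is left out because it
depends on the job.

```python
# Exit codes matching the convention described above.
CONVERGED = 10  # the value named in "ABORT-DAG-ON A 10 RETURN 0"
RERUN = 0       # POST script succeeds, so DAGMan goes on to node B
ERROR = 1       # ordinary failure


def post_script_status(converged, failed=False):
    """Map the outcome of the convergence check to a POST script exit code."""
    if failed:
        return ERROR
    return CONVERGED if converged else RERUN
```

A real script would finish by passing this value to sys.exit(), so that
DAGMan sees 10 and ends the DAG (reporting success) once the job has
converged.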
> > Has anyone else seen this? Any suggestions as to the cause/solution
> > would be most appreciated.
> 
> As far as I know, you're the only people who've tried 
> submitting a Condor job from a POST script like that, so as 
> far as I know no one else has run into this problem.
> 
> Please try the changes I've outlined, and if you still have 
> problems, get back to us.
> 
> Kent Wenger
> Condor Team