
Re: [Condor-users] new DAGman problems in Condor 6.8.4



Thanks for this - I'll give it a go. Is it possible to label the
nodes with numbers? That would make generating the DAG files
automatically much easier.
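
Something along these lines is what I have in mind - just a rough
sketch (the submit file and POST script names are placeholders, and
the exit code 3 is an arbitrary choice for the "job finally succeeded"
signal), writing out a chain of numbered nodes:

#!/usr/bin/perl
# Rough sketch: generate an N-node retry DAG with numbered node names.
use strict;
use warnings;

my $n = 100;    # more than the maximum number of retries we expect to need

open(my $dag, '>', 'retry.dag') or die "cannot write retry.dag: $!\n";

for my $i (1 .. $n) {
    my $name = sprintf("node%03d", $i);    # node001, node002, ...
    print $dag "JOB $name job.submit\n";
    print $dag "SCRIPT POST $name post_check.pl\n";
    # Every node but the last aborts the whole DAG (successfully)
    # once the POST script reports that the job has finally succeeded.
    print $dag "ABORT-DAG-ON $name 3 RETURN 0\n" if $i < $n;
}

# Chain the nodes so they run one after another.
for my $i (1 .. $n - 1) {
    printf $dag "PARENT node%03d CHILD node%03d\n", $i, $i + 1;
}

close($dag) or die "close failed: $!\n";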

-ian. 

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of R. Kent Wenger
> Sent: 23 July 2007 16:59
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] new DAGman problems in Condor 6.8.4
> 
> On Mon, 23 Jul 2007, Smith, Ian wrote:
> 
> > Thanks for the speedy reply on this. I'm not quite sure if you
> > understand what I'm doing. The resubmit.pl script forks another
> > process. The parent exits immediately so that DAGMan sees the POST
> > script as completed. The child sleeps for 30s (to allow Condor to
> > clear up the various output/log files), then submits the same .dag
> > file (hence the recursion) and exits. This has worked fine for about
> > two years with Condor 6.6.5. There has been an increase in the number
> > of jobs recently, so perhaps the system load is such that the
> > previous DAG job hasn't gone away before the next one is submitted.
> > I'll try increasing the sleep time to see if this improves things.
> > Can you think of any other reason?
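> >
> > In case it helps, the pattern is roughly this - a simplified sketch
> > rather than the actual resubmit.pl (cleanup and error handling are
> > stripped out):
> >
> > #!/usr/bin/perl
> > # Sketch of the fork-and-resubmit POST script pattern described above.
> > use strict;
> > use warnings;
> >
> > my $dagfile = shift @ARGV or die "usage: resubmit.pl <dagfile>\n";
> >
> > my $pid = fork();
> > die "fork failed: $!\n" unless defined $pid;
> >
> > if ($pid == 0) {
> >     # Child: give Condor time to clean up the output/log files,
> >     # then resubmit the same DAG and exit.
> >     sleep 30;
> >     exec("condor_submit_dag", $dagfile)
> >         or die "exec condor_submit_dag failed: $!\n";
> > }
> >
> > # Parent: exit immediately so DAGMan sees the POST script as completed.
> > exit 0;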
> 
> Well, it may be that something has changed in daemoncore 
> between 6.6.5 and 6.8.4 that makes it more sensitive to the 
> setup you have.
> 
> You're right, I didn't understand that you were resubmitting 
> the whole DAG, as opposed to resubmitting the individual job.
> 
> I think this error message gives a pretty good clue:
> 
>      ERROR "Create_Process: More ancestor environment IDs found than
>      PIDENVID_MAX which is currently 32. Programmer Error." 
> at line 6466 in
>      file daemon_core.C
> 
> Create_Process() is called by DAGMan to fork off the POST 
> script.  I'll bet that you're hitting this if you recurse 
> more than 32 times.  It wouldn't surprise me if something 
> there changed since 6.6.5.  (I don't do a lot of work on 
> daemoncore, so I'm not fully up-to-date on that.)
> 
> At any rate, this doesn't sound like something that sleeping 
> longer will fix.
> 
> PIDENVID_MAX is a #define in the code, so you have no way to 
> change that.
> 
> 
> Here's what I'd recommend:  generate a DAG with something 
> like 100 nodes (some number that's more than the max number 
> of retries you'll need), each except the last with the 
> ABORT-DAG-ON setting as I mentioned in my previous email.  
> The job for each node should just be the job in your existing 
> DAG, not the entire DAG.  If you do this, I think you should 
> get the functionality you need, without actually resorting to 
> recursion.
> The POST script on your last node should return 1 if you still need
> more tries; that way you'll know you ran out of retries.
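> 
> A rough sketch of the shape of that DAG (node names, submit/script
> file names, and the exit value 3 are placeholders - the idea is that
> the POST script exits 3 once the job has finally succeeded, which
> aborts the DAG with an overall return value of 0; otherwise it exits
> 0 and the next node gives the job another try):
> 
>     JOB node001 job.submit
>     SCRIPT POST node001 post_check.pl
>     ABORT-DAG-ON node001 3 RETURN 0
> 
>     JOB node002 job.submit
>     SCRIPT POST node002 post_check.pl
>     ABORT-DAG-ON node002 3 RETURN 0
> 
>     # ... and so on, up to the last node ...
> 
>     JOB node100 job.submit
>     # Last node: its POST script returns 1 if yet another try is
>     # needed, so the DAG fails and you know you ran out of retries.
>     SCRIPT POST node100 post_check_last.pl
> 
>     PARENT node001 CHILD node002
>     PARENT node002 CHILD node003
>     # ... through to node100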
> 
> (I know that it would be much nicer to actually be able to 
> loop, but that functionality is not in DAGMan so far.)
> 
> Kent Wenger
> Condor Team