Re: [Condor-users] new DAGman problems in Condor 6.8.4
- Date: Tue, 24 Jul 2007 10:01:40 +0100
- From: "Smith, Ian" <I.C.Smith@xxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] new DAGman problems in Condor 6.8.4
Thanks for this, I'll give it a go. Is it possible to label
the nodes with numbers? This would make creating the DAG
files automatically much easier.
-ian.
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of R. Kent Wenger
> Sent: 23 July 2007 16:59
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] new DAGman problems in Condor 6.8.4
>
> On Mon, 23 Jul 2007, Smith, Ian wrote:
>
> > Thanks for the speedy reply on this. I'm not quite sure if you
> > understand what I'm doing. The resubmit.pl script forks another
> > process. The parent exits immediately so that DAGMan sees the POST
> > script as completed. The child sleeps for 30s (to allow Condor to
> > clear up the various output/log files), then submits the same .dag
> > file (hence the recursion) and exits. This has worked fine for about
> > two years with Condor 6.6.5. There has been an increase in the number
> > of jobs recently, so perhaps the system load is such that the previous
> > DAG job hasn't gone away before the next is submitted. I'll try
> > increasing the sleep time to see if this improves things. Could you
> > think of any other reason?
>
> Well, it may be that something has changed in daemoncore
> between 6.6.5 and 6.8.4 that makes it more sensitive to the
> setup you have.
>
> You're right, I didn't understand that you were resubmitting
> the whole DAG, as opposed to resubmitting the individual job.
>
> I think this error message gives a pretty good clue:
>
> ERROR "Create_Process: More ancestor environment IDs found than
> PIDENVID_MAX which is currently 32. Programmer Error."
> at line 6466 in file daemon_core.C
>
> Create_Process() is called by DAGMan to fork off the POST
> script. I'll bet that you're hitting this if you recurse
> more than 32 times. It wouldn't surprise me if something
> there changed since 6.6.5. (I don't do a lot of work on
> daemoncore, so I'm not fully up-to-date on that.)
>
> At any rate, this doesn't sound like something that sleeping
> longer will fix.
>
> PIDENVID_MAX is a #define in the code, so you have no way to
> change that.
>
>
> Here's what I'd recommend: generate a DAG with something
> like 100 nodes (some number that's more than the max number
> of retries you'll need), each except the last with the
> ABORT-DAG-ON setting as I mentioned in my previous email.
> The job for each node should just be the job in your existing
> DAG, not the entire DAG. If you do this, I think you should
> get the functionality you need, without actually resorting to
> recursion.
> The POST script on your last node should return 1 if you need
> more tries, so then you'll know you ran out of retries.
>
> (I know that it would be much nicer to actually be able to
> loop, but that functionality is not in DAGMan so far.)
>
> Kent Wenger
> Condor Team
>
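
The chain of nodes Kent recommends could be generated with a short script. The following is only a sketch under several assumptions: the function name build_retry_dag, the node prefix "try", the file names job.sub and check.pl, and the abort value 2 are all hypothetical. It also assumes the POST script exits with the abort value once no further tries are needed and with 0 otherwise; the value suggested in the earlier message is not shown in this thread. Node names here are a prefix plus a zero-padded number, which sidesteps the question of whether purely numeric names are accepted.

```python
def build_retry_dag(submit_file, post_script, max_tries=100, abort_value=2):
    """Return DAG-file text for a linear chain of numbered nodes.

    Every node runs the same submit file and POST script. All nodes
    except the last get an ABORT-DAG-ON line, so a POST script exiting
    with abort_value (hypothetical choice) stops the DAG at that node.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")

    # Zero-pad the counter so the names sort in run order (try001, try002, ...).
    width = len(str(max_tries))
    names = [f"try{i:0{width}d}" for i in range(1, max_tries + 1)]

    lines = []
    for index, name in enumerate(names):
        lines.append(f"JOB {name} {submit_file}")
        lines.append(f"SCRIPT POST {name} {post_script}")
        # The last node has no ABORT-DAG-ON: its POST script returning 1
        # simply fails the DAG, signalling that the tries ran out.
        if index < len(names) - 1:
            lines.append(f"ABORT-DAG-ON {name} {abort_value}")

    # Link each node to the next so the tries run strictly one after another.
    for parent, child in zip(names, names[1:]):
        lines.append(f"PARENT {parent} CHILD {child}")

    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a three-node chain for inspection.
    print(build_retry_dag("job.sub", "check.pl", max_tries=3), end="")
```

Writing the returned text to a .dag file and submitting it in the usual way would give up to max_tries attempts without any recursive resubmission.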