
Re: [Condor-users] new DAGman problems in Condor 6.8.4



On Mon, 23 Jul 2007, Smith, Ian wrote:

Thanks for the speedy reply on this. I'm not quite sure if you understand
what I'm doing. The resubmit.pl script forks another process. The parent
exits immediately so that DAGMan sees the POST script as completed. The child
sleeps for 30s (to allow Condor to clear up the various output/log files) then
submits the same .dag file (hence the recursion) and exits. This has worked
fine for about two years with Condor 6.6.5. There has been an increase in the
number of jobs recently, so perhaps the system load is such that the previous
DAG job hasn't gone away before the next was submitted. I'll try increasing
the sleep time to see if this improves things. Can you think of any other
reason?

Well, it may be that something has changed in daemoncore between 6.6.5
and 6.8.4 that makes it more sensitive to the setup you have.

You're right, I didn't understand that you were resubmitting the whole
DAG, as opposed to resubmitting the individual job.

I think this error message gives a pretty good clue:

    ERROR "Create_Process: More ancestor environment IDs found than
    PIDENVID_MAX which is currently 32. Programmer Error." at line 6466 in
    file daemon_core.C

Create_Process() is called by DAGMan to fork off the POST script.  I'll
bet that you're hitting this if you recurse more than 32 times.  It
wouldn't surprise me if something there changed since 6.6.5.  (I don't
do a lot of work on daemoncore, so I'm not fully up-to-date on that.)

At any rate, this doesn't sound like something that sleeping longer will
fix.

PIDENVID_MAX is a #define in the code, so you can't change it through
configuration.


Here's what I'd recommend:  generate a DAG with something like 100 nodes
(some number that's more than the max number of retries you'll need),
each except the last with the ABORT-DAG-ON setting as I mentioned in my
previous email.  The job for each node should just be the job in your
existing DAG, not the entire DAG.  If you do this, I think you should get
the functionality you need, without actually resorting to recursion.
The POST script on your last node should return 1 if you need more tries,
so then you'll know you ran out of retries.
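
A DAG like that is tedious to write by hand, so here is a rough sketch of a
generator for it. Treat it as an illustration only: the node names (try0,
try1, ...), the file names job.sub and check_result.pl, and the exit value 2
are all made up for the example. It assumes the POST script returns that
value once no more tries are needed, which is what ABORT-DAG-ON would key on;
use whatever value your own POST script returns.

```python
def build_retry_dag(submit_file, post_script, max_tries=100, done_code=2):
    """Return the text of a linear DAG with max_tries copies of one job.

    Every node except the last gets ABORT-DAG-ON, so the DAG stops as soon
    as a POST script returns done_code (an assumed "finished" value).
    """
    lines = []
    for i in range(max_tries):
        node = f"try{i}"  # hypothetical node naming scheme
        lines.append(f"JOB {node} {submit_file}")
        lines.append(f"SCRIPT POST {node} {post_script}")
        if i < max_tries - 1:
            # The last node has no ABORT-DAG-ON; its POST script reports
            # failure instead if more tries would still be needed.
            lines.append(f"ABORT-DAG-ON {node} {done_code}")
        if i > 0:
            # Chain the nodes so each try runs only after the previous one.
            lines.append(f"PARENT try{i - 1} CHILD {node}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the generated DAG to a file that can be given to condor_submit_dag.
    with open("retry.dag", "w") as f:
        f.write(build_retry_dag("job.sub", "check_result.pl"))
```

The resulting retry.dag is a single chain of nodes, so only one DAGMan
instance ever runs and nothing is resubmitted from inside a POST script.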

(I know that it would be much nicer to actually be able to loop, but that
functionality is not in DAGMan so far.)

Kent Wenger
Condor Team