Re: [Condor-users] Windows DAGMAN fails with "failed while reading from pipe" message



On Mon, 10 Sep 2007, Robert Mortensen wrote:

Daily, we run many thousands of jobs through several large "dag within a dag" dagman runs. Occasionally, a few jobs (2 to 10 out of the thousands) log "failed while reading from pipe" messages in the dagman.out file after DAGMAN tries to submit the job. DAGMAN appears to try to resubmit several times, but each attempt fails the same way until the submit limit is reached and the DAGMAN job fails. We've scoured the other condor logs and can find nothing that indicates any failure in the collector, negotiator, etc.

...

Questions:
- Has anyone else experienced this and found a solution?
- Is there something inherently wrong with submitting DAGMAN jobs on the condor master?

There shouldn't be anything special about submitting DAGMan jobs vs. any
other Condor jobs. In general, though, it's not a great idea to submit a
lot of jobs on your master machine.

How necessary is it for you to run the DAG on the master? Avoiding that is probably the best way of dealing with this.

If you really *have* to run your DAGs on the master, you might want to
upgrade to the 6.9 series -- there are some scalability improvements that
might help you out.

- Is there a way to catch the failure and have the testcase.dag restarted or resubmitted?

Well, one obvious way to do that would be to make a top-level single-node
DAG that just submits your "real" top-level DAG, and has retries for its
one node set to some non-zero value.
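
As a minimal sketch of such a wrapper: this assumes you first generate
the submit file for the inner DAG with "condor_submit_dag -no_submit
testcase.dag", which should write testcase.dag.condor.sub. The file name
wrapper.dag, the node name Inner, and the retry count of 3 are just
placeholders for illustration:

    # wrapper.dag (hypothetical name): one node that runs the real DAG
    JOB Inner testcase.dag.condor.sub
    # Re-run the node up to 3 times if the inner DAGMan job fails
    RETRY Inner 3

You would then run "condor_submit_dag wrapper.dag" instead of submitting
testcase.dag directly.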

Note that you can also change the number of times DAGMan will re-try the
submit by setting the DAGMAN_MAX_SUBMIT_ATTEMPTS configuration macro.
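
As a rough illustration, the setting would go in the Condor configuration
on the submit machine, something like the following (the value 10 is
arbitrary, not a recommendation; check the manual for the default and
the allowed range):

    # In condor_config or your local config file
    DAGMAN_MAX_SUBMIT_ATTEMPTS = 10

I'd expect DAGMan to pick this up when it starts, so the change should
only affect DAGs submitted after you make it.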

Of course, those are only workarounds for the basic problem, and if your
condor_submits are failing, you'll be spending a lot of time doing retries.

Kent Wenger
Condor Team