Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Windows DAGMAN fails with "failed while reading from pipe" message

Date: Mon, 10 Sep 2007 14:13:52 -0500 (CDT)
From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
Subject: Re: [Condor-users] Windows DAGMAN fails with "failed while reading from pipe" message

On Mon, 10 Sep 2007, Robert Mortensen wrote:

Daily, we run many thousands of jobs through several large "dag within a dag" dagman runs. Occasionally, a few jobs (i.e. 2 to 10 out of the thousands) will log "failed while reading from pipe" messages in the dagman.out file after trying to submit the JOB. DAGMAN appears to try to resubmit several times, but each one fails the same way until the submit limit is reached and the DAGMAN job fails. We've scoured the other condor logs and can find nothing that indicates any failure in the collector, negotiator, etc.
...

Questions:
- Has anyone else experienced this and have a solution?
- Is there something inherently wrong with submitting DAGMAN jobs on the condor master?


There shouldn't be anything special about submitting DAGMan jobs vs. any
other Condor jobs. In general, though, it's not a great idea to submit a
lot of jobs on your master machine.

How necessary is it for you to run the DAG on the master? Avoiding that
is probably the best way of dealing with this.

If you really *have* to run your DAGs on the master, you might want to
upgrade to the 6.9 series -- there are some scalability improvements that
might help you out.

- Is there a way to catch the failure and have the testcase.dag restarted or resubmitted?


Well, one obvious way to do that would be to make a top-level single-node
DAG that just submits your "real" top-level DAG, and has retries for its
one node set to some non-zero value.
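
As a rough sketch of that wrapper, assuming the real DAG is the
testcase.dag from the question and using wrapper.dag as a made-up file
name, the top-level DAG could look something like the following. It
assumes you have first run "condor_submit_dag -no_submit testcase.dag" so
that the submit file testcase.dag.condor.sub exists; the retry count of 3
is only an illustrative value.

```
# wrapper.dag (hypothetical name): a single node that runs the real DAG
JOB testcase testcase.dag.condor.sub
# Re-run the node up to 3 more times if the inner DAGMan job fails
RETRY testcase 3
```

You would then submit the wrapper with "condor_submit_dag wrapper.dag"
instead of submitting testcase.dag directly.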

Note that you can also change the number of times DAGMan will re-try the
submit by setting the DAGMAN_MAX_SUBMIT_ATTEMPTS configuration macro.
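
For example, a line like the following in the Condor configuration on the
submit machine would raise the limit. The value 10 is only an
illustration; the default and the allowed range depend on your version,
so check the manual before picking a number.

```
# Number of condor_submit attempts DAGMan makes for a node before giving up
DAGMAN_MAX_SUBMIT_ATTEMPTS = 10
```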

Of course, those are only workarounds for the basic problem, and if your
condor_submits are failing, you'll be spending a lot of time doing
retries.


Kent Wenger
Condor Team

References:
- [Condor-users] Windows DAGMAN fails with "failed while reading from pipe" message
  - From: Robert Mortensen
