
[Condor-users] Windows DAGMAN fails with "failed while reading from pipe" message



Every day we run many thousands of jobs through several large "dag within a dag" DAGMAN runs. Occasionally a few jobs (2 to 10 out of the thousands) log "failed while reading from pipe" messages in the dagman.out file after DAGMAN tries to submit the JOB. DAGMAN appears to retry the submission several times, but each attempt fails the same way until the submit limit is reached and the DAGMAN job fails. We've scoured the other Condor logs and can find nothing that indicates any failure in the collector, negotiator, etc.

The dag submissions and the condor master are on the same Windows 2000 Server machine. When the dags are submitted from a different Windows XP machine to the same condor master, even at the same time as the ones giving us problems, things seem to be OK (at least we can't recall seeing this problem with this scenario). Resubmitting the same .dag and .sub files in the same way at another time works just fine. All files are local to the submitting machine.

We have a master DAG with a couple thousand dag JOBS, that is, master.dag contains:
	JOB dir1 dir1/testcase.dag.condor.sub
	JOB dir2 dir2/testcase.dag.condor.sub
	JOB dir3 dir3/testcase.dag.condor.sub
		... and so on ...

The testcase.dag files contain:
	JOB rmt_dir1 dir1/testcase.sub
	SCRIPT PRE rmt_dir1 prepare.bat <args....>
	SCRIPT POST rmt_dir1 process.bat <args...>
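
To make the layout above concrete, here is a minimal Python sketch that produces the JOB lines of a master.dag in the form shown. It is only an illustration: the function name master_dag_lines and the idea of building the file from a list of directory names are assumptions, not a description of how our files are actually produced.

```python
def master_dag_lines(directories):
    """Build one JOB line per directory, in the master.dag form shown above.

    Each node is named after its directory and points at that directory's
    testcase.dag.condor.sub file. (Hypothetical helper, for illustration.)
    """
    return [
        "JOB {0} {0}/testcase.dag.condor.sub".format(name)
        for name in directories
    ]


if __name__ == "__main__":
    # Print the first three lines of the example master.dag.
    for line in master_dag_lines(["dir1", "dir2", "dir3"]):
        print(line)
```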

Questions:
- Has anyone else experienced this and have a solution?
- Is there something inherently wrong with submitting DAGMAN jobs on the condor master?
- Is there a way to catch the failure and have the testcase.dag restarted or resubmitted?
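
On the last question, one stopgap is to find the affected test cases after the fact. The sketch below is a rough Python illustration under stated assumptions: it only does a plain substring match on the message text quoted above, it assumes each test case directory holds a log named testcase.dag.dagman.out, and the function names are made up for this example. It does not resubmit anything; it only reports which directories logged the message.

```python
import os

# Message text as it appears in the dagman.out files described above.
MESSAGE = "failed while reading from pipe"


def count_pipe_failures(log_text):
    """Count the lines of a dagman.out log that contain the message."""
    return sum(1 for line in log_text.splitlines() if MESSAGE in line)


def scan_for_pipe_failures(root, log_name="testcase.dag.dagman.out"):
    """Return {directory: count} for each log under root with the message.

    Assumes one log named log_name per test case directory (hypothetical
    layout based on the file names mentioned in this post).
    """
    hits = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if log_name in filenames:
            path = os.path.join(dirpath, log_name)
            # Tolerate odd bytes in the log rather than aborting the scan.
            with open(path, "r", errors="replace") as handle:
                count = count_pipe_failures(handle.read())
            if count:
                hits[dirpath] = count
    return hits


if __name__ == "__main__":
    # Scan the current directory tree and list affected test cases.
    for directory, count in sorted(scan_for_pipe_failures(".").items()):
        print("{0}: {1} occurrence(s)".format(directory, count))
```

The directories it reports could then be resubmitted by hand or from a wrapper script, since resubmitting the same files at another time has worked for us.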

For anyone interested in delving further into this I've attached examples of testcase.sub, testcase.dag and all the outputs (including testcase.dag.dagman.out) for a run that failed (I've not included the master.dag).

Thanks,
Bob Mortensen

Attachment: testcase.dag.lib.stdout
Description: application/applefile

Attachment: testcase.dag
Description: Binary data

Attachment: testcase.sub
Description: Binary data

Attachment: testcase.dag.condor.sub
Description: Binary data

Attachment: testcase.dag.dagman.out
Description: Binary data

Attachment: testcase.dag.lib.stderr
Description: application/applefile