
Re: [Condor-users] Inspiral dags die in morgane for unexplained reasons



On Tue, 1 Apr 2008, Lucia Santamaria wrote:

I am trying to run the inspiral analysis in morgane at AEI. The inspiral
python script ihope creates a dag that is condor_submit_dag'ed to the
cluster and triggers the inspiral analysis end to end (segFind, dataFind,
tmpltbank, inspiral, plots, etc).

...

The dag is submitted with
$ condor_submit_dag ihope.dag
after setting
$ export _CONDOR_DAGMAN_LOG_ON_NFS_IS_ERROR=FALSE

Does this mean that the node job user logs *are* on NFS? If that's the
case, is it possible to move them to a local file system? I'm not *sure*
if that's the cause of the problem, but it's one of the first things we
look at when there is unexplained DAGMan behavior.
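
For example, if the submit files ihope generates point the node job user
logs at a path under your NFS-mounted home directory, something like this
(the paths below are just guesses at what yours might look like):

   log = /home/lucia/playground_20080314/logs/ihope.log

you could change them to point at local disk on the submit machine, e.g.:

   log = /usr1/lucia/ihope_logs/ihope.log

(assuming deepthought has a local scratch area like /usr1). The important
thing is that the file named by "log" in each node job's submit file lives
on a local file system, not on NFS.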

The dag runs lalapps_tmpltbank, -inspiral, -thinca, -coire, ... jobs for
about half a day, and then I typically get a Condor email like this:

(1st condor error email)-------
This is an automated email from the Condor system
on machine "deepthought.merlin2.aei.mpg.de".  Do not reply.

Your condor job exited with status 1.

...

-------------------------
lucia@deepthought:~/playground_20080314$ condor_q lucia

- Submitter: deepthought.merlin2.aei.mpg.de : <10.100.200.92:60979> : deepthought.merlin2.aei.mpg.de
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
100228.0   lucia           4/1  00:07   0+16:06:14 R  0   7.3 condor_dagman -f-
100229.0   lucia           4/1  00:07   0+16:04:43 R  0   7.3 condor_dagman -f-
100230.0   lucia           4/1  00:08   0+16:04:43 R  0   7.3 condor_dagman -f-
100231.0   lucia           4/1  00:08   0+16:03:13 R  0   7.3 condor_dagman -f-
106573.0   lucia           4/1  15:50   0+00:22:30 R  0   317.4 lalapps_tmpltbank
106574.0   lucia           4/1  15:51   0+00:22:09 R  0   317.4 lalapps_tmpltbank
106576.0   lucia           4/1  15:51   0+00:21:47 R  0   317.4 lalapps_tmpltbank
(... etc, more jobs here)
----------------------

Hmm -- are you running more than one instance of the same DAG at a time?
That will almost certainly cause problems. Also, even if you are not
running multiple instances of the same DAG, you could still run into
problems if you're running several DAGs whose node jobs write to the same
log file.
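
If it does turn out that several DAGs share a log file, one way around it
(just a sketch -- the file names here are made up) is to give each run its
own log in the node job submit files, e.g.:

   # node jobs belonging to the first DAG
   log = /usr1/lucia/logs/ihope_run1.log

   # node jobs belonging to the second DAG
   log = /usr1/lucia/logs/ihope_run2.log

so that each DAGMan instance only sees events for its own node jobs.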

Could you send the dagman.out file corresponding to this run? That is generally the first place to look when DAGMan has a problem.

If you can also send the DAG file itself, and the entire user log file for the node jobs, that would help diagnose things.
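
If you used the default names (i.e., you didn't pass condor_submit_dag any
special options for output files), everything should be sitting in the
directory you submitted from, named after the DAG file:

   ihope.dag             (the DAG file itself)
   ihope.dag.dagman.out  (DAGMan's debug log -- the one I'd like to see)
   ihope.dag.dagman.log  (the user log for the DAGMan job itself)

plus whatever file(s) the node job submit files specify with "log = ...".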

Any insight into what might be causing this problem is much appreciated.

If I can get a look at the dagman.out file, that should help a lot.

Kent Wenger
Condor Team