
[Condor-users] Inspiral dags die on morgane for unexplained reasons




Dear condor experts,

I am trying to run the inspiral analysis on morgane at the AEI. The inspiral Python script ihope creates a dag that is condor_submit_dag'ed to the cluster and runs the inspiral analysis end to end (segFind, dataFind, tmpltbank, inspiral, plots, etc.).

The cluster is running the Condor 7.0.0 RHEL3 binaries and the OS is 64-bit Debian Etch.
(We are aware of the special Debian Etch build of Condor, by the way.)

The behaviour described below has been observed in _both_ the standard and vanilla universes:

The dag is submitted with
$ condor_submit_dag ihope.dag
after setting
$ export _CONDOR_DAGMAN_LOG_ON_NFS_IS_ERROR=FALSE
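
(For reference, that environment variable just overrides the corresponding condor_config knob for the submitting shell; the equivalent persistent setting, shown here only as a sketch and not copied from our actual configuration, would be something like:)

-------------------------------
# condor_config / condor_config.local on the submit node -- sketch only,
# equivalent to exporting _CONDOR_DAGMAN_LOG_ON_NFS_IS_ERROR=FALSE
DAGMAN_LOG_ON_NFS_IS_ERROR = False
-------------------------------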

The dag runs lalapps_tmpltbank, -inspiral, -thinca, -coire, ... jobs for roughly half a day, and then I typically get a Condor email like this:

(1st condor error email)-------
This is an automated email from the Condor system
on machine "deepthought.merlin2.aei.mpg.de".  Do not reply.

Your condor job exited with status 1.

Job: /usr/bin/condor_dagman -f -l . -Debug 3 -Lockfile
bbhinj/inspiral_hipe_bbhinj.BBHINJ.dag.lock -Condorlog
/home/lucia/playground_20080314/logs/tmp8_5prr -Dag
bbhinj/inspiral_hipe_bbhinj.BBHINJ.dag -Rescue
/.auto/home/lucia/playground_20080314/857232370-859651570/inspiral_hipe_bbhinj.BBHINJ.dag.rescue -UseDagDir
------------------------------

The log file mentioned there looks as follows:

1. It starts like this (normal job submission):

----------------------
000 (100391.000.000) 04/01 01:58:11 Job submitted from host: <10.100.200.92:60979>
    DAG Node: 5fde592a41be3daedacc86750956a4d2
...
001 (100391.000.000) 04/01 01:58:18 Job executing on host: <10.100.201.102:55375>
...
006 (100391.000.000) 04/01 01:58:26 Image size of job updated: 417624
(...)
-----------------------

2. It continues like this (normal termination of some jobs):

-----------------------
...
005 (100391.000.000) 04/01 02:43:18 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:44:09, Sys 0 00:00:03  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:44:09, Sys 0 00:00:03  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
001 (100743.000.000) 04/01 02:43:22 Job executing on host: <10.100.206.114:48666>
(...)
--------------------------

3. At some point it looks like this (jobs evicted for no apparent reason).
Note the 8 seconds between the submission and the abortion of the first removed job, 106487!

-------------------------
..
000 (106487.000.000) 04/01 14:55:12 Job submitted from host: <10.100.200.92:60979>
    DAG Node: df12fc4b42b0acd77cd85dff30a864bb
...
009 (106487.000.000) 04/01 14:55:20 Job was aborted by the user.
        via condor_rm (by user lucia)
...
004 (106403.000.000) 04/01 14:55:20 Job was evicted.
        (0) Job was not checkpointed.
                Usr 0 00:10:20, Sys 0 00:00:03  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
(...)
-------------------------

The first job marked for abortion (106487) corresponds to the following DAG node:

(106487 job - 1st aborted)-----
-------------------------------
JOB df12fc4b42b0acd77cd85dff30a864bb inspiral_hipe_bbhinj.tmpltbank_H2.BBHINJ.sub DIR bbhinj VARS df12fc4b42b0acd77cd85dff30a864bb macroframecache="cache/H-H2_RDS_C03_L2-858079508-858091238.cache" macrogpsendtime="858085404" macrochannelname="H2:LSC-STRAIN" macrogpsstarttime="858083356"
PRIORITY df12fc4b42b0acd77cd85dff30a864bb 1
CATEGORY df12fc4b42b0acd77cd85dff30a864bb tmpltbank
------------------------------

So it is a template bank job that, when run by hand, completes and returns _no_ error.
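
(To be explicit, by "run by hand" I mean running the same executable with the arguments that the VARS above get substituted into, roughly as sketched below; the option names are written from memory and may not match the actual submit file, inspiral_hipe_bbhinj.tmpltbank_H2.BBHINJ.sub, exactly:)

-------------------------------
$ cd bbhinj
$ lalapps_tmpltbank \
    --frame-cache cache/H-H2_RDS_C03_L2-858079508-858091238.cache \
    --gps-start-time 858083356 \
    --gps-end-time 858085404 \
    --channel-name H2:LSC-STRAIN \
    ...        # plus the fixed arguments from the .sub file
               # (option names sketched from memory, not copied from it)
-------------------------------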

Notice that exactly when the first job gets aborted (106487), another job gets evicted (106403). That one is _also_ a tmpltbank job:

(106403 job - 1st evicted)-------
---------------------------------
JOB 2ac80c6519afa457e303de1e743e942c inspiral_hipe_bbhinj.tmpltbank_H2.BBHINJ.sub DIR bbhinj VARS 2ac80c6519afa457e303de1e743e942c macroframecache="cache/H-H2_RDS_C03_L2-858037917-858054493.cache" macrogpsendtime="858045733" macrochannelname="H2:LSC-STRAIN" macrogpsstarttime="858043685"
PRIORITY 2ac80c6519afa457e303de1e743e942c 1
CATEGORY 2ac80c6519afa457e303de1e743e942c tmpltbank
-------------------------------

It continues like this, with more jobs being aborted and evicted.


4. At the end, the log file looks like this:

------------------------
...
009 (106259.000.000) 04/01 14:55:25 Job was aborted by the user.
        via condor_rm (by user lucia)
...
009 (106227.000.000) 04/01 14:55:25 Job was aborted by the user.
        via condor_rm (by user lucia)
...
(END)
-------------------------

The dag doesn't die after I get this email, though. A bunch of tmpltbank, inspiral, etc. jobs keep running, and condor_q shows:

-------------------------
lucia@deepthought:~/playground_20080314$ condor_q lucia

- Submitter: deepthought.merlin2.aei.mpg.de : <10.100.200.92:60979> : deepthought.merlin2.aei.mpg.de
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
100228.0   lucia           4/1  00:07   0+16:06:14 R  0   7.3 condor_dagman -f-
100229.0   lucia           4/1  00:07   0+16:04:43 R  0   7.3 condor_dagman -f-
100230.0   lucia           4/1  00:08   0+16:04:43 R  0   7.3 condor_dagman -f-
100231.0   lucia           4/1  00:08   0+16:03:13 R  0   7.3 condor_dagman -f-
106573.0   lucia           4/1  15:50   0+00:22:30 R  0   317.4 lalapps_tmpltbank
106574.0   lucia           4/1  15:51   0+00:22:09 R  0   317.4 lalapps_tmpltbank
106576.0   lucia           4/1  15:51   0+00:21:47 R  0   317.4 lalapps_tmpltbank
(... etc, more jobs here)
----------------------

After more than a day I get another email, similar to the one shown above, and then a couple more, all similar.
Eventually the tmpltbank and inspiral jobs finish and leave the condor queue, and condor_q only shows the first 3 condor_dagman -f- jobs, which stay there forever (I normally kill them after ~4-5 days since they do nothing else).

A rescue dag is created that, when re-submitted, shows the same behaviour as the original ihope.dag. Many jobs finish and produce output that can be plotted, but several remain that are never completed, and so neither is the analysis.
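
(By "re-submitted" I mean roughly the following, i.e. submitting the .rescue file that DAGMan writes next to the original dag; the exact file name is the one Condor reports, so take this only as a sketch:)

-------------------------------
$ export _CONDOR_DAGMAN_LOG_ON_NFS_IS_ERROR=FALSE
$ condor_submit_dag ihope.dag.rescue      # 7.0-style rescue file name
-------------------------------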

This is clearly not desirable behaviour. The _same_ code, run by other people on other months of data (the only differences being the GPS times and the software installations), has produced normal results on other clusters such as ligo-grid at Caltech, Nemo at UWM, and possibly others.


***** Some other facts

---> One day of data was successfully run on morgane in January. The changes between that run and this one are:

1. That submission was from morgane and this one is from deepthought (the difference between these 2 nodes is the amount of RAM: ~2 GB on morgane vs ~8 GB on deepthought).

2. At that time the cluster had a Condor version < 6.9.5, so the ihope.ini file used to create the dag had the following lines _uncommented_ (the uncommented form is shown after this list):

(ihope.ini)---------------------
; Following are required if running on a cluster with
; condor version < 6.9.5
;disable-dag-categories =
;disable-dag-priorities =
-------------------------------

3. The kernel was upgraded from version 2.6.17 (then) to 2.6.24.x (now) on _all_ nodes except the one I am submitting from (deepthought).
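
(For reference, the uncommented form mentioned in point 2 looked roughly like this, i.e. just the same two options with the leading semicolons removed:)

(ihope.ini, old cluster)--------
disable-dag-categories =
disable-dag-priorities =
-------------------------------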


Any insight into what might be causing this problem is much appreciated.

Thank you very much, especially if you've made it to the end of this email.
Lucia


--
--------------------------------------------
Lucia Santamaria
Max-Planck-Institut fuer Gravitationsphysik
Albert-Einstein-Institut
Am Muehlenberg 1, 17746 Golm, Germany
Office: +49(0)331-567-7181
---------------------------------------------