
[Condor-users] Inspiral dags die on morgane for unexplained reasons




Dear condor experts,

I am trying to run the inspiral analysis on morgane at the AEI. The inspiral Python script ihope creates a dag that is condor_submit_dag'ed to the cluster and runs the inspiral analysis end to end (segFind, dataFind, tmpltbank, inspiral, plots, etc.).

The cluster is running the Condor 7.0.0 RHEL3 binaries and the OS is 64-bit Debian Etch.
(We are aware of the special Debian Etch build of Condor, by the way.)

The behaviour described below has been observed in _both_ the standard and vanilla universes:

The dag is submitted with
$ condor_submit_dag ihope.dag
after setting
$ export _CONDOR_DAGMAN_LOG_ON_NFS_IS_ERROR=FALSE
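
(For reference, that environment variable just overrides the corresponding condor_config knob for the submitting shell; the equivalent persistent setting, shown here only as a sketch and not copied from our actual configuration, would be something like:)

-------------------------------
# condor_config / condor_config.local on the submit node -- sketch only,
# equivalent to exporting _CONDOR_DAGMAN_LOG_ON_NFS_IS_ERROR=FALSE
DAGMAN_LOG_ON_NFS_IS_ERROR = False
-------------------------------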

The dag runs lalapps_tmpltbank, -inspiral, -thinca, -coire, ... jobs for roughly half a day, and then I typically get a Condor email like this:

(1st condor error email)-------
This is an automated email from the Condor system
on machine "deepthought.merlin2.aei.mpg.de".  Do not reply.

Your condor job exited with status 1.

Job: /usr/bin/condor_dagman -f -l . -Debug 3 -Lockfile
bbhinj/inspiral_hipe_bbhinj.BBHINJ.dag.lock -Condorlog
/home/lucia/playground_20080314/logs/tmp8_5prr -Dag
bbhinj/inspiral_hipe_bbhinj.BBHINJ.dag -Rescue
/.auto/home/lucia/playground_20080314/857232370-859651570/inspiral_hipe_bbhinj.BBHINJ.dag.rescue -UseDagDir
------------------------------

The log file mentioned there looks as follows:

1. It starts like this (normal job submission):

----------------------
000 (100391.000.000) 04/01 01:58:11 Job submitted from host: <10.100.200.92:60979>
    DAG Node: 5fde592a41be3daedacc86750956a4d2
...
001 (100391.000.000) 04/01 01:58:18 Job executing on host: <10.100.201.102:55375>
...
006 (100391.000.000) 04/01 01:58:26 Image size of job updated: 417624
(...)
-----------------------

2. It continues like this (normal termination of some jobs):

-----------------------
...
005 (100391.000.000) 04/01 02:43:18 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:44:09, Sys 0 00:00:03  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:44:09, Sys 0 00:00:03  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
001 (100743.000.000) 04/01 02:43:22 Job executing on host: <10.100.206.114:48666>
(...)
--------------------------

3. At some point it looks like this (jobs evicted for no apparent reason).
Note the 8 seconds between the submission and the abortion of the first removed job, 106487!

-------------------------
..
000 (106487.000.000) 04/01 14:55:12 Job submitted from host: <10.100.200.92:60979>
    DAG Node: df12fc4b42b0acd77cd85dff30a864bb
...
009 (106487.000.000) 04/01 14:55:20 Job was aborted by the user.
        via condor_rm (by user lucia)
...
004 (106403.000.000) 04/01 14:55:20 Job was evicted.
        (0) Job was not checkpointed.
                Usr 0 00:10:20, Sys 0 00:00:03  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
(...)
-------------------------

The first job marked for abortion (106487) corresponds to the following DAG node:

(106487 job - 1st aborted)-----
-------------------------------
JOB df12fc4b42b0acd77cd85dff30a864bb inspiral_hipe_bbhinj.tmpltbank_H2.BBHINJ.sub DIR bbhinj VARS df12fc4b42b0acd77cd85dff30a864bb macroframecache="cache/H-H2_RDS_C03_L2-858079508-858091238.cache" macrogpsendtime="858085404" macrochannelname="H2:LSC-STRAIN" macrogpsstarttime="858083356"
PRIORITY df12fc4b42b0acd77cd85dff30a864bb 1
CATEGORY df12fc4b42b0acd77cd85dff30a864bb tmpltbank
------------------------------

So it is a template bank job that, when run by hand, completes and returns _no_ error.
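
(To be explicit, by "run by hand" I mean running the same executable with the arguments that the VARS above get substituted into, roughly as sketched below; the option names are written from memory and may not match the actual submit file, inspiral_hipe_bbhinj.tmpltbank_H2.BBHINJ.sub, exactly:)

-------------------------------
$ cd bbhinj
$ lalapps_tmpltbank \
    --frame-cache cache/H-H2_RDS_C03_L2-858079508-858091238.cache \
    --gps-start-time 858083356 \
    --gps-end-time 858085404 \
    --channel-name H2:LSC-STRAIN \
    ...        # plus the fixed arguments from the .sub file
               # (option names sketched from memory, not copied from it)
-------------------------------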

Notice that exactly when the first job gets aborted (106487), another job gets evicted (106403). That one is _also_ a tmpltbank job:

(106403 job - 1st evicted)-------
---------------------------------
JOB 2ac80c6519afa457e303de1e743e942c inspiral_hipe_bbhinj.tmpltbank_H2.BBHINJ.sub DIR bbhinj VARS 2ac80c6519afa457e303de1e743e942c macroframecache="cache/H-H2_RDS_C03_L2-858037917-858054493.cache" macrogpsendtime="858045733" macrochannelname="H2:LSC-STRAIN" macrogpsstarttime="858043685"
PRIORITY 2ac80c6519afa457e303de1e743e942c 1
CATEGORY 2ac80c6519afa457e303de1e743e942c tmpltbank
-------------------------------

It continues like this, with more jobs being aborted and evicted.


4. At the end, the log file looks like this:

------------------------
...
009 (106259.000.000) 04/01 14:55:25 Job was aborted by the user.
        via condor_rm (by user lucia)
...
009 (106227.000.000) 04/01 14:55:25 Job was aborted by the user.
        via condor_rm (by user lucia)
...
(END)
-------------------------

The dag doesn't die after I get this email, though. A bunch of tmpltbank, inspiral, etc. jobs keep running, and condor_q shows:

-------------------------
lucia@deepthought:~/playground_20080314$ condor_q lucia

- Submitter: deepthought.merlin2.aei.mpg.de : <10.100.200.92:60979> : deepthought.merlin2.aei.mpg.de
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
100228.0   lucia           4/1  00:07   0+16:06:14 R  0   7.3 condor_dagman -f-
100229.0   lucia           4/1  00:07   0+16:04:43 R  0   7.3 condor_dagman -f-
100230.0   lucia           4/1  00:08   0+16:04:43 R  0   7.3 condor_dagman -f-
100231.0   lucia           4/1  00:08   0+16:03:13 R  0   7.3 condor_dagman -f-
106573.0   lucia           4/1  15:50   0+00:22:30 R  0   317.4 lalapps_tmpltbank
106574.0   lucia           4/1  15:51   0+00:22:09 R  0   317.4 lalapps_tmpltbank
106576.0   lucia           4/1  15:51   0+00:21:47 R  0   317.4 lalapps_tmpltbank
(... etc, more jobs here)
----------------------

After more than a day I get another email, similar to the one shown above, and then a couple more, all similar.
Eventually the tmpltbank and inspiral jobs finish and leave the condor queue, and condor_q only shows the first 3 condor_dagman -f- jobs, which stay there forever (I normally kill them after ~4-5 days since they do nothing else).

A rescue dag is created that, when re-submitted, shows the same behaviour as the original ihope.dag. Many jobs finish and produce output that can be plotted, but several remain that are never completed, and so neither is the analysis.
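
(By "re-submitted" I mean roughly the following, i.e. submitting the .rescue file that DAGMan writes next to the original dag; the exact file name is the one Condor reports, so take this only as a sketch:)

-------------------------------
$ export _CONDOR_DAGMAN_LOG_ON_NFS_IS_ERROR=FALSE
$ condor_submit_dag ihope.dag.rescue      # 7.0-style rescue file name
-------------------------------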

This is clearly not desirable behaviour. The _same_ code, run by other people on other months of data (the only differences being the GPS times and the software installations), has produced normal results on other clusters such as ligo-grid at Caltech, Nemo at UWM, and possibly others.


***** Some other facts

---> One day of data was successfully run on morgane in January. The changes between that run and this one are:

1. That submission was from morgane and this one is from deepthought (the difference between these 2 nodes is the amount of RAM: ~2 GB on morgane vs ~8 GB on deepthought).

2. At that time the cluster had a Condor version < 6.9.5, so the ihope.ini file used to create the dag had the following lines _uncommented_ (the uncommented form is shown after this list):

(ihope.ini)---------------------
; Following are required if running on a cluster with
; condor version < 6.9.5
;disable-dag-categories =
;disable-dag-priorities =
-------------------------------

3. The kernel was upgraded from version 2.6.17 (then) to 2.6.24.x (now) on _all_ nodes except the one I am submitting from (deepthought).
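
(For reference, the uncommented form mentioned in point 2 looked roughly like this, i.e. just the same two options with the leading semicolons removed:)

(ihope.ini, old cluster)--------
disable-dag-categories =
disable-dag-priorities =
-------------------------------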


Any insight into what might be causing this problem is much appreciated.

Thank you very much, especially if you've made it to the end of this email.
Lucia


--
--------------------------------------------
Lucia Santamaria
Max-Planck-Institut fuer Gravitationsphysik
Albert-Einstein-Institut
Am Muehlenberg 1, 17746 Golm, Germany
Office: +49(0)331-567-7181
---------------------------------------------