
Re: [Condor-users] Inspiral dags die in morgane for unexplained reasons




Thank you so much for your prompt reply, Kent,


Does this mean that the node job user logs *are* on NFS?  If that's the
case, is it possible to move them to a local file system?  I'm not *sure*

I will ask Steffen about this possibility.
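One way to check whether a log path sits on NFS is to look it up in the mount table. The following is only a minimal sketch, assuming a Linux host where the mount table has the `/proc/mounts` layout (device, mount point, filesystem type, ...). The helper names and the sample mount entries are hypothetical and chosen purely for illustration.

```python
import os

# Hypothetical mount table in /proc/mounts format (illustration only).
SAMPLE_MOUNTS = """\
/dev/sda1 / ext3 rw 0 0
fileserver:/export/home /.auto/home nfs rw 0 0
"""

def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point containing path."""
    path = os.path.abspath(path)
    best = ("", "")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the most specific (longest) matching mount point.
            if len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best

def is_nfs(path, mounts_text):
    """True if the filesystem type for path starts with 'nfs' (nfs, nfs4, ...)."""
    return fs_type_for(path, mounts_text)[1].startswith("nfs")
```

On a real machine the second argument would be the contents of `/proc/mounts`, for example `is_nfs(log_path, open("/proc/mounts").read())`. Automounted paths may only appear in the table after they have been accessed.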


Hmm -- are you running more than one instance of the same DAG at a time?
That will almost certainly cause problems.  Also, even if you are not

I'm not sure, but I'd tend to think so... I hope some ihope expert can answer this.


Could you send the dagman.out file corresponding to this run?  That is
generally the first place to look when DAGMan has a problem.

Yes, you can have a look at it here:
http://pandora.aei.mpg.de/~lucia/ihope.dag.dagman.out

If you can also send the DAG file itself, and the entire user log file for
the node jobs, that would help diagnose things.

Yes, sorry, I tried to attach them, but they were too big for the mailing list.

* The dag:
http://pandora.aei.mpg.de/~lucia/ihope.dag

* The log file
http://pandora.aei.mpg.de/~lucia/error_log


Also: LATEST NEWS:
The rescue dag that I get now dies with this signal:

----------------
This is an automated email from the Condor system
on machine "deepthought.merlin2.aei.mpg.de".  Do not reply.

Your condor job was killed by signal 11.

Job: /usr/bin/condor_dagman -f -l . -Debug 3 -Lockfile ihope.dag.rescue.lock -Condorlog /.auto/home/lucia/playground_20080314/857232370-859651570/playground/inspiral_hipe_playground.PLAYGROUND.dag.dagman.log -Dag ihope.dag.rescue -Rescue ihope.dag.rescue.rescue
------------------

This is _not_ new behaviour: I moved from morgane to the deepthought node because _every_ dag submitted from morgane would get killed with this signal 11. We thought that submitting from deepthought had solved the problem, since it has more RAM than morgane (and sig 11, a segmentation fault, is known to be memory-related).

But this is the first time that I'm getting a sig 11 from deepthought, and I've already submitted this dag, with small variations, ~5 times.

Note that I never got a sig 11 from deepthought when I used the 'standard' universe. The issue I'm showing now occurs in the vanilla universe.
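For reference, signal 11 on Linux is SIGSEGV (segmentation fault). It indicates an invalid memory access in the process and does not by itself mean the machine ran out of RAM. A minimal Python sketch for decoding a signal number from such a notification follows; the helper name `describe_signal` is hypothetical.

```python
import signal

def describe_signal(number):
    """Map a numeric signal to its symbolic name on the current platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number does not correspond to a signal known on this platform.
        return "unknown signal %d" % number

print(describe_signal(11))
```

The mapping is platform-dependent, so the check is best run on the same kind of machine as the submit host.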

After I get 5 such sig 11 emails from Condor master, the queue looks like this:

-----------------------
lucia@deepthought:~/playground_20080314/857232370-859651570$ condor_q lucia

-- Submitter: deepthought.merlin2.aei.mpg.de : <10.100.200.92:60979> : deepthought.merlin2.aei.mpg.de
 ID       OWNER  SUBMITTED   RUN_TIME    ST PRI SIZE  CMD
107014.0  lucia  4/1 18:25   0+00:36:09  R  0   317.4 lalapps_tmpltbank
107015.0  lucia  4/1 18:25   0+00:36:09  R  0   317.4 lalapps_tmpltbank
107016.0  lucia  4/1 18:25   0+00:36:09  R  0   317.4 lalapps_tmpltbank
107017.0  lucia  4/1 18:25   0+00:36:09  R  0   317.4 lalapps_tmpltbank
107018.0  lucia  4/1 18:25   0+00:36:09  R  0   317.4 lalapps_tmpltbank
107021.0  lucia  4/1 18:26   0+00:36:09  R  0   317.4 lalapps_tmpltbank
107024.0  lucia  4/1 18:26   0+00:35:47  R  0   317.4 lalapps_tmpltbank
(...)
56 jobs; 0 idle, 56 running, 0 held
-----------------------
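Whether a condor_dagman job is still present in such a listing can also be checked mechanically. The sketch below assumes the column layout shown above (ID, OWNER, two SUBMITTED fields, RUN_TIME, ST, PRI, SIZE, CMD). The function names are hypothetical, and the sample text is copied from the listing.

```python
# A few rows copied from the condor_q listing above.
SAMPLE_QUEUE = """\
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
107014.0 lucia 4/1 18:25 0+00:36:09 R 0 317.4 lalapps_tmpltbank
107015.0 lucia 4/1 18:25 0+00:36:09 R 0 317.4 lalapps_tmpltbank
107024.0 lucia 4/1 18:26 0+00:35:47 R 0 317.4 lalapps_tmpltbank
56 jobs; 0 idle, 56 running, 0 held
"""

def parse_condor_q(text):
    """Extract job rows from condor_q text output, assuming the layout above."""
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 9:
            continue  # header, summary, or blank line
        job_id = fields[0]
        # Job IDs look like "cluster.proc", e.g. "107014.0".
        if "." not in job_id or not job_id.replace(".", "", 1).isdigit():
            continue
        jobs.append({
            "id": job_id,
            "owner": fields[1],
            "status": fields[5],
            "cmd": " ".join(fields[8:]),
        })
    return jobs

def has_dagman(jobs):
    """True if any job's command starts with condor_dagman."""
    return any(job["cmd"].startswith("condor_dagman") for job in jobs)

jobs = parse_condor_q(SAMPLE_QUEUE)
print(len(jobs), "jobs parsed; DAGMan present:", has_dagman(jobs))
```

If the node jobs are running while no condor_dagman entry is present, the DAGMan process that submitted them is no longer in the queue.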

I have never seen such a thing before: there is no "condor_dagman -f -" entry at the beginning of the queue...

I'm sorry this is getting more complicated by the minute.

Thanks again for any help,
Lucia




Any insight in what might be causing this problem is much appreciated.

If I can get a look at the dagman.out file, that should help a lot.

Kent Wenger
Condor Team


--
--------------------------------------------
Lucia Santamaria
Max-Planck-Institut fuer Gravitationsphysik
Albert-Einstein-Institut
Am Muehlenberg 1, 17746 Golm, Germany
Office: +49(0)331-567-7181
---------------------------------------------