
Re: [Condor-users] jobs removed automatically by dags?



On Wed, 18 Mar 2009, Carsten Aulbert wrote:

Things are getting more mysterious. A set of my jobs (the set of DAGs
from the previous email) was hindered by a flaky schedd:

007 (10293330.000.000) 03/17 23:27:45 Shadow exception!
       Failed to connect to schedd!
       1161  -  Run Bytes Sent By Job
       6592404  -  Run Bytes Received By Job
...
007 (10293318.000.000) 03/17 23:27:45 Shadow exception!
       Failed to connect to schedd!
       1161  -  Run Bytes Sent By Job
       6592404  -  Run Bytes Received By Job
...

These I do understand, and they will probably be restartable by the
rescue DAGs. However, a few minutes later, when I was not near the
machines (nor was the other admin who would have the rights), this
happened:

009 (10293330.000.000) 03/17 23:41:07 Job was aborted by the user.
       via condor_rm (by user carsten)
...
009 (10293324.000.000) 03/17 23:41:07 Job was aborted by the user.
       via condor_rm (by user carsten)
...
009 (10293318.000.000) 03/17 23:41:07 Job was aborted by the user.
       via condor_rm (by user carsten)
...
009 (10293342.000.000) 03/17 23:41:07 Job was aborted by the user.
       via condor_rm (by user carsten)
...
009 (10293336.000.000) 03/17 23:41:07 Job was aborted by the user.
       via condor_rm (by user carsten)
...
009 (10293348.000.000) 03/17 23:41:07 Job was aborted by the user.
       via condor_rm (by user carsten)
...
009 (10293055.000.000) 03/17 23:41:07 Job was aborted by the user.
       via condor_rm (by user carsten)
...

Will dagman condor_rm jobs on its own?

Under certain circumstances, yes. If you condor_rm the DAGMan job, it will
condor_rm its node jobs. I wonder if it's possible that the schedd problems
caused the DAGMan job to get condor_rm'ed. You can find out by
looking at the dagman.out file: if DAGMan was condor_rm'ed, you should see
something like this:

  3/18 10:26:53 Received SIGUSR1
  3/18 10:26:53 Aborting DAG...
  3/18 10:26:53 Writing Rescue DAG to dag_files/diamond.dag.rescue001...
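
If there are many dagman.out files to check, a small script can do the search. The following is only a rough sketch: the marker strings are copied from the excerpt above and may be worded differently in other Condor versions, and the names find_abort_lines and scan_dagman_out are made up for illustration.

```python
import sys
from pathlib import Path

# Marker strings taken from the dagman.out excerpt above. This is an
# assumption: other Condor versions may word these lines differently.
ABORT_MARKERS = ("Received SIGUSR1", "Aborting DAG")


def find_abort_lines(lines):
    """Return (line_number, text) pairs for lines containing an abort marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in ABORT_MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_dagman_out(path):
    """Scan one dagman.out file (hypothetical helper) and return its hits."""
    with Path(path).open(errors="replace") as handle:
        return find_abort_lines(handle)


if __name__ == "__main__":
    # Usage: python scan_dagman.py path/to/some.dag.dagman.out ...
    for path in sys.argv[1:]:
        for number, text in scan_dagman_out(path):
            print(f"{path}:{number}: {text}")
```

No output for a file suggests that DAGMan itself was not removed, at least not with these exact log messages.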

You could also look at the DAGMan job's .dagman.log file and see what it
says.
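
To pull the abort events out of a user log such as that one, something along these lines might help. It is a sketch that assumes the event layout shown in the excerpts quoted above (a three-digit event code, the job ID in parentheses, date, time, then indented detail lines ending with "..."); the function name aborted_jobs is invented for this example.

```python
import re

# Event header layout assumed from the quoted log excerpts, e.g.
# "009 (10293330.000.000) 03/17 23:41:07 Job was aborted by the user."
EVENT_HEADER = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.(?P<subproc>\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)

ABORT_CODE = "009"  # event code of the "Job was aborted" entries above


def aborted_jobs(log_lines):
    """Collect abort events, with their indented detail lines, from a user log."""
    results = []
    current = None
    for raw in log_lines:
        line = raw.rstrip("\n")
        match = EVENT_HEADER.match(line)
        if match:
            current = None
            if match["code"] == ABORT_CODE:
                current = {
                    "job": f'{match["cluster"]}.{match["proc"]}',
                    "when": f'{match["date"]} {match["time"]}',
                    "detail": [],
                }
                results.append(current)
        elif line.strip() == "...":
            current = None  # "..." closes the current event
        elif current is not None:
            current["detail"].append(line.strip())
    return results
```

Calling aborted_jobs(open(path)) on the DAGMan job's own log and finding an entry there would indicate that the DAGMan job itself was removed, and the detail line shows how.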

Kent Wenger
Condor Team