
Re: [Condor-users] dagman does not submit ready jobs, how to debug?



On Tue, 17 Mar 2009, Carsten Aulbert wrote:

I thought by now I understood some part of Condor, but it nevertheless
manages to surprise me every now and then ;)

Our machines are using the following DAGMAN settings:
$ condor_config_val -dump |grep DAGMAN
DAGMAN_ABORT_DUPLICATES = TRUE
DAGMAN_COPY_TO_SPOOL = TRUE
DAGMAN_MAX_JOBS_IDLE = 500
DAGMAN_MAX_JOBS_SUBMITTED = 2000
DAGMAN_MAX_SUBMITS_PER_INTERVAL = 200
DAGMAN_PROHIBIT_MULTI_JOBS = TRUE
DAGMAN_SUBMIT_DELAY = 0
DAGMAN_SUBMIT_DEPTH_FIRST = TRUE

Right now, I have about 860 test jobs running, and across the collection
of all DAGs (including the überdag) it looks like quite a number of jobs
are ready for submission but are not being submitted:

XXXXXXXXXXXXX  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
3/16 23:11:36   142       .        .       .       .          .        .
3/17 22:41:46  2479       .       31       .     284       1854        .
3/17 22:41:47  2442       .      226       .     485       9110        .
3/17 22:42:23  3487       .      196       .     495       5670        .
3/17 22:42:23  3545       .      135       .     555       5613        .
3/17 22:42:23  3522       .      135       .     592       5596        3
3/17 22:41:46  3620       .       98       .     453       5677        .

(. means 0)

Since no jobs are idle, I have not yet reached 2000 jobs, and whatever
daemon cycle DAGMAN_MAX_SUBMITS_PER_INTERVAL refers to has surely
elapsed many times over (I've been watching this for more than an hour
now), I'm running out of ideas as to why only very few jobs are submitted.
The cluster itself has more than enough slots open/in backfill.

Any idea where I should continue digging?

First of all, one note: DAGMAN_MAX_JOBS_IDLE and DAGMAN_MAX_JOBS_SUBMITTED are per-DAGMan-instance limits, not overall limits. So if you have nested DAGs, you might get more than 2000 total jobs submitted across the DAGs. Of course, that makes it even stranger that you aren't getting all of the ready jobs into the queue.

The first thing I would check is just the rate of submission. You can do this by looking in your dagman.out file(s) for lines like:

  3/18 10:22:41 Submitting Condor Node B job(s)...

Are your jobs very short-running? Is it possible that the rate of submission is just not keeping up with the rate at which the jobs are finishing?
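
To put a number on the submission rate, something like the following rough sketch could tally those lines per minute. It assumes every submission line in dagman.out has the same shape as the sample line above (an "M/D HH:MM:SS" timestamp followed by "Submitting Condor Node <name> job"); the function name submissions_per_minute is made up for this illustration and is not part of Condor.

```python
import re
import sys
from collections import Counter

# Assumed line shape, taken from the sample line in this thread:
#   3/18 10:22:41 Submitting Condor Node B job(s)...
SUBMIT_RE = re.compile(r"^(\d+/\d+ \d+:\d+):\d+ Submitting Condor Node (\S+) job")


def submissions_per_minute(lines):
    """Count submission lines per 'M/D HH:MM' bucket (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = SUBMIT_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# Only read a file when a path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for minute, count in submissions_per_minute(handle).items():
            print(minute, count)
```

Comparing these per-minute counts with how quickly the "Done" column grows should show whether submission is simply lagging behind completion.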

Also, if you set the debug level to 4:

  condor_submit_dag ... -debug 4 ... <whatever>.dag

the dagman.out file will explicitly list every time a submission is deferred because of the throttles:

  3/18 10:22:47 Max jobs (1) already running; deferring submission of 1 ready job.

Oh, yeah -- you don't have category throttles in your DAG, do you? If you do, deferrals for those throttles will also show up at debug level 4.
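
If the debug-level-4 output is long, a small script along these lines might help pull out the deferrals. It is only a sketch: it assumes each deferral is written on a single line containing the phrase "deferring submission of N ready job", as in the sample above. Messages for other throttles (category throttles, for instance) may be worded differently and would need their own pattern. The name deferral_events is hypothetical.

```python
import re
import sys

# Assumed phrase, taken from the sample deferral line in this thread.
DEFER_RE = re.compile(r"deferring submission of (\d+)\s+ready job")


def deferral_events(lines):
    """Return (full_line, deferred_job_count) for each deferral line found."""
    events = []
    for line in lines:
        match = DEFER_RE.search(line)
        if match:
            events.append((line.strip(), int(match.group(1))))
    return events


# Only read a file when a path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        found = deferral_events(handle)
    for text, _count in found:
        print(text)
    print("deferral lines:", len(found))
```

If this finds no deferral lines at all while jobs are sitting in "Ready", the throttles are probably not what is holding submission back.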

Those are the first things that occur to me...

Kent Wenger
Condor Team