
[Condor-users] dagman does not submit ready jobs, how to debug?



Hi all,

I thought by now I understood some part of Condor, but it nevertheless
manages to surprise me every now and then ;)

Our machines are using the following DAGMAN settings:
$ condor_config_val -dump |grep DAGMAN
DAGMAN_ABORT_DUPLICATES = TRUE
DAGMAN_COPY_TO_SPOOL = TRUE
DAGMAN_MAX_JOBS_IDLE = 500
DAGMAN_MAX_JOBS_SUBMITTED = 2000
DAGMAN_MAX_SUBMITS_PER_INTERVAL = 200
DAGMAN_PROHIBIT_MULTI_JOBS = TRUE
DAGMAN_SUBMIT_DELAY = 0
DAGMAN_SUBMIT_DEPTH_FIRST = TRUE

Right now I have about 860 test jobs running, and across the collection
of all DAGs (including the überdag) quite a number of jobs are ready for
submission but are not being submitted:

Timestamp      Done     Pre   Queued    Post   Ready   Un-Ready   Failed
3/16 23:11:36   142       .        .       .       .          .        .
3/17 22:41:46  2479       .       31       .     284       1854        .
3/17 22:41:47  2442       .      226       .     485       9110        .
3/17 22:42:23  3487       .      196       .     495       5670        .
3/17 22:42:23  3545       .      135       .     555       5613        .
3/17 22:42:23  3522       .      135       .     592       5596        3
3/17 22:41:46  3620       .       98       .     453       5677        .

(. means 0)

Since no jobs are idle, I have not yet reached 2000 submitted jobs, and
whatever daemon cycle DAGMAN_MAX_SUBMITS_PER_INTERVAL refers to must have
elapsed many times over (I've been watching this for more than an hour
now), I'm running out of ideas as to why only very few jobs are submitted.
The cluster itself has more than enough slots open/in backfill.
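
To make that reasoning concrete, here is a rough Python sketch that checks
the numbers from the status table against the two limits. It is only a
simplified model of my assumption about how the throttles work, not
DAGMan's actual logic: it assumes each limit applies per DAG, that zero
jobs are idle, and the helper name submit_headroom is made up for this
illustration.

```python
# Simplified model of the throttle reasoning; NOT DAGMan's real logic.
# Assumption: each limit is applied per DAG, and no jobs are idle.
MAX_JOBS_IDLE = 500        # DAGMAN_MAX_JOBS_IDLE
MAX_JOBS_SUBMITTED = 2000  # DAGMAN_MAX_JOBS_SUBMITTED

# (queued, ready) per DAG, copied from the status table ("." means 0).
dags = [(0, 0), (31, 284), (226, 485), (196, 495),
        (135, 555), (135, 592), (98, 453)]


def submit_headroom(queued, ready, idle=0):
    """Hypothetical helper: ready jobs that fit under both limits."""
    room_submitted = MAX_JOBS_SUBMITTED - queued  # room under the submitted cap
    room_idle = MAX_JOBS_IDLE - idle              # room under the idle cap
    return max(0, min(ready, room_submitted, room_idle))


headroom = [submit_headroom(q, r) for q, r in dags]
total_queued = sum(q for q, _ in dags)
total_ready = sum(r for _, r in dags)

print("headroom per DAG:", headroom)
print("total queued:", total_queued, "total ready:", total_ready)
```

Under these assumptions every DAG with ready jobs has a non-zero headroom,
which is why neither limit looks like the explanation to me.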

Any idea where I should continue digging?

Cheers

Carsten