
[HTCondor-users] DAGMan low submission rate



Hello,

One of our users started a DAG with 1.2M nodes, none of which depend on
other nodes. On average, only about 120 jobs are submitted per
minute, and this rate fluctuates strongly. The jobs finish relatively
quickly, so the idle queue is mostly empty. Resources are
available and we are not hitting configuration limits:

DAGMAN_MAX_JOBS_IDLE = 500
DAGMAN_MAX_SUBMITS_PER_INTERVAL = 1000
POLLING_INTERVAL = 5
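
For scale, here is a rough back-of-the-envelope estimate of how long the DAG would take at the observed rate. It assumes the average of 120 submissions per minute holds for the whole run, which it may not given the fluctuations. The variable names are only illustrative.

```python
# Rough estimate: time to submit every node at the observed average rate.
nodes = 1_200_000          # DAG size reported above
rate_per_minute = 120      # observed average submission rate (assumed constant)

minutes = nodes / rate_per_minute
days = minutes / 60 / 24

print(f"{minutes:.0f} minutes, i.e. about {days:.1f} days")
```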

We don't know whether this is the right place to look, but attaching
strace to the DAGMan process shows peculiar behavior:

strace -tt -f -e trace=file,read,write  -p 3299064

<snip>
3299064 07:46:51.207186 write(3, "06/27/17 05:46:51 755994       0"..., 79) = 79
3299064 07:46:51.207243 write(3, "06/27/17 05:46:51 0 job proc(s) "..., 47) = 47
3299064 07:46:51.208907 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3288874, si_uid=5051, si_status=0, si_utime=2, si_stime=7} ---
3299064 07:46:51.208931 write(5, "!", 1) = 1
3299064 07:46:51.209049 read(4, "!", 8) = 1
3299064 07:46:51.209066 read(4, 0x7ffe87a31040, 8) = -1 EAGAIN (Resource temporarily unavailable)
3299064 07:46:53.211173 read(4, 0x7ffe87a31040, 8) = -1 EAGAIN (Resource temporarily unavailable)
3299064 07:46:53.211228 stat("O1_BKG_C02_iMRA_20000yrs_level6-10_FULLSTAGE.dag.halt", 0x7ffe87a30bc0) = -1 ENOENT (No such file or directory)
3299064 07:46:53.213527 write(3, "06/27/17 05:46:53 Submitting HTC"..., 62) = 62
3299064 07:46:53.213643 write(3, "06/27/17 05:46:53 Adding a DAGMa"..., 181) = 181
</snip>

ls -l /proc/3299064/fd/[4,5]
lr-x------ 1 user atlas 64 Jun 27 07:21 /proc/3299064/fd/4 -> pipe:[19812062]
l-wx------ 1 user atlas 64 Jun 27 07:21 /proc/3299064/fd/5 -> pipe:[19812062]

It seems that reading the pipe always stalls the DAGMan process for about 2
seconds before it submits jobs.
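
To check whether this pause recurs across a longer trace, the timestamps could be scanned for gaps. This is a minimal sketch, assuming the trace was captured with `strace -tt -f` as above. `find_gaps` and the one-second threshold are hypothetical choices, and the sample lines are copied from the excerpt. Note that the `-e trace=file,read,write` filter hides calls such as select or poll, so a gap shows where time passes but not necessarily which call is waiting.

```python
import re
from datetime import datetime

# Matches the "<pid> HH:MM:SS.ffffff <rest>" prefix produced by `strace -tt -f`.
LINE_RE = re.compile(r"^(\d+)\s+(\d{2}:\d{2}:\d{2}\.\d{6})\s+(.*)$")


def find_gaps(lines, threshold=1.0):
    """Return (gap_seconds, previous_call, next_call) for each pause >= threshold."""
    gaps = []
    prev_time = prev_call = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines without a timestamp, e.g. <snip> markers
        t = datetime.strptime(m.group(2), "%H:%M:%S.%f")
        call = m.group(3)
        if prev_time is not None:
            delta = (t - prev_time).total_seconds()
            if delta < 0:
                delta += 86400  # trace crossed midnight
            if delta >= threshold:
                gaps.append((delta, prev_call, call))
        prev_time, prev_call = t, call
    return gaps


# Sample lines taken from the strace excerpt above.
SAMPLE = """\
3299064 07:46:51.209049 read(4, "!", 8) = 1
3299064 07:46:51.209066 read(4, 0x7ffe87a31040, 8) = -1 EAGAIN (Resource temporarily unavailable)
3299064 07:46:53.211173 read(4, 0x7ffe87a31040, 8) = -1 EAGAIN (Resource temporarily unavailable)
3299064 07:46:53.211228 stat("O1_BKG_C02_iMRA_20000yrs_level6-10_FULLSTAGE.dag.halt", 0x7ffe87a30bc0) = -1 ENOENT (No such file or directory)
""".splitlines()

for gap, before, after in find_gaps(SAMPLE):
    print(f"{gap:.3f}s pause after: {before[:40]}")
```

Running the same function over a full strace log (one line per list element) would show whether the roughly 2-second gap precedes every submit cycle or only some of them.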

Might this be the cause of the low submission rate? We would like to
increase this rate to fill our empty slots better.

If it helps we can increase the debug flags.

condor_version
$CondorVersion: 8.6.3 May 08 2017 BuildID: 404928 $
$CondorPlatform: x86_64_Debian8 $

The information gathered with condor_gather_info can be found here:
https://wolke7.aei.mpg.de/index.php/s/7mabqyRw5QXSU3v/download

Cheers,
Henning