
Re: [HTCondor-users] DAGMan low submission rate



Hi Henning,

One trick we did in CMS is to wrap condor_submit with a small bash script that uses an environment variable to disable the implicit condor_reschedule.  This way, raw "condor_submit" gets the default behavior while dagman-based submits are more optimized.

It's been two years, so I forget the raw speedup - but I recall it being quite impressive.
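
A wrapper along these lines might look roughly like the sketch below. It is not the actual CMS script. It assumes that the SUBMIT_SEND_RESCHEDULE configuration knob controls the implicit condor_reschedule and that it can be overridden through the usual _CONDOR_ environment prefix; please check both against the manual for your HTCondor version. CONDOR_SUBMIT_BIN is a hypothetical variable introduced here only so the wrapper can be pointed at a different binary.

```shell
#!/bin/bash
# Sketch of a condor_submit wrapper for DAGMan-driven submits.
# Assumption: SUBMIT_SEND_RESCHEDULE=False suppresses the implicit
# condor_reschedule, and the _CONDOR_ prefix overrides the knob from
# the environment. CONDOR_SUBMIT_BIN is hypothetical (illustration only).

submit_no_reschedule() {
    # The override applies to this one invocation only, so a plain
    # condor_submit elsewhere keeps the default behavior.
    _CONDOR_SUBMIT_SEND_RESCHEDULE=False \
        "${CONDOR_SUBMIT_BIN:-condor_submit}" "$@"
}

# As a standalone wrapper script, finish with:
#   submit_no_reschedule "$@"
```

DAGMan would then have to be told to call the wrapper instead of condor_submit. I believe DAGMAN_CONDOR_SUBMIT_EXE is the knob for that, but treat that as an assumption too.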

Brian

PS - see also https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=5164

On Jun 27, 2017, at 1:30 AM, Henning Fehrmann <henning.fehrmann@xxxxxxxxxx> wrote:

Hello,

one of our users started a DAG with 1.2M nodes, none of which depend on
other nodes. On average, about 120 jobs are submitted per
minute, but this number fluctuates strongly. The jobs execute relatively
quickly, so the idle queue is mostly empty. Resources are
available and we are not hitting configuration limits:

DAGMAN_MAX_JOBS_IDLE = 500
DAGMAN_MAX_SUBMITS_PER_INTERVAL = 1000
POLLING_INTERVAL = 5
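
For scale, here is a back-of-the-envelope estimate of how long the DAG would need at the observed rate. The only inputs are the 1.2M nodes and roughly 120 submits per minute mentioned above; the function name is made up for illustration.

```shell
# Rough time to submit a DAG at a constant rate (integer minutes).
# dag_submit_minutes is a hypothetical helper, not an HTCondor tool.
dag_submit_minutes() {
    nodes=$1
    rate_per_min=$2
    echo $(( nodes / rate_per_min ))
}

minutes=$(dag_submit_minutes 1200000 120)
echo "$minutes minutes, about $(( minutes / 60 )) hours"
```

That works out to 10000 minutes, or close to a week of submission time alone.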

We don't know whether this is the right direction to look, but attaching
strace to DAGMan shows peculiar behavior:

strace -tt -f -e trace=file,read,write  -p 3299064

<snip>
3299064 07:46:51.207186 write(3, "06/27/17 05:46:51 755994       0"..., 79) = 79
3299064 07:46:51.207243 write(3, "06/27/17 05:46:51 0 job proc(s) "..., 47) = 47
3299064 07:46:51.208907 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3288874, si_uid=5051, si_status=0, si_utime=2, si_stime=7} ---
3299064 07:46:51.208931 write(5, "!", 1) = 1
3299064 07:46:51.209049 read(4, "!", 8) = 1
3299064 07:46:51.209066 read(4, 0x7ffe87a31040, 8) = -1 EAGAIN (Resource temporarily unavailable)
3299064 07:46:53.211173 read(4, 0x7ffe87a31040, 8) = -1 EAGAIN (Resource temporarily unavailable)
3299064 07:46:53.211228 stat("O1_BKG_C02_iMRA_20000yrs_level6-10_FULLSTAGE.dag.halt", 0x7ffe87a30bc0) = -1 ENOENT (No such file or directory)
3299064 07:46:53.213527 write(3, "06/27/17 05:46:53 Submitting HTC"..., 62) = 62
3299064 07:46:53.213643 write(3, "06/27/17 05:46:53 Adding a DAGMa"..., 181) = 181
</snip>

ls -l /proc/3299064/fd/[4,5]
lr-x------ 1 user atlas 64 Jun 27 07:21 /proc/3299064/fd/4 -> pipe:[19812062]
l-wx------ 1 user atlas 64 Jun 27 07:21 /proc/3299064/fd/5 -> pipe:[19812062]

It seems that reading from the pipe always stalls the DAGMan process for
2 seconds before it submits jobs.
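
To check whether such pauses occur throughout a longer trace, the gaps between consecutive timestamps can be extracted automatically. The following is a minimal sketch, assuming the "PID HH:MM:SS.microseconds syscall" layout produced by strace -tt -f and a trace that does not cross midnight; strace_gaps is a hypothetical helper name.

```shell
# Print strace -tt lines that were preceded by a pause of at least
# min_gap seconds (default 1).
# Usage: strace_gaps TRACE_FILE [MIN_GAP_SECONDS]
strace_gaps() {
    awk -v min_gap="${2:-1}" '
        # Only consider lines whose second field is a -tt timestamp.
        $2 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\.[0-9]+$/ {
            split($2, t, ":")
            now = t[1] * 3600 + t[2] * 60 + t[3]
            if (seen && now - prev >= min_gap)
                printf "%.3f s before: %s\n", now - prev, $0
            prev = now
            seen = 1
        }
    ' "$1"
}
```

With the trace written to a file (strace accepts -o for this), calling strace_gaps trace.log 1 lists every call that followed a pause of a second or more.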

Might this be the cause of the low submission rate? We would like to
increase this rate to fill our empty slots better.

If it helps we can increase the debug flags.

condor_version
$CondorVersion: 8.6.3 May 08 2017 BuildID: 404928 $
$CondorPlatform: x86_64_Debian8 $

(condor_)gathered information can be found here:
https://wolke7.aei.mpg.de/index.php/s/7mabqyRw5QXSU3v/download

Cheers,
Henning