
Re: [HTCondor-users] Feature request: non-integer DAGMAN_SUBMIT_DELAY



> What is actually causing the trouble for Lustre?
>
> 1. Submission of jobs, or
> 2. Running of Dagman Pre scripts on the submit nodes, or
> 3. Running of the actual jobs on the execute nodes ?

> Actually, it's #1. It looks like Lustre has trouble handling large
> bursts of stat, open, etc. system calls that occur when a dagman submits
> a large number (1k+) of jobs at once whose submit files and logs are
> stored on a Lustre filesystem if its metadata server is already under
> high load. When Lustre starts to stutter, the entire submit machine gets
> slow, condor_q times out, etc. The regular condor_submit can cause
> similar problems, but that happens less often, for reasons I am not
> totally sure about.

Hmm.  Unless you've changed the default config settings, the maximum rate of submission from a single DAGMan is 5 submits every 5 seconds (see DAGMAN_USER_LOG_SCAN_INTERVAL and DAGMAN_MAX_SUBMITS_PER_INTERVAL).

Are your submit files generating multi-job clusters?  If so, a non-zero submit delay isn't going to help you, because the delay is only between condor_submit invocations.

If you've changed DAGMAN_USER_LOG_SCAN_INTERVAL and/or DAGMAN_MAX_SUBMITS_PER_INTERVAL, you might try tweaking those back toward the default settings (5 seconds and 5 submits per interval).
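
For reference, here is a rough sketch of what those settings would look like if written out explicitly in the submit machine's configuration. The values are just the defaults quoted above (at most 5 submits per 5-second interval, so roughly one condor_submit per second per DAGMan); which config file they belong in at your site is an assumption you'd need to check.

```
# Sketch only: defaults as described in this message, not tuned values.
# How often (in seconds) DAGMan scans its log and submits ready jobs.
DAGMAN_USER_LOG_SCAN_INTERVAL = 5
# Maximum number of condor_submit invocations per scan interval.
DAGMAN_MAX_SUBMITS_PER_INTERVAL = 5
# Delay (whole seconds) between successive condor_submit invocations.
DAGMAN_SUBMIT_DELAY = 0
```

Running condor_config_val DAGMAN_MAX_SUBMITS_PER_INTERVAL (and likewise for the other names) on the submit machine should show what is actually in effect.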

Kent Wenger
CHTC Team