
Re: [Condor-users] restricting the number of jobs



On Mon, 12 Oct 2009, Dillman, Kimberley A wrote:

> I think this is because of "maxidle" versus "maxjobs".

> I saw a similar thing when I used a small number for "maxidle", like 3.

> I had a DAG of 10 jobs and tried to limit "maxidle" to 3. Since it appears that DAGMan submits 5 jobs at a time, and a job isn't counted as "idle" until it's actually submitted, I still got 5 jobs rather than 3 -- but at least I didn't get all 10.

You can control the maximum number of submits per cycle with the
DAGMAN_MAX_SUBMITS_PER_INTERVAL configuration variable. As you surmise, the default for this is five. If you're trying for really fine-grained control of the number of jobs in the queue, you may want to set a lower value.
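For example, a minimal condor_config sketch (the variable is the one named above; the value of 1 is just for illustration):

  # Throttle DAGMan to 1 submit per scheduling cycle
  # instead of the default of 5
  DAGMAN_MAX_SUBMITS_PER_INTERVAL = 1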

> However, if I use "maxjobs" set to 3, I get just 3 submitted at a time, since DAGMan can check maxjobs prior to submitting the jobs.

> At least I think that's how it works, because that's what I interpreted from the documentation and what I saw happen with my 10-job test DAG.

Yes, that's correct -- DAGMan doesn't consider jobs idle until they're submitted. One thing to keep in mind is that maxidle is kind of a rough setting. Jobs that are running can become idle, so you're not guaranteed to never exceed the maxidle setting. DAGMan never *removes* jobs from the queue to try to maintain the maxidle setting -- it just throttles how fast it submits them.

> However, the documentation appears to indicate that "maxjobs" only counts each node in the DAG, and doesn't count the individual jobs within a single submit file (i.e., "queue 500" would count as 1 job). It does indicate that "maxidle" will throttle a single DAG node with many individual jobs (i.e., "queue xxx"), since it counts the jobs AFTER they are submitted and therefore sees them individually. Kind of confusing, but that is why "maxidle" doesn't appear to hold to the exact number: DAGMan submits jobs in groups of x (5 in my case, though I'm not sure where that number comes from yet), and "maxidle" doesn't take effect to suppress additional submissions until after the jobs are submitted.

Yes, that's right. Just keep in mind that maxjobs is a "harder" limit than maxidle is.
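For reference, both throttles can be given on the condor_submit_dag command line (the DAG file name here is just a placeholder):

  # Hard limit: never more than 3 node jobs in the queue at once
  condor_submit_dag -maxjobs 3 mydag.dag

  # Soft limit: stop submitting new node jobs once 3 are idle
  condor_submit_dag -maxidle 3 mydag.dag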

> Kind of confusing, but it seems to work okay, especially if you keep "maxidle" greater than whatever the per-cycle submission number is for a DAG.

> If anyone can explain how this works in more detail, I would love to hear about it, to save some time on experimentation to figure it out. :-)

Okay, here's an explanation that hopefully makes some sense...

One thing to keep in mind with both maxidle and maxjobs is that they were
really designed to work in the situation where each submit file queues a single job. DAGMan's ability to even handle submit files that queue more than one job is a pretty recent addition. Also, DAGMan can only throttle things at the level of a submit file -- so if you have submit files that queue 10 jobs, the smallest increment that DAGMan can work with is 10 jobs...
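For instance, a node submit file along these lines (all file and program names are hypothetical) counts as one node for maxjobs, but as 10 individual jobs for maxidle:

  # node.sub -- a single DAG node that queues 10 jobs
  executable = myprog
  arguments  = $(Process)
  log        = node.log
  queue 10

So with maxidle set to 3, DAGMan would still submit all 10 of these jobs in one shot, because it can't submit a fraction of a submit file.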

The difference in how maxjobs and maxidle count things is more an artifact of how the implementation works for multiple-job submit files than a design decision.

So the background on maxidle is this: we had implemented maxjobs, but a user with a big pool, shared among a number of users, wanted something more flexible than maxjobs. The problem was, they couldn't really predict ahead of time what maxjobs should be set to, because it depended on the load that other users were putting on the pool. So eventually someone said, "how about if DAGMan keeps submitting jobs until the jobs aren't getting run?" and that was how the idea of maxidle was born. The idea is that you set maxidle to some fraction of the number of machines in your pool, and DAGMan will keep feeding jobs in to keep the pool fully utilized, without flooding the queue with lots of jobs that won't run for a long time. (We have users running 500k-node DAGs, so they really don't want to submit, say, 100k jobs in one shot.)

So if you want a strict limit on the number of jobs running, you should use maxjobs, but if you want to maximize the utilization of your pool, you should use maxidle.

Kent Wenger
Condor Team