
Re: [condor-users] Speeding up condor_submit (was Speeding up DAGman submits)

First, let me recommend reading a short post made to condor-users a while ago by Doug Thain:


In part, he says:

Please keep in mind that Condor is a high *throughput* system designed to
execute large workloads over long time periods. It is *not* designed to be
a low latency system that executes a single job quickly. Condor performs a
large number of expensive operations in order to maximize scalability and
reliability at the expense of latency.

Take this to heart. Condor is targeted at high throughput, not high performance. Condor is not tuned to start up jobs in seconds. If you need reliability and scalability, Condor is a good match.

Thanks for the response, Alain.

I'm actually quite familiar with the latency and throughput issues associated with batch computing systems. That said, my personal observations suggest that there is some improvement that could be made with respect to DAGman (see below).

1) using the 'test job feature' for fast turnaround time. Can this be applied to DAGman jobs?

What test job feature are you referring to?

Section 3.6 of the manual caught my eye (there is something wrong about the letter f in the manual PDF that causes it to be pasted weirdly):

Test-job Policy Example
This example shows how the default macros can be used to set up a machine for running test jobs
from a specific user. Suppose we want the machine to behave normally, except if user coltrane
submits a job. In that case, we want that job to start regardless of what is happening on the machine.
We do not want the job suspended, vacated or killed. This is reasonable if we know coltrane is
submitting very short running programs for testing purposes. The jobs should be executed right
away. This works with any machine (or the whole pool, for that matter) by adding the following 5
expressions to the existing configuration:
START = ($(START)) || Owner == "coltrane"
SUSPEND = ($(SUSPEND)) && Owner != "coltrane"
PREEMPT = ($(PREEMPT)) && Owner != "coltrane"

2) The matchmaking cycle runs every five minutes, except when jobs are submitted. When you submit a job, it will start a new matchmaking cycle as soon as it can (perhaps it's already in the middle of matchmaking) unless it started a matchmaking cycle within the last 20 (25?) seconds. This number is tunable, but the point is that matchmaking doesn't happen constantly.

OK, this is highly appropriate (I hadn't realized job submission started a new matchmaking cycle, although I should have picked that up from the manual).

Nevertheless, the observation I've made is this:

1) If I submit a DAG element (a specific .job file) with condor_submit, it runs nearly immediately (within a second). This is almost certainly due to a matchmaking cycle starting when I submit the job, and the job being matched quickly. That's working great.

2) If I submit the full DAG, which then submits the same .job file, that job sits idle for 20-25 seconds before reaching the run state.
Working from this observation and the observation in #1 above, I suspect that when DAGman submits the .job file, it does not invoke a new matchmaking cycle. I have never seen it take more than 20-25 seconds, so I don't think the negotiator time interval of 300 seconds is an issue here. Given that our cluster is small, and matchmaking probably doesn't take very long, would reducing the negotiator time interval to 1 make it likely that jobs would go from idle to running more quickly?

3) I assume that file transfer only happens once the job is running, not when it is listed as idle. If that's not the case, then I suspect that (along with several other aspects that are noted on the mailing list to affect job startup time) could shave a short amount of time off the job start.
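
To make the timing rule quoted earlier concrete (a submit triggers a new matchmaking cycle unless one started within the last 20 or so seconds), here is a toy model in Python. It is not Condor code. The function names and the min_gap parameter are made up for illustration, and it assumes that a submit arriving inside the window simply waits for the window to expire, which may not be exactly how the negotiator behaves.

```python
def next_cycle_start(submit_time, last_cycle_start, min_gap=20.0):
    """Toy model: earliest time a submit-triggered matchmaking cycle can begin.

    min_gap is the assumed minimum spacing between cycle starts (the
    "20 (25?) seconds" quoted above). The real negotiator may differ.
    """
    earliest_allowed = last_cycle_start + min_gap
    return max(submit_time, earliest_allowed)


def idle_wait(submit_time, last_cycle_start, min_gap=20.0):
    """Seconds a freshly submitted job waits before a cycle can even start."""
    return next_cycle_start(submit_time, last_cycle_start, min_gap) - submit_time


# A lone condor_submit long after the previous cycle: no extra wait.
print(idle_wait(submit_time=100.0, last_cycle_start=0.0))  # 0.0

# A submit 2 seconds after a cycle began: waits out the rest of the window.
print(idle_wait(submit_time=2.0, last_cycle_start=0.0))    # 18.0
```

Under this model, a job submitted shortly after a cycle began waits for most of the window, which is in the same range as the 20-25 second delay in #2. Whether that is what actually happens when DAGman submits a job is not established here.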

At this point it sounds like I need to do a bit more peering at the log files in real time as well as running strace on the Condor daemons to see what
time interval they are providing to select().

One more thing that I noticed in the Condor manual is that DAGman jobs are submitted to the scheduler universe and thus always run immediately on the local machine. It seems that I should be able to make my .job file submit to the scheduler universe and see no time delay between DAGman submitting the job and it running.

Yet, there is still a five second interval between submission and execution (that probably explains the 5 second component of the 25 seconds).

5/11 09:33:32 submitting: condor_submit -a 'dag_node_name = A' -a '+DAGManJobID = 244.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' ls.job 2>&1
5/11 09:33:32 assigned Condor ID (245.0.0)
5/11 09:33:32 Just submitted 1 job this cycle...
5/11 09:33:32 Event: ULOG_SUBMIT for Condor Job A (245.0.0)
5/11 09:33:32 Of 2 nodes total:
5/11 09:33:32   Done     Pre   Queued    Post   Ready   Un-Ready   Failed
5/11 09:33:32    ===     ===      ===     ===     ===        ===      ===
5/11 09:33:32      0       0        1       0       0          1        0
5/11 09:33:37 Event: ULOG_EXECUTE for Condor Job A (245.0.0)
5/11 09:33:37 Event: ULOG_JOB_TERMINATED for Condor Job A (245.0.0)
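
For anyone who wants to measure this gap over more than one job, a small Python sketch along these lines could pull the submit-to-execute interval out of log lines like the ones above. The regular expression is written against the line format shown in this excerpt only, and the function name and the year argument are made up for illustration (the log lines carry no year, so a placeholder is supplied just so the timestamps can be parsed).

```python
import re
from datetime import datetime

# Event lines copied from the excerpt above.
LOG = """\
5/11 09:33:32 Event: ULOG_SUBMIT for Condor Job A (245.0.0)
5/11 09:33:37 Event: ULOG_EXECUTE for Condor Job A (245.0.0)
5/11 09:33:37 Event: ULOG_JOB_TERMINATED for Condor Job A (245.0.0)
"""

# Captures: timestamp, event name, node name, Condor job ID.
EVENT_RE = re.compile(
    r"^(\d+/\d+ \d+:\d+:\d+) Event: (ULOG_\w+) for Condor Job (\S+) \(([\d.]+)\)"
)


def submit_to_execute_delays(text, year=2000):
    """Return {job_id: seconds between ULOG_SUBMIT and ULOG_EXECUTE lines}.

    year is a placeholder needed only for parsing; it does not affect
    differences between timestamps on the same day.
    """
    submitted = {}
    delays = {}
    for line in text.splitlines():
        m = EVENT_RE.match(line)
        if not m:
            continue
        stamp, event, _node, job_id = m.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %m/%d %H:%M:%S")
        if event == "ULOG_SUBMIT":
            submitted[job_id] = when
        elif event == "ULOG_EXECUTE" and job_id in submitted:
            delays[job_id] = (when - submitted[job_id]).total_seconds()
    return delays


print(submit_to_execute_delays(LOG))  # {'245.0.0': 5.0}
```

This only measures the gap between the timestamps on the log lines, which is what the excerpt shows; it says nothing about where inside Condor those five seconds are spent.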
