
[condor-users] Speeding up condor_submit (was Speeding up DAGman submits)




We are submitting a small DAG with very few jobs to DAGman (test jobs such as ls) and noticed that it takes about 20-25 seconds for the CONDOR job that DAGman submits to the queue to go from the idle to the running state.

I changed the subject of the message because the issue is with condor_submit, not with DAGMan. (By the way, it's Condor, not CONDOR--it's not an acronym.)


I've tried to figure out which condor_config file options affect this time (by reading the comments in condor_config and the manual section defining all the parameters), but haven't had much luck.

There is no magical "speed up Condor" option. :)


First, let me recommend reading a short post made to condor-users a while ago by Doug Thain:

http://www.cs.wisc.edu/~lists/archive/condor-users/msg00919.html

In part, he says:

Please keep in mind that Condor is a high *throughput* system designed to
execute large workloads over long time periods.  It is *not* designed to be
a low latency system that executes a single job quickly.  Condor performs a
large number of expensive operations in order to maximize scalability and
reliability at the expense of latency.

Take this to heart. Condor is targeted at high throughput, not high performance. Condor is not tuned to start up jobs in seconds. If you need reliability and scalability, Condor is a good match.


1) using the 'test job feature' for fast turnaround time. Can this be applied to DAGman jobs?

What test job feature are you referring to?


2) Computing on Demand.

You're right--this is not a good match with DAGMan.


Here are some factors that affect the speed of starting up a job:

1) Are there lots of jobs or computers? Deciding where a job should run requires matchmaking. The more jobs or computers you have, the slower this process is. It is possible to speed up this process in some cases when you have lots of jobs in your queue. Holler if this is related.

2) The matchmaking cycle runs every five minutes, except when jobs are submitted. When you submit a job, Condor will start a new matchmaking cycle as soon as it can (it may already be in the middle of one) unless it started a cycle within the last 20 (25?) seconds. That minimum delay is tunable, but the point is that matchmaking doesn't happen constantly.

3) The time to actually start up your job. This can be affected by all the usual suspects: your network, the computer, how much data needs to be transferred to the computer (do you transfer files?), the speed of shared file systems like NFS, etc.
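
The timing rule in point 2 can be sketched in a few lines of Python. This is only a toy model, not Condor code: the function name is made up for illustration, the 20-second default is the uncertain figure quoted above, and the model ignores the case where a cycle is already in progress.

```python
def matchmaking_wait(seconds_since_last_cycle, min_delay=20.0):
    """Seconds a freshly submitted job waits before a new cycle may start.

    Hypothetical helper: min_delay stands in for the tunable minimum gap
    between matchmaking cycles (assumed here to be 20 seconds).
    """
    # If the last cycle started long enough ago, a new one can start now;
    # otherwise the job waits out the remainder of the minimum delay.
    return max(0.0, min_delay - seconds_since_last_cycle)


if __name__ == "__main__":
    for elapsed in (0.0, 5.0, 60.0):
        wait = matchmaking_wait(elapsed)
        print(f"last cycle {elapsed:5.1f}s ago -> wait {wait:4.1f}s")
```

In a DAG, each node is submitted only after its parents finish, so under this model a wait of up to the minimum delay could be paid again at every level of the DAG.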

In general, we recommend running a few large jobs rather than many small jobs. You will get better throughput, and the bumps in performance (like 30-second startup times) won't matter so much.
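
To put rough numbers on that recommendation, here is a small back-of-the-envelope calculation in Python. The 30-second startup figure comes from the sentence above; the one hour of total work and the job counts are assumptions picked purely for illustration.

```python
def overhead_fraction(startup_seconds, run_seconds):
    """Fraction of a job's wall-clock time spent starting up, not computing."""
    return startup_seconds / (startup_seconds + run_seconds)


STARTUP = 30.0       # startup latency figure from the message
TOTAL_WORK = 3600.0  # assumed: one hour of computation in total

if __name__ == "__main__":
    # Split the same total work into more and more (smaller) jobs.
    for n_jobs in (1, 10, 360):
        per_job = TOTAL_WORK / n_jobs
        frac = overhead_fraction(STARTUP, per_job)
        print(f"{n_jobs:4d} jobs of {per_job:7.1f}s each: {frac:.1%} overhead")
```

With 10-second jobs, three quarters of each job's wall-clock time goes to startup; with a single one-hour job the same 30 seconds is under one percent.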

If you really need interactive startup, then COD is the way to go, but COD doesn't mix with DAGMan well.

Some collaborators at Technion in Israel have been working on low latency invocation in Condor. I'm not sure of the status of their work, but you might want to talk to them if it's important enough to you.

http://dsl.cs.technion.ac.il/projects/gozal/

I hope this helps.

-alain

