Re: [condor-users] Speeding up condor_submit (was Speeding up DAGman submits)
- Date: Fri, 14 May 2004 16:33:54 -0500
- From: Alain Roy <roy@xxxxxxxxxxx>
- Subject: Re: [condor-users] Speeding up condor_submit (was Speeding up DAGman submits)
What test job feature are you referring to?
And David Konerding responded:
Section 3.6 of the manual caught my eye (there is something wrong about
the letter f in the manual PDF that causes it to be pasted weirdly):
Test-job Policy Example
This example shows how the default macros can be used to set up a machine
for running test jobs from a specic user. Suppose we want the machine to
behave normally, except if user coltrane submits a job. In that case, we
want that job to start regardless of what is happening on the machine. We
do not want the job suspended, vacated or killed. This is reasonable if we
know coltrane is submitting very short running programs for testing
purposes. The jobs should be executed right away. This works with any
machine (or the whole pool, for that matter) by adding the following 5
expressions to the existing conguration:
START = ($(START)) || Owner == "coltrane"
SUSPEND = ($(SUSPEND)) && Owner != "coltrane"
CONTINUE = $(CONTINUE)
PREEMPT = ($(PREEMPT)) && Owner != "coltrane"
KILL = $(KILL)
This will not speed up job submissions. It will help to ensure that a
particular user can always submit to a computer and that his jobs will run.
Nevertheless, the observation I've made is this:
1) if I submit a DAG element (a specific .job file) with condor_submit, it
runs nearly immediately (within a second). This is almost certainly due
to the matchmaking cycling starting when I submit the job, and being
matched quickly. That's working great.
2) If I submit the full DAG, which then submits the same .job file, that
job sits idle for 20-25 seconds before reaching the run state.
Without looking at log files, it's hard to say exactly what is happening.
Are you talking about running a simple DAG with just one job in it? Are you
talking about the time it takes DAGMan to start a job after a previous one
finishes?
There are a couple of relevant issues:
1) When DAGMan starts up, it looks at all of the submit files of the jobs
to find out which log files they use. It also reads the complete DAG and
creates a representation of it in memory. For small DAGs, these are very
fast tasks. For large DAGs, you will notice a lag while these happen before
DAGMan submits the first job.
2) DAGMan only checks the logs every five seconds. So when a job finishes,
it may be several seconds before DAGMan notices that it finished. That
doesn't account for 20-25 seconds, but it may be part of the delay.
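
To put a number on the second point, here is a minimal Python sketch of the
polling arithmetic. It assumes, purely for illustration, that the log checks
fall on exact multiples of the five-second interval; the function name is made
up and nothing here is taken from DAGMan's actual implementation.

```python
import math

POLL_INTERVAL = 5.0  # seconds; the log-check interval described above


def notice_delay(event_time, poll_interval=POLL_INTERVAL):
    """Return how long an event waits before the next poll sees it.

    Assumes (hypothetically) that polls happen at t = 0, 5, 10, ... and
    that an event landing exactly on a poll is seen immediately.
    """
    next_poll = math.ceil(event_time / poll_interval) * poll_interval
    return next_poll - event_time


if __name__ == "__main__":
    for t in (0.0, 0.1, 2.5, 4.9, 5.0):
        print(f"event at {t:4.1f}s -> noticed after {notice_delay(t):.1f}s")
```

Under that assumption the added wait is always less than one interval, which
is why polling alone can only explain a few seconds of the delay.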
Working from this observation and the observation in #1 above, I suspect
that when DAGMan submits the .job file, it does not invoke a new
matchmaking cycle. I have never seen it take more than 20-25 seconds, so I
don't think the negotiator time interval of 300 seconds is an issue here.
If DAGMan didn't start a new matchmaking cycle, then you would see wait
times that were much longer than 25 seconds.
3) I assume that file transfer only happens once the job is running, not
when it is listed as idle. If that's not the case, then I suspect that
(along with several other aspects that are noted on the mailing list to
affect job startup time) could shave a short amount of time off the job start.
I'm not sure if the file transfer happens during the Idle or Running phase.
At this point it sounds like I need to do a bit more peering at the log
files in real time as well as running strace on the Condor daemons to see what
time interval they are providing to select().
If you send me the log files (don't send them to the whole list) I will be
happy to look at them and see if I can understand why there is the difference.
One more thing I noticed in the Condor manual is that DAGMan jobs
are submitted to the scheduler universe and thus always run immediately on
the local machine. It seems that I should be able to make my .job file
submit to the scheduler universe and see no time delay between DAGMan
submitting the job and it running.
Right, if you don't mind running on the local computer.
Yet, there is still a five second interval between submission and
execution (that probably explains the 5 second component of the 25 seconds).
5/11 09:33:32 submitting: condor_submit -a 'dag_node_name = A' -a
= 244.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' ls.job 2>&1
5/11 09:33:32 assigned Condor ID (245.0.0)
5/11 09:33:32 Just submitted 1 job this cycle...
5/11 09:33:32 Event: ULOG_SUBMIT for Condor Job A (245.0.0)
5/11 09:33:32 Of 2 nodes total:
5/11 09:33:32  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
5/11 09:33:32   ===     ===      ===     ===     ===        ===      ===
5/11 09:33:32     0       0        1       0       0          1        0
5/11 09:33:37 Event: ULOG_EXECUTE for Condor Job A (245.0.0)
5/11 09:33:37 Event: ULOG_JOB_TERMINATED for Condor Job A (245.0.0)
DAGMan only checks the logs every five seconds. This value is not
configurable, though I suppose we could do that. You should look in the
user job log for slightly more accurate timings.
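
For what it's worth, the gap in the DAGMan log excerpt above can be read off
mechanically. The following is a rough Python sketch that parses only the
"Event:" lines exactly as they appear in that excerpt. The helper names and
the `year` parameter are illustrative assumptions (the log lines carry no
year), and the timestamps are when DAGMan noticed each event, not when the
event actually occurred.

```python
import re
from datetime import datetime

# Lines copied from the DAGMan log excerpt quoted above.
SAMPLE = """\
5/11 09:33:32 Event: ULOG_SUBMIT for Condor Job A (245.0.0)
5/11 09:33:37 Event: ULOG_EXECUTE for Condor Job A (245.0.0)
5/11 09:33:37 Event: ULOG_JOB_TERMINATED for Condor Job A (245.0.0)
"""

# Timestamp, event name, and DAG node name from an "Event:" line.
EVENT_RE = re.compile(
    r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) Event: (ULOG_\w+) for Condor Job (\S+)"
)


def event_times(text, year=2004):
    """Map (node, event) -> datetime for each 'Event:' line in text.

    The year is supplied by the caller because the log omits it.
    """
    times = {}
    for line in text.splitlines():
        m = EVENT_RE.match(line)
        if m:
            stamp, event, node = m.groups()
            when = datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")
            times[(node, event)] = when
    return times


def submit_to_execute_seconds(text, node):
    """Seconds between the logged submit and execute events for a node."""
    times = event_times(text)
    delta = times[(node, "ULOG_EXECUTE")] - times[(node, "ULOG_SUBMIT")]
    return delta.total_seconds()


if __name__ == "__main__":
    print(submit_to_execute_seconds(SAMPLE, "A"))  # prints 5.0
```

Because these are the times at which DAGMan saw the events, the same
calculation on the user job log should give a tighter estimate.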