
Re: [condor-users] Speeding up condor_submit (was Speeding up DAGman submits)




I asked:
What test job feature are you referring to?

And David Konerding responded:
Section 3.6 of the manual caught my eye (there is something wrong with the letter f in the manual PDF that causes it to be pasted weirdly):


Test-job Policy Example
This example shows how the default macros can be used to set up a machine for running test jobs
from a specific user. Suppose we want the machine to behave normally, except if user coltrane
submits a job. In that case, we want that job to start regardless of what is happening on the machine.
We do not want the job suspended, vacated or killed. This is reasonable if we know coltrane is
submitting very short running programs for testing purposes. The jobs should be executed right
away. This works with any machine (or the whole pool, for that matter) by adding the following 5
expressions to the existing configuration:
START = ($(START)) || Owner == "coltrane"
SUSPEND = ($(SUSPEND)) && Owner != "coltrane"
CONTINUE = $(CONTINUE)
PREEMPT = ($(PREEMPT)) && Owner != "coltrane"
KILL = $(KILL)

This will not speed up job submissions. It will help to ensure that a particular user can always submit to a computer and that his jobs will run without interruption.


Nevertheless, the observation I've made is this:

1) If I submit a DAG element (a specific .job file) with condor_submit, it runs nearly immediately (within a second). This is almost certainly due to the matchmaking cycle starting when I submit the job, and the job being matched quickly. That's working great.


2) If I submit the full DAG, which then submits the same .job file, that job sits idle for 20-25 seconds before reaching the run state.

Without looking at log files, it's hard to say exactly what is happening. Are you talking about running a simple DAG with just one job in it? Are you talking about the time it takes DAGMan to start a job after a previous one finishes?


There are a couple of relevant issues:

1) When DAGMan starts up, it looks at all of the submit files of the jobs to find out which log files they use. It also reads the complete DAG and creates a representation of it in memory. For small DAGs, these are very fast tasks. For large DAGs, you will notice a lag while these happen before DAGMan submits the first job.

2) DAGMan only checks the logs every five seconds. So when a job finishes, it may be several seconds before DAGMan notices that it finished. That doesn't account for 20-25 seconds, but it may be part of the delay.
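
As a rough illustration of point 2, here is a minimal Python sketch of how a fixed five-second polling loop delays event detection. The function name and the assumption that polls happen at exact multiples of the interval are made up for illustration and are not taken from DAGMan's source.

```python
import math

POLL_INTERVAL = 5.0  # seconds; the log-check interval described above


def detection_delay(event_time, poll_interval=POLL_INTERVAL):
    """Seconds between an event being logged and the next poll seeing it.

    Assumes (hypothetically) that polls occur at t = 0, 5, 10, ... seconds.
    """
    next_poll = math.ceil(event_time / poll_interval) * poll_interval
    return next_poll - event_time


# An event logged just after a poll waits almost the full interval;
# one logged just before a poll is seen almost immediately.
for t in (0.2, 2.5, 4.9):
    print(f"event at {t:.1f}s -> noticed {detection_delay(t):.1f}s later")
```

Under these assumptions a single event is noticed at most five seconds late, which is why polling alone cannot explain a 20-25 second gap.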

Working from this observation and the observation in #1 above, I suspect that when DAGMan submits the .job file, it does not invoke a new matchmaking cycle. I have never seen it take more than 20-25 seconds, so I don't think the negotiator time interval of 300 seconds is an issue here.

If DAGMan didn't start a new matchmaking cycle, then you would see wait times that were much longer than 25 seconds.
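
To put a number on that, here is a back-of-the-envelope sketch in Python. It assumes, hypothetically, that negotiation happens only on the fixed 300-second interval mentioned above and that a job arrives at a uniformly random point in the cycle; a real pool will not be this tidy.

```python
NEGOTIATOR_INTERVAL = 300.0  # seconds, the interval quoted above
OBSERVED_MAX_WAIT = 25.0     # seconds, the longest wait reported


def mean_wait(interval):
    """Average wait for the next cycle given a uniformly random arrival."""
    return interval / 2.0


def prob_all_waits_below(limit, interval, n_jobs):
    """Chance that n independent jobs all wait no longer than `limit`."""
    return (limit / interval) ** n_jobs


print(mean_wait(NEGOTIATOR_INTERVAL))
print(prob_all_waits_below(OBSERVED_MAX_WAIT, NEGOTIATOR_INTERVAL, 5))
```

Under those assumptions the average wait would be 150 seconds, and the chance of five jobs in a row all starting within 25 seconds would be roughly 4 in a million.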


3) I assume that file transfer only happens once the job is running, not when it is listed as idle. If that's not the case, then I suspect that (along with several other aspects that are noted on the mailing list to affect job startup time) could shave a short amount of time off the job start.

I'm not sure if the file transfer happens during the Idle or Running phase.


At this point it sounds like I need to do a bit more peering at the log files in real time as well as running strace on the Condor daemons to see what
time interval they are providing to select().

If you send me the log files (don't send them to the whole list) I will be happy to look at them and see if I can understand why there is the difference.


One more thing that I noticed in the Condor manual is that DAGMan jobs are submitted to the scheduler universe and thus always run immediately on the local machine. It seems that I should be able to make my .job file submit to the scheduler universe and see no time delay between DAGMan submitting the job and it running.

Right, if you don't mind running on the local computer.


Yet, there is still a five second interval between submission and execution (that probably explains the 5 second component of the 25 seconds).

5/11 09:33:32 submitting: condor_submit -a 'dag_node_name = A' -a '+DAGManJobID = 244.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' ls.job 2>&1
5/11 09:33:32 assigned Condor ID (245.0.0)
5/11 09:33:32 Just submitted 1 job this cycle...
5/11 09:33:32 Event: ULOG_SUBMIT for Condor Job A (245.0.0)
5/11 09:33:32 Of 2 nodes total:
5/11 09:33:32  Done   Pre   Queued   Post   Ready   Un-Ready   Failed
5/11 09:33:32   ===   ===      ===    ===     ===        ===      ===
5/11 09:33:32     0     0        1      0       0          1        0
5/11 09:33:37 Event: ULOG_EXECUTE for Condor Job A (245.0.0)
5/11 09:33:37 Event: ULOG_JOB_TERMINATED for Condor Job A (245.0.0)

DAGMan only checks the logs every five seconds. This value is not configurable, though I suppose we could make it so. You should look in the user job log for slightly more accurate timings.
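
For the dagman.out excerpt above, the gap can be measured with a short script. This is only a sketch in Python, assuming lines of the form shown (`M/D HH:MM:SS Event: ULOG_... for Condor Job ...`); the function name is made up for illustration, and the timestamps reflect when DAGMan noticed each event, not when the event actually happened.

```python
import re
from datetime import datetime

# Matches lines like:
# 5/11 09:33:32 Event: ULOG_SUBMIT for Condor Job A (245.0.0)
EVENT_RE = re.compile(
    r"^(\d+/\d+ \d+:\d+:\d+) Event: (ULOG_\w+) for Condor Job (\S+)"
)


def event_times(lines):
    """Map (node, event) -> first time DAGMan reported that event."""
    times = {}
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if m:
            stamp, event, node = m.groups()
            # The log has no year; a placeholder year (an assumption)
            # is prepended so the timestamp parses unambiguously.
            when = datetime.strptime("2000 " + stamp, "%Y %m/%d %H:%M:%S")
            times.setdefault((node, event), when)
    return times


sample = [
    "5/11 09:33:32 Event: ULOG_SUBMIT for Condor Job A (245.0.0)",
    "5/11 09:33:37 Event: ULOG_EXECUTE for Condor Job A (245.0.0)",
    "5/11 09:33:37 Event: ULOG_JOB_TERMINATED for Condor Job A (245.0.0)",
]
t = event_times(sample)
gap = t[("A", "ULOG_EXECUTE")] - t[("A", "ULOG_SUBMIT")]
print(gap.total_seconds())  # 5.0 for the excerpt above
```

Because both events carry the time of the poll that saw them, the five-second figure is an upper bound set by the polling interval rather than a measurement of the job's real start latency.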


-alain


Condor Support Information: http://www.cs.wisc.edu/condor/condor-support/ To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with unsubscribe condor-users <your_email_address>