Re: [Condor-users] Condor job submission delayed
- Date: Thu, 02 Sep 2004 10:29:21 -0500
- From: Alain Roy <roy@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Condor job submission delayed
Marc Saric wrote: (I'm collating multiple emails.)
Submission of jobs works in principle (tested it with the
hello-world examples from http://www.liv.ac.uk/e-science/condor/hello.html),
but sometimes I observe strange behaviour in that certain jobs need a
very long time until they are being executed.
Around 30 min.
That is definitely too long, unless you have thousands of jobs and
computers, and even then it's too long.
045.000: Run analysis summary. Of 12 machines,
~ 1 are rejected by your job's requirements
~ 6 reject your job because of their own requirements
~ 0 match, but are serving users with a better priority in the pool
~ 4 match, match, but reject the job for unknown reasons
That message is unfortunately pretty undescriptive. Yes, we need to improve
it. There are a few common things that this may indicate:
1) The negotiator is currently matching jobs to machines, and this is a
transitory state. I don't think this is the case for you.
2) You have a worse priority than another user on the system. Again, I
don't think this is the case.
3) Ummm... Something else. A lot of things can go wrong, and it's really
hard to properly figure them out from condor_q.
Given that your jobs eventually match and you have a small, mostly idle
pool, this message seems odd.
| Delays of 300 seconds can occur if the new job is submitted while Condor
| is within the 20-second frame of the negotiation cycle. You can start the
| job by using condor_reschedule.
| You can reduce the time by lowering the NEGOTIATOR_INTERVAL value, but
| the 20-second time frame is fixed, so for a 60-second interval you have
| a 33% chance that your job must wait up to a minute.
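The 33% figure quoted above is just the ratio of the fixed 20-second window to the negotiation interval. A minimal sketch of that arithmetic follows; it assumes submissions arrive at a uniformly random point in the interval, and the function name `blocked_fraction` is made up for illustration (it is not part of Condor).

```python
def blocked_fraction(interval_s: float, window_s: float = 20.0) -> float:
    """Fraction of a negotiation interval covered by the fixed window.

    Assumption: a job is equally likely to be submitted at any point in
    the interval, so the chance of landing inside the window is simply
    window / interval (capped at 1).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return min(window_s / interval_s, 1.0)


# Compare the default five-minute interval with a 60-second one.
for interval in (300, 60):
    print(f"{interval:>3} s interval: {blocked_fraction(interval):.1%} chance of hitting the window")
```

With the default 300-second interval the same window covers roughly 7% of the cycle, so shortening the interval reduces the worst-case wait but raises the chance of landing in the window.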
That's clear to me. I did not expect the queued jobs to be executed
within a few seconds, but (recalling my first mail) it was unclear to me
why sometimes (not always) jobs don't get executed for 10-30 minutes
while "condor_status" lists a lot of machines (all of them Windows in my
case) as Unclaimed/Idle during that period.
The negotiator will do a matchmaking cycle every five minutes, unless you
submit a job with condor_submit: then it will start a new cycle, unless the
last cycle began less than twenty seconds previously. If you are submitting
individual test jobs, then jobs should be matched quickly. Not instantly:
it always takes them a while to start up, but certainly much quicker than
30 minutes. I think that NEGOTIATOR_INTERVAL won't help you here.
Here are a couple of plausible things we should look at:
1) Sometimes a job does match quickly and get submitted to a computer, but
as soon as it starts up, an error is encountered and the job dies. Unless
you do condor_q at exactly the right moment, it seems to always be idle,
but it was in the running state for a brief moment.
You can see if this is the case: look in the user log for the job
(specified with the "log = X" in your submit file) and see if it says the
job was started up multiple times. If it was, then we need to figure out
why it died when starting up. If this is the case, please share your user
log. If not, try #2:
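If the user log is long, the check in #1 can be scripted. The sketch below counts how many times each job has an "executing" event. It assumes the plain-text user log format in which every event starts with a three-digit event code followed by the job ID in parentheses, with code 001 marking "Job executing on host". Verify this against your own log before relying on it. The function name and the script's command-line usage are illustrative only.

```python
import re
import sys
from collections import Counter

# Assumed event header shape: "001 (045.000.000) 09/02 10:29:21 Job executing on host: ..."
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)")
EXECUTE_CODE = "001"  # assumption: 001 is the "job executing" event


def count_executions(log_text: str) -> Counter:
    """Return a Counter mapping 'cluster.proc' to the number of execute events."""
    starts = Counter()
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match and match.group(1) == EXECUTE_CODE:
            starts[f"{match.group(2)}.{match.group(3)}"] += 1
    return starts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_starts.py path/to/user.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for job, n in sorted(count_executions(handle.read()).items()):
            note = "  <-- started more than once" if n > 1 else ""
            print(f"{job}: {n} execute event(s){note}")
```

A job that shows more than one execute event was started and then went back to idle at least once, which is the situation described in #1.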
2) Try this: submit a new job and see what time you submitted it. Let it
sit idle for a few minutes, then do "condor_reschedule" to force another
matchmaking cycle. Now go to the computer that is running the negotiator,
look in the NegotiatorLog and the MatchLog, beginning at the time that you
submitted the job. Share this portion of the logs with me and/or the
mailing list (I don't want the entire log--just from the time you started
the test). This may indicate why there is a problem.
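To cut a log down to the part that matters, something like the following may help. It is only a sketch: it assumes each entry in the NegotiatorLog and MatchLog begins with a "month/day hour:minute:second" stamp (for example "9/2 10:29:21"), that lines without a stamp belong to the entry before them, and that the excerpt does not span a change of year. The function name `lines_since` is made up for this example.

```python
import re
from datetime import datetime

# Assumed line prefix: "M/D HH:MM:SS " with no year.
STAMP_RE = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{1,2}):(\d{2}):(\d{2}) ")


def lines_since(lines, start: datetime):
    """Return the lines stamped at or after `start`.

    Unstamped lines are treated as continuations and kept or dropped
    together with the most recent stamped line.
    """
    key = (start.month, start.day, start.hour, start.minute, start.second)
    keep = False
    selected = []
    for line in lines:
        match = STAMP_RE.match(line)
        if match:
            keep = tuple(int(group) for group in match.groups()) >= key
        if keep:
            selected.append(line)
    return selected
```

For example, `lines_since(open(path).read().splitlines(), datetime(2004, 9, 2, 10, 29, 21))` would return everything logged from the moment of submission onward, where `path` is wherever your negotiator's log lives.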
3) I've seen an intermittent problem where people submit a single job and it
doesn't get matched, but if they submit multiple jobs it does work. This
happened in rare circumstances, and I'm pretty sure it was fixed. That said,
try submitting two or three jobs, and see if your jobs run.
I suspect that #1 or #2 will help us narrow down the problem. If not, we
can figure out something else that will help.