[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] Condor 6.8.n: job scheduling process delays



I am afraid this is a bit lower on specifics than I'd normally
put into a mailing list posting but I'd b happy to dig a little
deeper if it is worth it.

I am having a bit of trouble understanding the process by which,
and  for how long it takes, a job gets to the point of execution.

It's a approx 900 2-cpu winxp execution grid with RHEL master.

Sadly, we are still running 6.8.6 from 2007 but there is
an upgrade planned (albeit within a glacial timescale) so
if this can all be explained away with a simple "I wouldn't
start from that version" line, then that's fine - but it'd
still be good to know that.

There's been very little tweaking of the underlying parameters
beyond what comes in the vanilla distribution config files.

There are ~17500 jobs in the system, of which condor_q reports 304
R(unning) and all those jobs (17500) are owned by the same user.


We are quite often seeing the

Failed to fetch ads from.

messages when running a condor_q, which some interweb thing postings
I have found suggest are tied in to "scaling" issues with the
6.8.x series, hence my willingness to be told we're way behind
the times.

Ok,

The other user's jobs take about 20 minutes to complete so there
should be free slots that my job could jump into coming along all
the time, and looking at the user priorities, I think I was still
at 0.5 whilst the other user was up in the 600s, so I figure I am
well ahead of the game on any fair-use scheduling scale,

I submit a little .BAT file test job.

My job, 51752, takes about an hour-and-a-half to get onto a machine
even though its condor_q ST shows R(unning) more or less from the
moment it is queued.

In the MatchLog, I can see five matches, including one effectively at the
point of queueing and there are a similar five entries in the NegotiatorLog,
timed (more for illustration than meant mean anything) as follows:

13:12
13:18
13:24
13:30
14:12


In the SchedLog, I can see four entries matching the above entries
that state:

Match record (<IP.AD.DR.ESS:port>, 51752, 0) deleted

13:12
13:18
13:24
13:45  (OK, bit of a delay from the 13:30 in Match and Negotiator logs)


and at

14:04

match for job 51752.0 was deleted - not forking a shadow

Finally, the entries denoting (in my reading of things) the job
actually running appear, at around

14:41

Starting add_shadow_birthdate(51752.0)
Started shadow for job 51752.0 on "<IP.AD.DR.ESS:port>", (shadow pid = 1842)

14:45

Shadow pid 1842 for job 51752.0 exited with status 100
match (<IP.AD.DR.ESS:port>#1253973945#393#...) out of jobs (cluster id
51752); relinquishing
Match record (<IP.AD.DR.ESS:port>, 51752, -1) deleted


Meanwhile in the ShadowLog we get these at

14:45

Initializing a VANILLA shadow for job 51752.0
(51752.0) (1842): Request to run on <IP.AD.DR.ESS:port> was ACCEPTED


Am I missing anything in following this job through the logs (basically
grep'ing for the job id 51752 in the log directory) or in my understanding
of condor basics, because the way I read things, I got five matches yet
my job not only did not start until the fifth but it took half-an-hour
to go from that match to the running ?

Any pointers (even to the relevant User/Admin guide pages) welcome

-- 
Kevin M. Buckley                                  Room:  CO327
School of Engineering and                         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand