[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] Condor 6.8.n: job scheduling process delays

Kevin -

You may want to check the StartLog on the machine in question. It appears that there may be some network issues between the shadow and starter.

When the shadow returned 100, I believe that is the OS errno.

For Linux that is:
#define ENETDOWN        100     /* Network is down */

Also, if "<IP.AD.DR.ESS:port>" is the actual value in the logs then you likely have some condor_config issues (again check your execute node), which I believe could lend to the afore mentioned error.

Hope this helps,

On 10/12/2009 09:51 PM, Kevin.Buckley@xxxxxxxxxxxxx wrote:

I am afraid this is a bit lower on specifics than I'd normally
put into a mailing list posting but I'd b happy to dig a little
deeper if it is worth it.

I am having a bit of trouble understanding the process by which,
and  for how long it takes, a job gets to the point of execution.

It's a approx 900 2-cpu winxp execution grid with RHEL master.

Sadly, we are still running 6.8.6 from 2007 but there is
an upgrade planned (albeit within a glacial timescale) so
if this can all be explained away with a simple "I wouldn't
start from that version" line, then that's fine - but it'd
still be good to know that.

There's been very little tweaking of the underlying parameters
beyond what comes in the vanilla distribution config files.

There are ~17500 jobs in the system, of which condor_q reports 304
R(unning) and all those jobs (17500) are owned by the same user.

We are quite often seeing the

Failed to fetch ads from.

messages when running a condor_q, which some interweb thing postings
I have found suggest are tied in to "scaling" issues with the
6.8.x series, hence my willingness to be told we're way behind
the times.


The other user's jobs take about 20 minutes to complete so there
should be free slots that my job could jump into coming along all
the time, and looking at the user priorities, I think I was still
at 0.5 whilst the other user was up in the 600s, so I figure I am
well ahead of the game on any fair-use scheduling scale,

I submit a little .BAT file test job.

My job, 51752, takes about an hour-and-a-half to get onto a machine
even though its condor_q ST shows R(unning) more or less from the
moment it is queued.

In the MatchLog, I can see five matches, including one effectively at the
point of queueing and there are a similar five entries in the NegotiatorLog,
timed (more for illustration than meant mean anything) as follows:


In the SchedLog, I can see four entries matching the above entries
that state:

Match record (<IP.AD.DR.ESS:port>, 51752, 0) deleted

13:45  (OK, bit of a delay from the 13:30 in Match and Negotiator logs)

and at


match for job 51752.0 was deleted - not forking a shadow

Finally, the entries denoting (in my reading of things) the job
actually running appear, at around


Starting add_shadow_birthdate(51752.0)
Started shadow for job 51752.0 on "<IP.AD.DR.ESS:port>", (shadow pid = 1842)


Shadow pid 1842 for job 51752.0 exited with status 100
match (<IP.AD.DR.ESS:port>#1253973945#393#...) out of jobs (cluster id
51752); relinquishing
Match record (<IP.AD.DR.ESS:port>, 51752, -1) deleted

Meanwhile in the ShadowLog we get these at


Initializing a VANILLA shadow for job 51752.0
(51752.0) (1842): Request to run on<IP.AD.DR.ESS:port>  was ACCEPTED

Am I missing anything in following this job through the logs (basically
grep'ing for the job id 51752 in the log directory) or in my understanding
of condor basics, because the way I read things, I got five matches yet
my job not only did not start until the fifth but it took half-an-hour
to go from that match to the running ?

Any pointers (even to the relevant User/Admin guide pages) welcome