RE: [Condor-users] jobs wait in idle mode unecessarily
- Date: Wed, 23 Jun 2004 11:25:56 +0800
- From: "Raymond Wong" <RaymondWong@xxxxxxxxxxxxxxxxxx>
- Subject: RE: [Condor-users] jobs wait in idle mode unecessarily
I have encountered a similar problem too. I noticed it especially when
submitting jobs from my central manager (which is an XP PC running Condor
6.6.1). However, when you mention that the job takes ages to start, does
it start up ultimately? In my case, submitted jobs always miss a
negotiation cycle and get matched 5 min later (in the next cycle).
Anyway, noticed something really bad in your schedd log:
6/21 12:22:09 Scheduler::Relinquish - mrec is NULL, can't relinquish
6/21 12:22:09 Null parameter --- match not deleted
I think this implies that the schedd on your host has crashed! You may
want to check if the job has been successfully submitted for negotiation
in the first place!
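To see how often those two messages turn up in the rest of the log, a small filter along these lines might help. It is only a sketch: it assumes every log line looks like the excerpt ("M/D HH:MM:SS message"), the two substrings are copied from the lines quoted above, and the function name is made up for illustration.

```python
import re

# Assumed line layout, taken from the excerpt above: "M/D HH:MM:SS message".
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (.*)$")

# Substrings copied from the two suspicious lines quoted above.
SUSPECT = ("mrec is NULL", "match not deleted")


def find_suspect_lines(lines):
    """Return (timestamp, message) pairs for lines containing a suspect substring."""
    hits = []
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if m and any(s in m.group(2) for s in SUSPECT):
            hits.append((m.group(1), m.group(2)))
    return hits


# Sample lines copied from the SchedLog excerpt in this thread.
sample = [
    "6/21 12:22:09 Match record (<18.104.22.168:1041>, 183, 0) deleted",
    "6/21 12:22:09 Scheduler::Relinquish - mrec is NULL, can't relinquish",
    "6/21 12:22:09 Null parameter --- match not deleted",
]

for stamp, msg in find_suspect_lines(sample):
    print(stamp, msg)
```

To run it over a real log, pass an open file object instead of the sample list; where the schedd log lives depends on your installation.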
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
Sent: Wednesday, June 23, 2004 2:44 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] jobs wait in idle mode unecessarily
On Mon, Jun 21, 2004 at 12:38:15PM +0100, Dr Ian C. Smith wrote:
> It's a vanilla job and the file permissions are OK (it's under win
> 2k). Also there are no nice user options specified. Unfortunately I
> can't seem to reproduce it at the moment but I'm getting a similar
> possibly related problem that killed jobs hang around in the idle
What do you mean "killed jobs hang around in the idle state"?
> C:\Condor\ics>condor_q -analyze
> -- Submitter: 102153-71130c.liv.ac.uk : <188.8.131.52:1042> :
> ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
> 186.000: Run analysis summary. Of 2 machines,
> 1 are rejected by your job's requirements
> 0 reject your job because of their own requirements
> 0 match, but are serving users with a better priority in the pool
> 1 match, but prefer another specific job despite its worse
> 0 match, but will not currently preempt their existing job
> 0 are available to run your job
> Last successful match: Mon Jun 21 12:31:39 2004
> 1 jobs; 1 idle, 0 running, 0 held
> This from SchedLog looks pertinent:
> 6/21 12:22:09 DaemonCore: Command received via TCP from host
> 6/21 12:22:09 DaemonCore: received command 443 (VACATE_SERVICE),
> handler (vacate_service)
> 6/21 12:22:09 Got VACATE_SERVICE from <184.108.40.206:2309>
> 6/21 12:22:09 Sent RELEASE_CLAIM to startd on <220.127.116.11:1041>
> 6/21 12:22:09 Match record (<18.104.22.168:1041>, 183, 0) deleted
> 6/21 12:22:09 DaemonCore: Command received via UDP from host
> 6/21 12:22:09 DaemonCore: received command 60001 (DC_PROCESSEXIT),
> handler (HandleProcessExitCommand())
> 6/21 12:22:09 Scheduler::Relinquish - mrec is NULL, can't relinquish
> 6/21 12:22:09 Null parameter --- match not deleted
It is only a snippet, and not enough to tell us anything.
To debug this, the first question to ask is "does this job ever match?",
i.e. does Condor ever even try to start the job? It seems from the above that
it does - so condor_q -analyze isn't going to tell us anything more.
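For anyone who wants to check the "does it ever match?" question mechanically, here is a rough sketch that tallies the summary lines from the analysis quoted above. It assumes the summary keeps the "<count> <description>" layout shown in this thread; the function names are hypothetical and nothing here is part of Condor itself.

```python
import re

# A summary line is assumed to be "<count> <description>", where the
# description starts with "are", "reject" or "match", as in the output above.
SUMMARY_RE = re.compile(r"^\s*(\d+)\s+((?:are|reject|match)\b.*)$")


def parse_summary(lines):
    """Map each summary description to its machine count."""
    counts = {}
    for raw in lines:
        text = raw.lstrip("> ").rstrip()  # drop mail quoting, if present
        m = SUMMARY_RE.match(text)
        if m:
            counts[m.group(2)] = int(m.group(1))
    return counts


def machines_that_match(counts):
    """Count machines that match the job, whether or not they will run it now."""
    return sum(
        n for desc, n in counts.items()
        if desc.startswith("match") or desc.startswith("are available")
    )


# Lines copied from the analysis quoted above (truncation left as-is).
sample = [
    "> 186.000: Run analysis summary. Of 2 machines,",
    "> 1 are rejected by your job's requirements",
    "> 0 reject your job because of their own requirements",
    "> 0 match, but are serving users with a better priority in the pool",
    "> 1 match, but prefer another specific job despite its worse",
    "> 0 match, but will not currently preempt their existing job",
    "> 0 are available to run your job",
    "> 1 jobs; 1 idle, 0 running, 0 held",
]

counts = parse_summary(sample)
print(machines_that_match(counts))
```

A non-zero result only says that some machine matches the job; it says nothing about why the job is not started, which is where the logs come in.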
What would help the most would be:
1. The full schedd log
2. The shadow log
3. The job log file (ie the file that you set with 'log =
your submit file)
4. The starterlog from the execute machine.
It would also be handy to have the full output of 'condor_q -l' and
> >> 1 match, but prefer another specific job despite its worse
> >I think there are quite a number of things that cause this.
Indeed - in 6.6.6, we've changed this error message to be more (less?)
helpful - it now will say "1 match, but reject the job for unknown
Now at least it won't send you off on a wild goose chase.