Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

RE: [Condor-users] jobs wait in idle mode unecessarily

Date: Wed, 23 Jun 2004 11:25:56 +0800
From: "Raymond Wong" <RaymondWong@xxxxxxxxxxxxxxxxxx>
Subject: RE: [Condor-users] jobs wait in idle mode unecessarily

Hi,

I've encountered a similar problem too, and noticed it especially when
submitting jobs from my central manager (an XP PC running Condor
6.6.1). However, when you mention that the jobs take ages to start, do
they start up ultimately? In my case, submitted jobs always miss a
negotiation cycle and get matched 5 min later (in the next cycle).

Anyway, noticed something really bad in your schedd log:

6/21 12:22:09 Scheduler::Relinquish - mrec is NULL, can't relinquish
6/21 12:22:09 Null parameter --- match not deleted

I think this implies that the schedd on your host has crashed! You may
want to check if the job has been successfully submitted for negotiation
in the first place!

Raymond Wong
System Engineer
DID: 7358
Pager: 98028590



-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
Sent: Wednesday, June 23, 2004 2:44 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] jobs wait in idle mode unecessarily


On Mon, Jun 21, 2004 at 12:38:15PM +0100, Dr Ian C. Smith wrote:
> It's a vanilla job and the file permissions are OK (it's under win 
> 2k). Also there are no nice user options specified. Unfortunately I 
> can't seem to reproduce it at the moment but I'm getting a similar 
> possibly related problem that killed jobs hang around in the idle 
> state.
> 

What do you mean "killed jobs hang around in the idle state"? 

> C:\Condor\ics>condor_q -analyze
> -- Submitter: 102153-71130c.liv.ac.uk : <138.253.102.153:1042> : 102153-71130c.liv.ac.uk
> ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
> ---
> 186.000:  Run analysis summary.  Of 2 machines,
>      1 are rejected by your job's requirements
>      0 reject your job because of their own requirements
>      0 match, but are serving users with a better priority in the pool
>      1 match, but prefer another specific job despite its worse user-priority
>      0 match, but will not currently preempt their existing job
>      0 are available to run your job
>        Last successful match: Mon Jun 21 12:31:39 2004
> 
> 1 jobs; 1 idle, 0 running, 0 held
> 
> This from SchedLog looks pertinent:
> 
> 6/21 12:22:09 DaemonCore: Command received via TCP from host <138.253.102.153:2309>
> 6/21 12:22:09 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
> 6/21 12:22:09 Got VACATE_SERVICE from <138.253.102.153:2309>
> 6/21 12:22:09 Sent RELEASE_CLAIM to startd on <138.253.102.153:1041>
> 6/21 12:22:09 Match record (<138.253.102.153:1041>, 183, 0) deleted
> 6/21 12:22:09 DaemonCore: Command received via UDP from host <138.253.102.153:2311>
> 6/21 12:22:09 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
> 6/21 12:22:09 Scheduler::Relinquish - mrec is NULL, can't relinquish
> 6/21 12:22:09 Null parameter --- match not deleted
> 

It is only a snippet, and not enough to tell us anything. 

To debug this, the first question to ask is "does this job ever match?",
i.e. does Condor ever even try to start the job. It seems from the above
that it does, so condor_q -analyze isn't going to tell us anything more.
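
As a rough illustration of that first check, here is a minimal Python
sketch that pulls the per-category machine counts and the "Last
successful match" line out of saved 'condor_q -analyze' text. The
function name and regular expressions are my own and are only based on
the output format quoted above; other Condor versions may word these
lines differently.

```python
import re

# Patterns are assumptions based on the 'condor_q -analyze' output quoted
# in this thread; they are not an official description of the format.
QUOTE = re.compile(r"^\s*(?:>\s?)+")  # mail quote markers such as "> "
COUNT = re.compile(r"^\s*(\d+)\s+((?:are|reject|match)\b.*\S)\s*$")
LAST = re.compile(r"Last successful match:\s*(.*\S)")


def summarize_analyze(text):
    """Return (counts, last_match) parsed from 'condor_q -analyze' text.

    counts maps each summary description to its machine count;
    last_match is the timestamp string, or None if the line is absent.
    """
    counts, last_match = {}, None
    for raw in text.splitlines():
        line = QUOTE.sub("", raw)  # tolerate text pasted from an email
        m = COUNT.match(line)
        if m:
            counts[m.group(2)] = int(m.group(1))
            continue
        m = LAST.search(line)
        if m:
            last_match = m.group(1)
    return counts, last_match
```

If last_match comes back as something other than None, the job has
matched at least once, and the daemon logs are the next place to look.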

What would help the most would be:

1. The full schedd log
2. The shadow log
3. The job log file (ie the file that you set with
   'log = somelogfile.log' in your submit file)
4. The StarterLog from the execute machine.

It would also be handy to have the full output of 'condor_q -l' and 
'condor_status -l'
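
If it helps, the items above could be gathered with something like the
following Python sketch. It is only an outline under assumptions: the
file names SchedLog, ShadowLog and StarterLog are the usual defaults
but your configuration may differ, the log directory is passed in by
you, and the helper names are hypothetical. The StarterLog lives on the
execute machine, so it only gets picked up when the script runs there,
and the job log has to be added by hand since its name comes from your
submit file.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed default daemon log names; adjust to match your configuration.
LOG_NAMES = ("SchedLog", "ShadowLog", "StarterLog")


def copy_logs(log_dir, dest, names=LOG_NAMES):
    """Copy the named logs from log_dir into dest; return names not found."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    missing = []
    for name in names:
        src = Path(log_dir) / name
        if src.is_file():
            shutil.copy2(src, dest / name)
        else:
            missing.append(name)
    return missing


def save_command_output(argv, out_path):
    """Run a command, write its stdout to out_path, return the exit code."""
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    Path(out_path).write_text(result.stdout)
    return result.returncode


def collect(log_dir, dest):
    """Gather the daemon logs plus 'condor_q -l' and 'condor_status -l'."""
    dest = Path(dest)
    missing = copy_logs(log_dir, dest)
    save_command_output(["condor_q", "-l"], dest / "condor_q-l.txt")
    save_command_output(["condor_status", "-l"], dest / "condor_status-l.txt")
    return missing
```

Calling collect() with your Condor log directory and an output folder
returns the list of logs it could not find, which tells you what still
needs to be fetched from the other machine.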


<...>
> 
> >>      1 match, but prefer another specific job despite its worse
> >>user-priority
> >
> >I think there are quite a number of things that cause this.
> >

Indeed - in 6.6.6, we've changed this error message to be more (less?)
helpful - it now will say "1 match, but reject the job for unknown
reasons". Now at least it won't send you off on a wild goose chase.

-Erik

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
http://lists.cs.wisc.edu/mailman/listinfo/condor-users
