
Re: [Condor-users] jobs wait in idle mode unecessarily



On Mon, Jun 21, 2004 at 12:38:15PM +0100, Dr Ian C. Smith wrote:
> It's a vanilla job and the file permissions are OK (it's
> under win 2k). Also there are no nice user options
> specified. Unfortunately I can't seem to reproduce it at
> the moment but I'm getting a similar possibly related
> problem that killed jobs hang around in the idle state.
> 

What do you mean "killed jobs hang around in the idle state"? 

> C:\Condor\ics>condor_q -analyze
> -- Submitter: 102153-71130c.liv.ac.uk : <138.253.102.153:1042> : 102153-71130c.liv.ac.uk
> ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
> ---
> 186.000:  Run analysis summary.  Of 2 machines,
>      1 are rejected by your job's requirements
>      0 reject your job because of their own requirements
>      0 match, but are serving users with a better priority in the pool
>      1 match, but prefer another specific job despite its worse user-priority
>      0 match, but will not currently preempt their existing job
>      0 are available to run your job
>        Last successful match: Mon Jun 21 12:31:39 2004
> 
> 1 jobs; 1 idle, 0 running, 0 held
> 
> This from SchedLog looks pertinent:
> 
> 6/21 12:22:09 DaemonCore: Command received via TCP from host <138.253.102.153:2309>
> 6/21 12:22:09 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
> 6/21 12:22:09 Got VACATE_SERVICE from <138.253.102.153:2309>
> 6/21 12:22:09 Sent RELEASE_CLAIM to startd on <138.253.102.153:1041>
> 6/21 12:22:09 Match record (<138.253.102.153:1041>, 183, 0) deleted
> 6/21 12:22:09 DaemonCore: Command received via UDP from host <138.253.102.153:2311>
> 6/21 12:22:09 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
> 6/21 12:22:09 Scheduler::Relinquish - mrec is NULL, can't relinquish
> 6/21 12:22:09 Null parameter --- match not deleted
> 

It is only a snippet, and not enough to tell us anything. 

To debug this, the first question to ask is "does this job ever match?", i.e.
does Condor ever even try to start the job. It seems from the above that
it does, so condor_q -analyze isn't going to tell us anything more. 
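
As a rough illustration of that first check, here is a minimal Python sketch
that pulls the "Last successful match" line out of saved condor_q -analyze
output. The sample text is copied from the output quoted above; the function
names are made up for this example, and treating a missing line as "no
recorded match" is an assumption about the output format, which may differ
between Condor versions.

```python
import re

# Sample text copied from the condor_q -analyze output quoted in this thread.
SAMPLE = """\
186.000:  Run analysis summary.  Of 2 machines,
     1 are rejected by your job's requirements
     0 reject your job because of their own requirements
     0 are available to run your job
       Last successful match: Mon Jun 21 12:31:39 2004
"""


def last_successful_match(analyze_text):
    """Return the timestamp text after 'Last successful match:', or None."""
    m = re.search(r"Last successful match:\s*(.+)", analyze_text)
    return m.group(1).strip() if m else None


def has_ever_matched(analyze_text):
    """True if the analysis text records at least one successful match.

    Assumption: no 'Last successful match' line means no match was recorded.
    """
    return last_successful_match(analyze_text) is not None
```

With the sample above, last_successful_match(SAMPLE) gives back the timestamp
string from the quoted output. If a match is recorded but the job still sits
idle, the logs are the next place to look.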

What would help the most would be:

1. The full schedd log
2. The shadow log
3. The job log file (i.e. the file that you set with 'log = somelogfile.log' in
   your submit file) 
4. The starter log from the execute machine. 

It would also be handy to have the full output of 'condor_q -l' and 
'condor_status -l'
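
If it helps, the two command outputs can be captured to files with something
like the Python sketch below. It only runs the two commands named above
('condor_q -l' and 'condor_status -l'); the output file names, the directory
name and the helper functions are hypothetical, and the daemon logs still
have to be copied by hand from wherever your installation keeps them.

```python
import subprocess
from pathlib import Path

# The two commands named in this message; the file names are hypothetical.
COMMANDS = {
    "condor_q_long.txt": ["condor_q", "-l"],
    "condor_status_long.txt": ["condor_status", "-l"],
}


def run_command(argv):
    """Run one command and return its combined stdout/stderr as text."""
    result = subprocess.run(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return result.stdout


def collect(out_dir, commands=COMMANDS, runner=run_command):
    """Write each command's output to its own file under out_dir.

    'runner' is injectable so the function can be tried without Condor.
    Returns the list of paths written, in the order of 'commands'.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, argv in commands.items():
        path = out_dir / filename
        path.write_text(runner(argv))
        written.append(path)
    return written
```

Calling collect("condor_diagnostics") on the submit machine should leave two
text files in that directory, ready to attach to a reply.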


<...>
> 
> >>      1 match, but prefer another specific job despite its worse
> >>user-priority
> >
> >I think there are quite a number of things that cause this.
> >

Indeed. In 6.6.6, we've changed this error message to be more (less?) 
helpful: it will now say "1 match, but reject the job for unknown reasons".
Now at least it won't send you off on a wild goose chase. 

-Erik