
Re: [Condor-users] jobs stuck in queue



On Friday, 19 August 2011, at 19:09:54, Koller, Garrett wrote:
> Mr. Cannini,
> 
> I'm not yet familiar with running MPI jobs on Condor, but I think I've come
> across a similar situation.  First of all, run 'condor_q -better-analyze'
> to figure out if the job's requirements are causing it to not be scheduled
> in the first place.  If it says "not yet considered by matchmaker" or
> something, it usually means that it is being run but encounters an error
> shortly thereafter and so is continuously put back on the queue.  Check
> the MatchLog.  If it keeps saying that the same job is "Matched", it means
> that the job successfully scheduled but something goes wrong with the
> execute machine.  Check which slot and what machine the job is assigned
> to.  Go to the log files of that machine and look for the StarterLog for
> that slot.  The bottom of that log should tell you what error your program
> encountered that caused it to exit.  Let me/us know if this doesn't help
> you diagnose and solve the problem.
> 
> Best Regards,
>  - Garrett

Hi.

'condor_q -better-analyze 35' says this:

-- Submitter: master.internal.domain : <172.17.8.121:42584> : master.internal.domain
===============================
---
035.000:  Run analysis summary.  Of 0 machines,
      0 are rejected by your job's requirements 
      0 reject your job because of their own requirements 
      0 match but are serving users with a better priority in the pool 
      0 match but reject the job for unknown reasons 
      0 match but will not currently preempt their existing job 
      0 match but are currently offline 
      0 are available to run your job

WARNING:  Be advised:
   No resources matched request's constraints

WARNING:  Be advised:   Request 35.0 did not match any resource's constraints
===============================


The StartLog of both nodes has messages like this:
+++++++++++++++++++++++++++++++
08/19/11 17:21:30 slot3: match_info called
08/19/11 17:21:30 slot3: Received match <172.17.8.51:56215>#1313779372#3#...
08/19/11 17:21:30 slot3: State change: match notification protocol successful
08/19/11 17:21:30 slot3: Changing state: Unclaimed -> Matched
08/19/11 17:21:30 PERMISSION DENIED to unauthenticated@unmapped from host 
172.17.8.121 for command 442 (REQUEST_CLAIM), access level DAEMON: reason: 
DAEMON authorization policy contains no matching ALLOW entry for this request; 
identifiers used for this host: 
172.17.8.121,master,master.internal.domain,internal.domain
08/19/11 17:21:51 slot4: match_info called
08/19/11 17:21:51 slot4: Received match <172.17.8.51:56215>#1313779372#4#...
08/19/11 17:21:51 slot4: State change: match notification protocol successful
08/19/11 17:21:51 slot4: Changing state: Unclaimed -> Matched
08/19/11 17:21:51 PERMISSION DENIED to unauthenticated@unmapped from host 
172.17.8.121 for command 442 (REQUEST_CLAIM), access level DAEMON: reason: 
cached result for DAEMON; see first case for the full reason
+++++++++++++++++++++++++++++++
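
In case it helps anyone comparing logs across nodes, the denied requests can be pulled out of a StartLog mechanically. This is only a minimal Python sketch, assuming the line format shown in the excerpt; the names `DENIED`, `find_denials` and `SAMPLE` are made up for illustration and are not part of Condor.

```python
import re

# Matches the "PERMISSION DENIED" entries as they appear in the StartLog
# excerpt. The field layout is an assumption based on that excerpt only.
DENIED = re.compile(
    r"PERMISSION DENIED to (?P<user>\S+) from host\s+(?P<host>\S+)\s+"
    r"for command (?P<cmd_id>\d+) \((?P<cmd_name>\w+)\), "
    r"access level (?P<level>\w+)"
)


def find_denials(log_text):
    """Return one dict per denied request found in log_text."""
    # Pasted logs are often wrapped by the mail client, so collapse all
    # runs of whitespace (including newlines) before matching.
    flat = " ".join(log_text.split())
    return [m.groupdict() for m in DENIED.finditer(flat)]


# Two entries copied from the excerpt, wrapped the way the mailer wrapped them.
SAMPLE = """\
08/19/11 17:21:30 PERMISSION DENIED to unauthenticated@unmapped from host
172.17.8.121 for command 442 (REQUEST_CLAIM), access level DAEMON: reason:
DAEMON authorization policy contains no matching ALLOW entry for this request
08/19/11 17:21:51 PERMISSION DENIED to unauthenticated@unmapped from host
172.17.8.121 for command 442 (REQUEST_CLAIM), access level DAEMON: reason:
cached result for DAEMON; see first case for the full reason
"""

if __name__ == "__main__":
    for entry in find_denials(SAMPLE):
        print(entry["host"], entry["cmd_name"], entry["level"], entry["user"])
```

Run against the excerpt, every denial names the same host (172.17.8.121), the same command (REQUEST_CLAIM) and the same access level (DAEMON), so the refusals all concern claim requests coming from the master.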



TIA