
Re: [Condor-users] jobs stuck in queue



Mr. Cannini,

Oh, I think I'm beginning to see the problem.  Look at the StartLog and note the authentication errors:
> 08/19/11 17:21:30 PERMISSION DENIED to unauthenticated@unmapped from host
> 172.17.8.121 for command 442 (REQUEST_CLAIM), access level DAEMON: reason:
> DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used

The "unauthenticated@unmapped" part means that the request was never authenticated, so Condor could not map it to a real identity.  In other words, authentication is not configured correctly.  First of all, what forms of authentication are you trying to use?  Run 'condor_config_val -v SEC_CLIENT_AUTHENTICATION_METHODS' and 'condor_config_val -v SEC_DEFAULT_AUTHENTICATION_METHODS' to find out.
The typical forms are FS, FS_REMOTE, and PASSWORD.  To learn more about how they work, look at http://servo.cs.wlu.edu/dokuwiki/doku.php/condor/administration/authentication .  Look at http://condor.cs.wlu.edu/condor/config/condor_config_global for an example Condor configuration that uses authentication (Ctrl-F and search for "Authentication").
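If the StartLog is long, it can help to summarize which identities and hosts are being refused.  Here is a minimal Python sketch that does that for lines shaped like the one quoted above.  The names DENIED and summarize_denials are my own and not part of Condor, and the pattern assumes each log entry sits on a single line (as it does in the log file itself, unlike the wrapped copy in this email).

```python
import re
from collections import Counter

# Matches StartLog entries of the form quoted in this thread, e.g.
#   ... PERMISSION DENIED to <identity> from host <host>
#   for command <number> (<NAME>), access level <LEVEL>: reason: ...
# Assumption: the whole entry is on one line in the log file.
DENIED = re.compile(
    r"PERMISSION DENIED to (?P<identity>\S+) from host (?P<host>\S+) "
    r"for command (?P<number>\d+) \((?P<command>\w+)\), "
    r"access level (?P<level>\w+)"
)


def summarize_denials(lines):
    """Count denials per (identity, host, command, access level)."""
    counts = Counter()
    for line in lines:
        match = DENIED.search(line)
        if match:
            key = (
                match.group("identity"),
                match.group("host"),
                match.group("command"),
                match.group("level"),
            )
            counts[key] += 1
    return counts
```

To use it, open the StartLog on the execute node and pass the file object to summarize_denials(); every key whose identity is "unauthenticated@unmapped" points to a connection that was not authenticated at all.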

Once you have authentication correctly configured, the authentication will allow daemons to identify themselves to Condor as "<username>@<hostname>".  If Condor runs as the user 'condor' (or as 'root' pretending to be 'condor') on the computer 'condor.cs.wlu.edu', for example, then that means that you need to add "condor@xxxxxxxxxxxxxxxxx" to the ALLOW_DAEMON configuration variable to let the daemons communicate.

Does this make sense?  If so, does this help?

Best Regards,
 - Garrett Heath Koller
condor.cs.wlu.edu

________________________________________
From: condor-users-bounces@xxxxxxxxxxx [condor-users-bounces@xxxxxxxxxxx] on behalf of Fabricio Cannini [fcannini@xxxxxxxxx]
Sent: Monday, August 22, 2011 1:58 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] jobs stuck in queue

On Friday, 19 August 2011, at 19:09:54, Koller, Garrett wrote:
> Mr. Cannini,
>
> I'm not yet familiar with running MPI jobs on Condor, but I think I've come
> across a similar situation.  First of all, run 'condor_q -better-analyze'
> to figure out if the job's requirements are causing it to not be scheduled
> in the first place.  If it says "not yet considered by matchmaker" or
> something, it usually means that it is being run but encounters an error
> shortly thereafter and so is continuously put back on the queue.  Check
> the MatchLog.  If it keeps saying that the same job is "Matched", it means
> that the job successfully scheduled but something goes wrong with the
> execute machine.  Check which slot and what machine the job is assigned
> to.  Go to the log files of that machine and look for the StarterLog for
> that slot.  The bottom of that log should tell you what error your program
> encountered that caused it to exit.  Let me/us know if this doesn't help
> you diagnose and solve the problem.
>
> Best Regards,
>  - Garrett

Hi.

'condor_q -better-analyze 35' says this:

-- Submitter: master.internal.domain : <172.17.8.121:42584> :
master.internal.domain
===============================
---
035.000:  Run analysis summary.  Of 0 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      0 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 match but are currently offline
      0 are available to run your job

WARNING:  Be advised:
   No resources matched request's constraints

WARNING:  Be advised:   Request 35.0 did not match any resource's constraints
===============================


The StartLog of both nodes has messages like this:
+++++++++++++++++++++++++++++++
08/19/11 17:21:30 slot3: match_info called
08/19/11 17:21:30 slot3: Received match <172.17.8.51:56215>#1313779372#3#...
08/19/11 17:21:30 slot3: State change: match notification protocol successful
08/19/11 17:21:30 slot3: Changing state: Unclaimed -> Matched
08/19/11 17:21:30 PERMISSION DENIED to unauthenticated@unmapped from host
172.17.8.121 for command 442 (REQUEST_CLAIM), access level DAEMON: reason:
DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used
for this host: 172.17.8.121,master,master.internal.domain,internal.domain
08/19/11 17:21:51 slot4: match_info called
08/19/11 17:21:51 slot4: Received match <172.17.8.51:56215>#1313779372#4#...
08/19/11 17:21:51 slot4: State change: match notification protocol successful
08/19/11 17:21:51 slot4: Changing state: Unclaimed -> Matched
08/19/11 17:21:51 PERMISSION DENIED to unauthenticated@unmapped from host
172.17.8.121 for command 442 (REQUEST_CLAIM), access level DAEMON: reason:
cached result for DAEMON; see first case for the full reason
+++++++++++++++++++++++++++++++



TIA