
Re: [Condor-users] jobs stuck in queue



On Monday, 22 August 2011, at 15:16:28, Koller, Garrett wrote:
> Mr. Cannini,
> 
> Oh, I think I'm beginning to see the problem.  Look at the StartLog and note
> the authentication errors:
> > 08/19/11 17:21:30 PERMISSION DENIED to unauthenticated@unmapped from host
> > 172.17.8.121 for command 442 (REQUEST_CLAIM), access level DAEMON:
> > reason: DAEMON authorization policy contains no matching ALLOW entry for
> > this request; identifiers used
> 
> The "unauthenticated@unmapped" part means that you simply do not have
> authentication configured correctly.  First of all, what forms of
> authentication are you trying to use?  Run 'condor_config_val -v
> SEC_CLIENT_AUTHENTICATION_METHODS' and 'condor_config_val -v
> SEC_DEFAULT_AUTHENTICATION_METHODS' to find out. The typical forms are FS,
> FS_REMOTE, and PASSWORD.  To learn more about how they work, look at
> http://servo.cs.wlu.edu/dokuwiki/doku.php/condor/administration/authentication .
> Look at http://condor.cs.wlu.edu/condor/config/condor_config_global
> for an example Condor configuration that uses authentication (Ctrl-F and
> search for "Authentication").
> 
> Once you have authentication correctly configured, the authentication will
> allow daemons to identify themselves to Condor as "<username>@<hostname>".
>  If Condor runs as the user 'condor' (or as 'root' pretending to be
> 'condor') on the computer 'condor.cs.wlu.edu', for example, then that
> means that you need to add "condor@xxxxxxxxxxxxxxxxx" to the ALLOW_DAEMON
> configuration variable to let the daemons communicate.
> 
> Does this make sense?  If so, does this help?

Hi.

I've temporarily disabled authentication and negotiation completely on both
the master and the nodes with 'SEC_DEFAULT_AUTHENTICATION=NEVER' and
'SEC_DEFAULT_NEGOTIATION=NEVER', and voilà, the "PERMISSION DENIED" messages
stopped.

Now I've found the following message in the master's SchedLog:
===============================
08/23/11 18:35:25 (pid:8028) Inserting new attribute Scheduler into non-active 
cluster cid=41 acid=-1
08/23/11 18:37:45 (pid:8028) Attempting to chown 
'/var/spool/condor/41/0/cluster41.proc0.subproc0' from 1000 to 0.0, but the 
path was unexpectedly owned by 104
===============================

The daemons on both the master and the nodes are configured to run as
root:root, so this conflict seems strange. 104 is the 'condor' user id, btw.
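
To double-check who really owns that spool entry, something like the
following could be run on the master. It is only a rough sketch: the helper
name describe_owner is made up for illustration, and the path is simply the
one from the SchedLog message above.

```python
import os
import pwd


def describe_owner(path):
    """Return (uid, gid, user name) for the owner of path."""
    st = os.stat(path)
    try:
        name = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        # The uid has no passwd entry on this machine.
        name = None
    return st.st_uid, st.st_gid, name


if __name__ == "__main__":
    # Path copied from the SchedLog message; adjust it for another job.
    path = "/var/spool/condor/41/0/cluster41.proc0.subproc0"
    if os.path.exists(path):
        print(describe_owner(path))
    else:
        print("no such path:", path)
```

If the first value printed is 104, the entry really is owned by 'condor',
as the log message says.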


Also, in node1's StartLog, there is this weird message:
===============================
08/23/11 19:27:15 WARNING: /var/lib/condor/execute root-squashed or not 
condor-owned: requiring world-writability
===============================

But the '/var/lib/condor/execute/' permissions are 'condor:condor drwx-rwx-rwt'.
Weird, huh?
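
Since the warning talks about world-writability, it may be worth looking at
the actual mode bits instead of trusting the 'ls' output. A small sketch
using Python's standard stat module; mode_report is a name invented here,
not anything from Condor, and the path is the one from the StartLog message.

```python
import os
import stat


def mode_report(mode):
    """Summarise the permission bits relevant to the StartLog warning."""
    return {
        "string": stat.filemode(mode),
        "world_writable": bool(mode & stat.S_IWOTH),
        "sticky": bool(mode & stat.S_ISVTX),
    }


if __name__ == "__main__":
    # Directory named in the StartLog warning.
    path = "/var/lib/condor/execute"
    if os.path.isdir(path):
        print(mode_report(os.stat(path).st_mode))
    else:
        print("no such directory:", path)
```

A directory with mode 1777 is reported as 'drwxrwxrwt' with both flags true.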

And still, any job that I submit using 'universe = parallel' keeps getting
stuck in the queue.

TIA