Re: [Condor-users] jobs stuck in queue

On Monday, 22 August 2011, at 15:16:28, Koller, Garrett wrote:
> Mr. Cannini,
> Oh, I think I'm beginning to see the problem.  Look at the StartLog and note
> the authentication errors:
> > 08/19/11 17:21:30 PERMISSION DENIED to unauthenticated@unmapped from host
> > for command 442 (REQUEST_CLAIM), access level DAEMON:
> > reason: DAEMON authorization policy contains no matching ALLOW entry for
> > this request; identifiers used
> The "unauthenticated@unmapped" part means that you simply do not have
> authentication configured correctly.  First of all, what forms of
> authentication are you trying to use?  Run 'condor_config_val -v
> SEC_CLIENT_AUTHENTICATION_METHODS' and 'condor_config_val -v
> SEC_DEFAULT_AUTHENTICATION_METHODS' to find out. The typical forms are FS,
> FS_REMOTE, and PASSWORD.  To learn more about how they work, look at
> http://servo.cs.wlu.edu/dokuwiki/doku.php/condor/administration/authentication .
> Look at http://condor.cs.wlu.edu/condor/config/condor_config_global
> for an example Condor configuration that uses authentication (Ctrl-F and
> search for "Authentication").
> Once you have authentication correctly configured, the authentication will
> allow daemons to identify themselves to Condor as "<username>@<hostname>".
>  If Condor runs as the user 'condor' (or as 'root' pretending to be
> 'condor') on the computer 'condor.cs.wlu.edu', for example, then that
> means that you need to add "condor@xxxxxxxxxxxxxxxxx" to the ALLOW_DAEMON
> configuration variable to let the daemons communicate.
> Does this make sense?  If so, does this help?


I've temporarily disabled authentication and negotiation completely on both
the master and the nodes, and voilà, the "PERMISSION DENIED" messages stopped.

Now I've found the following messages in the master's SchedLog:
08/23/11 18:35:25 (pid:8028) Inserting new attribute Scheduler into non-active cluster cid=41 acid=-1
08/23/11 18:37:45 (pid:8028) Attempting to chown '/var/spool/condor/41/0/cluster41.proc0.subproc0' from 1000 to 0.0, but the path was unexpectedly owned by 104

The daemons on both the master and the nodes are configured to run as
root:root, so this ownership conflict seems strange. For reference, 104 is
the uid of the 'condor' user.

Also, the StartLog on node1 contains this odd message:
08/23/11 19:27:15 WARNING: /var/lib/condor/execute root-squashed or not condor-owned: requiring world-writability

But the owner and permissions of '/var/lib/condor/execute/' are
'condor:condor drwx-rwx-rwt'. Weird, huh?
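
To double-check the ownership and mode bits independently of 'ls', a small
script along these lines could be run on node1. This is only a sketch: the
helper name describe_dir is made up for illustration, the default path is the
one from the warning above, and it does not reproduce whatever test Condor
itself performs. It only reports the owner, group, mode string, and whether
the world-writable and sticky bits are set.

```python
import grp
import os
import pwd
import stat
import sys


def describe_dir(path):
    """Report owner, group, and mode details for path (illustrative helper)."""
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)  # uid has no passwd entry; fall back to the number
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)  # gid has no group entry; fall back to the number
    return {
        "owner": owner,
        "group": group,
        "mode": stat.filemode(st.st_mode),
        "world_writable": bool(st.st_mode & stat.S_IWOTH),
        "sticky": bool(st.st_mode & stat.S_ISVTX),
    }


if __name__ == "__main__":
    # Paths may be given on the command line; the default is the execute
    # directory named in the StartLog warning.
    paths = sys.argv[1:] or ["/var/lib/condor/execute"]
    for p in paths:
        try:
            print(p, describe_dir(p))
        except OSError as err:
            print(p, "could not be inspected:", err)
```

The same script can be pointed at the spool path from the SchedLog message to
see which uid actually owns it.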

And still, any job that I submit using "universe = parallel" keeps getting
stuck in the queue.