Re: [Condor-users] jobs stuck in queue
- Date: Tue, 23 Aug 2011 20:09:28 -0300
- From: Fabricio Cannini <fcannini@xxxxxxxxx>
- Subject: Re: [Condor-users] jobs stuck in queue
On Monday, 22 August 2011, at 15:16:28, Koller, Garrett wrote:
> Mr. Cannini,
> Oh, I think I'm beginning to see the problem. Look at the StartLog and note
> the authentication errors:
> > 08/19/11 17:21:30 PERMISSION DENIED to unauthenticated@unmapped from host
> > 172.17.8.121 for command 442 (REQUEST_CLAIM), access level DAEMON:
> > reason: DAEMON authorization policy contains no matching ALLOW entry for
> > this request; identifiers used
> The "unauthenticated@unmapped" part means that you simply do not have
> authentication configured correctly. First of all, what forms of
> authentication are you trying to use? Run 'condor_config_val -v
> SEC_CLIENT_AUTHENTICATION_METHODS' and 'condor_config_val -v
> SEC_DEFAULT_AUTHENTICATION_METHODS' to find out. The typical forms are FS,
> FS_REMOTE, and PASSWORD. To learn more about how they work, look at
> tion . Look at http://condor.cs.wlu.edu/condor/config/condor_config_global
> for an example Condor configuration that uses authentication (Ctrl-F and
> search for "Authentication").
> Once you have authentication correctly configured, the authentication will
> allow daemons to identify themselves to Condor as "<username>@<hostname>".
> If Condor runs as the user 'condor' (or as 'root' pretending to be
> 'condor') on the computer 'condor.cs.wlu.edu', for example, then that
> means that you need to add "condor@xxxxxxxxxxxxxxxxx" to the ALLOW_DAEMON
> configuration variable to let the daemons communicate.
> Does this make sense? If so, does this help?
I've temporarily disabled authentication and negotiation completely on both
the master and the nodes with 'SEC_DEFAULT_AUTHENTICATION=NEVER' and
'SEC_DEFAULT_NEGOTIATION=NEVER', and voilà, the "PERMISSION DENIED" messages
stopped.
Now I've found the following messages in the master's SchedLog:
08/23/11 18:35:25 (pid:8028) Inserting new attribute Scheduler into non-active
cluster cid=41 acid=-1
08/23/11 18:37:45 (pid:8028) Attempting to chown
'/var/spool/condor/41/0/cluster41.proc0.subproc0' from 1000 to 0.0, but the
path was unexpectedly owned by 104
The daemons on both the master and the nodes are configured to run as
root:root, so this conflict seems strange. 104 is the 'condor' user id, btw.
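To double-check which uid and gid actually own a spool path like the one in the log, a small Python sketch along these lines could be used. It assumes Python 3 is available on the machine; the helper name `owner_of` is made up for illustration and is not part of Condor.

```python
import grp
import os
import pwd


def owner_of(path):
    """Return (uid, user name, gid, group name) for path.

    Uses lstat so a symlink is reported itself, not its target.
    Names are None when the id has no passwd/group entry.
    """
    st = os.lstat(path)
    try:
        user = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        user = None
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = None
    return st.st_uid, user, st.st_gid, group
```

Calling `owner_of('/var/spool/condor/41/0/cluster41.proc0.subproc0')` on the master would show whether the path is owned by uid 104 as the log says. It only reports what the filesystem returns; it says nothing about why the schedd expected uid 1000.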
Also, on node1 StartLog, there is this weird message:
08/23/11 19:27:15 WARNING: /var/lib/condor/execute root-squashed or not
condor-owned: requiring world-writability
But the ownership and permissions of '/var/lib/condor/execute/' are
'condor:condor drwx-rwx-rwt'. Weird, huh?
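To rule out a misreading of the permissions, here is a minimal Python sketch that reports the mode bits of a directory as the filesystem sees them. It assumes Python 3 on the node; `describe_mode` is a hypothetical helper name, and the check does not reproduce whatever test the startd itself performs, nor does it detect root-squashing.

```python
import os
import stat


def describe_mode(path):
    """Summarize owner ids and the permission bits relevant to the warning."""
    st = os.stat(path)
    mode = st.st_mode
    return {
        "mode_string": stat.filemode(mode),           # e.g. 'drwxrwxrwt'
        "world_writable": bool(mode & stat.S_IWOTH),  # "other" write bit
        "sticky": bool(mode & stat.S_ISVTX),          # sticky bit (the 't')
        "uid": st.st_uid,
        "gid": st.st_gid,
    }
```

Running `describe_mode('/var/lib/condor/execute')` on node1 would confirm whether the directory really is world-writable with the sticky bit set, and which numeric uid owns it.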
And still, any job that I submit using "universe = parallel" keeps getting
stuck in the queue.