
Re: [Condor-users] jobs fail to run, with "Warning: Found no submitters"



Thanks so much for the help.  I really appreciate it.

So I've tried all of these suggestions, and still I haven't been able to get a
job to run yet.

On Tue, Aug 16, 2005 at 04:41:44PM -0500, Zachary Miller wrote:
> On Tue, Aug 16, 2005 at 05:11:26PM -0400, Jamie Rollins wrote:
> > I think I've tracked the issue down to a permissions issue with the
> > submitting user/host.  It appears that the Collector sees submits on
> > the head node as coming from the outward-pointing IP address of the head
> > node, which it sees as an invalid host.  On the head node, the
> > outward-pointing interface has an address of 10.32.47.10, whereas the
> > interface that all of the cluster nodes are attached to has an address of
> > 10.0.0.1.  Here is a line from the CollectorLog on the head node:
> > 
> > 8/16 14:45:19 (Sending 15 ads in response to query)
> > 8/16 14:45:19 DaemonCore: PERMISSION DENIED to unknown user from host <10.32.47.10:45781> for command 10 (QUERY_STARTD_PVT_ADS)
> 
> in this case, you should change your HOSTALLOW_ settings in the config
> file to allow IPs from both inside and outside:
> 
> HOSTALLOW_READ = 10.32.47.10 10.0.0.*
> HOSTALLOW_WRITE = 10.32.47.10 10.0.0.*

I had tried this, but I don't see how this is the issue, since the value of "*"
for the HOSTALLOW variable should allow _any_ connections, right? In any event,
changing this variable didn't do anything.
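
To spell out my reasoning, here is a rough Python sketch of how I am picturing
the matching.  It uses plain glob matching, which is my assumption and not
necessarily how Condor evaluates HOSTALLOW; host_allowed is a made-up helper,
not anything from Condor itself.

```python
from fnmatch import fnmatch

def host_allowed(ip, allow_list):
    """True if ip matches any whitespace-separated glob pattern in allow_list.

    Assumption: simple shell-style globbing, used here only to illustrate the
    argument. Condor's real HOSTALLOW evaluation may differ.
    """
    return any(fnmatch(ip, pattern) for pattern in allow_list.split())

# Compare the catch-all value with the narrower list suggested in the reply.
for setting in ("*", "10.32.47.10 10.0.0.*"):
    print(setting, "->",
          host_allowed("10.32.47.10", setting),
          host_allowed("10.0.0.1", setting))
```

Under that assumption both settings admit both of the head node's addresses,
which is why I don't see how narrowing the list could help.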

> > it's very difficult to debug when the variables are not mentioned by default
> > and the manual doesn't mention anything about the requirements of the
> > presence of these variables, or what the assumed defaults are if they are not
> > present.  
> 
> sorry, this is entirely my fault.

eh, don't blame yourself.  But along these lines, are there other variables that
aren't in the config files that need to be included and set for the system to
work?  

I don't really have a good sense of how different the system I'm setting up is
from the standard Condor install.  I have a head node which all jobs should be
submitted from, and which has all user accounts.  Then I have diskless nodes
that will be dedicated execution machines, which have NO user accounts.
Therefore the execution machines do not know anything about the users that are
submitting jobs to the head node.

Does the default setup assume that machines in the pool have access to the same
user database?  If this is the case, I understand that I would have to set
certain security variables to tell Condor to allow the execution of jobs from
unauthenticatable users.

I have set the following variables in global condor_config:
SEC_DEFAULT_AUTHENTICATION = NEVER                                                
SEC_CLIENT_AUTHENTICATION = NEVER                                                 
SEC_DEFAULT_AUTHENTICATION_METHODS = CLAIMTOBE                                    
SEC_CLIENT_AUTHENTICATION_METHODS = CLAIMTOBE 

The head node/central manager condor_config.local has:
NETWORK_INTERFACE = 10.0.0.1

At this point, the following commands produce the following output ("zajos" is
the name of the head node):

--------------------
zajos:~/test> condor_submit pi2.cwd
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 33.

zajos:~/test> condor_q -analyze    
Warning:  Found no submitters


-- Submitter: zajos.cluster : <10.0.0.1:46590> : zajos.cluster
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
---
033.000:  Run analysis summary.  Of 5 machines,
      0 are rejected by your job's requirements
      2 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      3 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 are available to run your job

1 jobs; 1 idle, 0 running, 0 held
--------------------

with the following head node CollectorLog entry:

--------------------
8/17 10:20:36 Found ScheddIpAddr
8/17 10:20:36 Got IP = '<10.0.0.1:46590>'
8/17 10:20:36 ScheddAd     : Updating ... "< zajos.cluster , 10.0.0.1 >"
8/17 10:20:36 Found ScheddIpAddr
8/17 10:20:36 Got IP = '<10.0.0.1:46590>'
8/17 10:20:36 SubmittorAd  : Updating ... "< jrollins@xxxxxxxxxxxxxxxxxxxxxxxxxx , 10.0.0.1 >"
8/17 10:20:36 Got QUERY_NEGOTIATOR_ADS
8/17 10:20:36 (Sending 1 ads in response to query)
8/17 10:20:36 Got QUERY_ANY_ADS
8/17 10:20:36 (Sending 16 ads in response to query)
8/17 10:20:36 DaemonCore: PERMISSION DENIED to unknown user from host <10.32.47.10:47253> for command 10 (QUERY_STARTD_PVT_ADS)
--------------------
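
In case it helps anyone scan a longer CollectorLog for the same thing, here is
a small Python sketch that pulls the denied host and command out of entries
like the one above.  The regular expression is based only on the two denial
lines I have seen so far, and denied_requests is a name I made up, so treat it
as a starting point and not a complete parser.

```python
import re

# Pattern for DaemonCore denial entries, modelled on the log excerpt above.
DENIED = re.compile(
    r"PERMISSION DENIED to (?P<user>.+?) from host\s+"
    r"<(?P<ip>[\d.]+):(?P<port>\d+)>\s+"
    r"for command (?P<num>\d+) \((?P<name>\w+)\)"
)

def denied_requests(log_text):
    """Return (ip, command_name) pairs for every denial found in log_text."""
    # Collapse whitespace so an entry wrapped across two lines still matches.
    flat = " ".join(log_text.split())
    return [(m.group("ip"), m.group("name")) for m in DENIED.finditer(flat)]

sample = """8/17 10:20:36 (Sending 16 ads in response to query)
8/17 10:20:36 DaemonCore: PERMISSION DENIED to unknown user from host
<10.32.47.10:47253> for command 10 (QUERY_STARTD_PVT_ADS)"""

print(denied_requests(sample))
```

On the excerpt above, the only denied host it reports is 10.32.47.10, the
outward-pointing address of the head node.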

Sending the job directly to one of the nodes produces an identical response.  It
obviously still looks like an authorization or authentication error, but I'm
really at a loss at this point as to how to fix it.  The "SubmittorAd" line
looks very suspicious to me as well.  That looks like a bug, or at least a
misstatement in the documentation about how some DOMAIN variable somewhere
should be configured.

No one has yet commented on exactly what the "Warning:  Found no submitters"
message from condor_q -analyze means.  Does anyone know?  Could it be a clue to
what's going on?  Or is it just a simple warning of what I already know, i.e.
that the job is in the queue but can't be executed because of some sort of
permissions problem?

I will be eternally grateful if anyone can help me figure out what's going on.

jamie.