[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] jobs fail to run, with "Warning: Found no submitters"



Thanks so much for the reply, Mike.

I don't think it's a disk issue, since the nodes seem to be reporting more disk
space than the job is requiring.  I will be aware of that in the future, though.

I think I've tracked the issue down to a permissions issue with the submitting
user/host.  The Collector appears to see submits on the head node as coming
from the outward-pointing ip address of the head node, which it treats as an
invalid host.  On the head node, the outward-pointing interface has an address
of 10.32.47.10, whereas the interface that all of the cluster nodes are
attached to has an address of 10.0.0.1.  Here is a line from the CollectorLog
on the head node:

8/16 14:45:19 (Sending 15 ads in response to query)
8/16 14:45:19 DaemonCore: PERMISSION DENIED to unknown user from host
<10.32.47.10:45781> for command 10 (QUERY_STARTD_PVT_ADS)
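
If it is host-based authorization that's rejecting that address, my guess
(unverified) is that something like the following in the global condor_config
would admit the head node's outward-facing interface; the addresses below are
just the ones from my setup:

------------------------
# Allow both the cluster-facing network and the head node's
# outward-pointing address (10.32.47.10 is specific to my head node)
HOSTALLOW_READ  = *.cluster, 10.0.0.*, 10.32.47.10
HOSTALLOW_WRITE = *.cluster, 10.0.0.*, 10.32.47.10
------------------------

I haven't confirmed this is the right fix, but it matches the host shown in
the PERMISSION DENIED line above.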

I assume this has to do with my user/host authentication settings, but I've
been trying a bunch of things to no avail.  In my cluster, all users submit
jobs from the head node.  The nodes do not share a UID domain with the head
node, but I assumed that wouldn't matter, since what I've read indicates that
jobs will simply run as the user nobody on the nodes.
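For reference, my understanding (from the manual, so take it with a grain of
salt) is that the user-nobody fallback hinges on the UID_DOMAIN settings,
roughly like this:

------------------------
# If the execute node's UID_DOMAIN does not match the submitting
# machine's, jobs run as user nobody instead of the submitting user.
# "cluster" here is just my domain name; substitute your own.
UID_DOMAIN = cluster
------------------------

I haven't verified that this is what's happening on my nodes, though.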

There are no SEC_ variables set in any of the config files; this is the
default.  I assume that this is the problem, but it's very difficult to debug
when the variables are not mentioned by default and the manual doesn't say
which of these variables are required, or what the assumed defaults are when
they are not present.
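One thing that did help a little was asking Condor directly for the effective
value of a variable, e.g.:

~> condor_config_val SEC_DEFAULT_AUTHENTICATION
~> condor_config_val -v SEC_DEFAULT_AUTHENTICATION

If I'm reading the man page right, the -v flag also reports which config file
(if any) defined the value, which at least distinguishes "unset" from "set
somewhere I forgot about".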

Initially I was getting the following error when I tried submitting a job to a
specific node:

~> condor_submit pi2.cwd -n node1
Submitting job(s)
ERROR: Failed to connect to queue manager node1.cluster
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate.  Globus is reporting error (851968:24).  There
is probably a problem with your credentials.  (Did you run grid-proxy-init?)
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS

After adding these lines to the global condor_config:

SEC_DEFAULT_AUTHENTICATION = OPTIONAL
SEC_CLIENT_AUTHENTICATION = OPTIONAL
SEC_DEFAULT_AUTHENTICATION_METHODS = ANONYMOUS
SEC_CLIENT_AUTHENTICATION_METHODS = ANONYMOUS

I now get the following error:

~> condor_submit pi2.cwd -n node1
Submitting job(s)
ERROR: Failed to set Owner="jrollins" for job 2.0 (13)

ERROR: Failed to queue job.

I guess that's progress, but I'm still a bit confused as to why this has been so
difficult to figure out.

jamie.


On Tue, Aug 16, 2005 at 11:09:40AM -0700, Michael Yoder wrote:
> 
> > Hello.  I've been struggling with a problem that is basically identical
> > to the one described in this post from last year:
> > 
> > https://lists.cs.wisc.edu/archive/condor-users/pre-2004-June/msg01340.shtml
> > 
> > The problem is that I can submit jobs, but whatever jobs are submitted
> > are rejected by all available nodes.
> > 
> > My cluster consists of one dual-cpu head node, and three diskless
> > client nodes:
> > 
> > The Condor setup is very simple, pretty much default.  The head node
> > has the following condor_config.local file:
> > 
> > ------------------------
> > NETWORK_INTERFACE = 10.0.0.1
> > DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
> > ------------------------
> > 
> > and the other nodes are using the
> > <release_dir>/etc/examples/condor_config.local.dedicated.resource file
> > which specifies the DedicatedScheduler as the head node.
> > 
> > I have made a single executable to calculate pi to 10000 digits (which
> > works fine normally), which I am trying to submit with the following
> > command file:
> 
> > ~> condor_q -analyze
> > Warning:  Found no submitters
> > 
> > -- Submitter: zajos.cluster : <10.0.0.1:44160> : zajos.cluster
> >  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
> > ---
> > 012.000:  Run analysis summary.  Of 5 machines,
> >       0 are rejected by your job's requirements
> >       3 reject your job because of their own requirements
> >       0 match but are serving users with a better priority in the pool
> >       2 match but reject the job for unknown reasons
> >       0 match but will not currently preempt their existing job
> >       0 are available to run your job
> > 
> > 1 jobs; 1 idle, 0 running, 0 held
> > ------------------------
> > 
> > Does anyone have any idea what's going wrong?
> 
> Some suggestions:
> - Turn up the level of logging and see what's in the schedd log,
> collector log, and negotiator log.  See 
> 
> http://docs.optena.com/display/CONDOR/How+To+Increase+Debugging+Messages
> 
> This should help track down the 'Found no submitters' error.  The schedd
> ought to be sending information about submitters (users like you that
> have submitted jobs) to the collector, and this information goes to the
> negotiator.  condor_q pulls this info from the negotiator.
> 
> - You say that you have three diskless machines - condor may be thinking
> that they have no disk space, and therefore can't run jobs.  Try
> 'condor_status -l | grep Disk' to see what your machines are
> advertising.
> Try condor_q -l to see your Requirements string and DiskUsage.  There
> probably is a clause like ' && (Disk >= DiskUsage)' in the Requirements,
> and this could be preventing jobs from starting on those machines.
> 
> To disable this safety feature, you'll have to set something like
> 
> Requirements = (Disk >= 0)
> 
> in your submit file.
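> 
> For example, a minimal submit file with that clause might look like the
> following (the executable name is just a placeholder for yours):
> 
> Universe     = vanilla
> Executable   = pi2
> Requirements = (Disk >= 0)
> Queue
> 
> Note that setting Requirements by hand replaces the default clauses
> condor_submit would otherwise add, so only do this for debugging.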
> 
> Mike Yoder
> Principal Member of Technical Staff
> Ask Mike: http://docs.optena.com
> Direct  : +1.408.321.9000
> Fax     : +1.408.321.9030
> Mobile  : +1.408.497.7597
> yoderm@xxxxxxxxxx
> 
> Optena Corporation
> 2860 Zanker Road, Suite 201
> San Jose, CA 95134
> http://www.optena.com
> 
> 
> 
> > Thanks.
> > 
> > jamie.
> > _______________________________________________
> > Condor-users mailing list
> > Condor-users@xxxxxxxxxxx
> > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users