
Re: [Condor-users] jobs are only running at condor_master machine



On Monday 29 August 2005 1:27 pm, Narunjan Kumar wrote:
> Hello
Hello,

> i have setup a condor pool of two machines.
> 1st is condor master
> 2nd is slave node.

FYI, in Condor land we don't use the term "master" to refer to a machine; it 
causes confusion with the condor_master.  I believe that the terms you are 
looking for are "central manager" (which you do use below), "execute machine" 
and "submit machine".

> when i submit the jobs  through condor master  it runs but at at
> condor master machine.
> jobs donot go any other machine even the machine are idle.

In other words, jobs submitted from the Central Manager (which is apparently 
also functioning as a submit host, and, possibly, execute host) do run.  Is 
that correct?

> when i submit the jobs with the 2nd machine they remains idle in the
> Que and never runs even on the same machine .

Jobs submitted from the other submit host do not run.  Is that correct?

> in either case i have found same error message in

> ---------- Started Negotiation Cycle ----------
> 8/29 20:16:45 Phase 1:  Obtaining ads from collector ...
> 8/29 20:16:45   Getting all public ads ...
> 8/29 20:16:45   Sorting 7 ads ...
> 8/29 20:16:45   Getting startd private ads ...
> 8/29 20:16:45 Got ads: 7 public and 2 private
> 8/29 20:16:45 Public ads include 1 submitter, 2 startd
> 8/29 20:16:45 Phase 2:  Performing accounting ...
> 8/29 20:16:45 Phase 3:  Sorting submitter ads by priority ...
> 8/29 20:16:45 Phase 4.1:  Negotiating with schedds ...
> 8/29 20:16:45   Negotiating with condor@xxxxxxxxxxxxxxxxxxxxxxx at
> <**.26.146.226:1173>
> 8/29 20:17:15 select returns 0, connect failed
> 8/29 20:17:15 Will keep trying for 30 seconds...
> 8/29 20:17:16 Connect failed for 30 seconds; returning FALSE
> 8/29 20:17:16     Failed to connect to <**.26.146.226:1173>
> 8/29 20:17:16   Error: Ignoring schedd for this cycle
> 8/29 20:17:16 ---------- Finished Negotiation Cycle ----------

It would be very useful to see what's in the SchedLog on **.26.146.226 (which 
I assume to be the second host).
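
In the meantime, one quick check is whether the schedd's port is reachable at 
all from the central manager.  Below is a minimal Python sketch of that check, 
not an official Condor tool.  The function names and the regular expression 
are my own, and the pattern assumes the "Failed to connect to <host:port>" 
wording shown in the NegotiatorLog excerpt above.

```python
import re
import socket

# Assumed pattern: matches lines such as
#   "Failed to connect to <host:port>"
# as they appear in the NegotiatorLog excerpt quoted above.
FAILED = re.compile(r"Failed to connect to <([^:>\s]+):(\d+)>")


def failed_addresses(log_text):
    """Return the (host, port) pairs the negotiator said it could not reach."""
    return [(host, int(port)) for host, port in FAILED.findall(log_text)]


def can_connect(host, port, timeout=5.0):
    """Attempt a plain TCP connect; True if the port accepts the connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False
```

Run on the central manager, you would feed failed_addresses() the text of the 
NegotiatorLog and pass each pair to can_connect().  The address in the archived 
log is masked, so substitute the real one.  A False result only says that a 
TCP connection could not be opened; it does not say whether a firewall, a 
wrong address or a stopped schedd is to blame.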

> what is  the problem here
> why the central manger is unable to connect with  other  machine nodes
> in the pool.
> if I see the condor_status then it shows both computer in the list

Do jobs *run* on the second host?  When you submit 2 jobs from the CM and run 
'condor_status', do both machines switch to "claimed/busy"?  Is there anything 
in any of the logs on the second host about "permission denied" or similar?  
If so, you should review "3.7 Security In Condor" of the Condor manual.
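
If the logs are long, a few lines of Python can pull out the candidates.  This 
is only a sketch: the single word "denied" is my guess at what such messages 
have in common, not a list of actual Condor log strings, and the function name 
is hypothetical.

```python
import re

# Assumption: any line mentioning "denied" (in any letter case) is worth a look.
DENIED = re.compile(r"denied", re.IGNORECASE)


def denied_lines(log_text):
    """Return the log lines that mention a denial, in their original order."""
    return [line for line in log_text.splitlines() if DENIED.search(line)]
```

Read each log file on the second host into a string and pass it to 
denied_lines(); an empty list means nothing matched this particular pattern, 
not that the security configuration is necessarily fine.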

I think that we'll need answers to some of these questions before we can 
proceed much further...

Hope this helps,

-Nick

-- 
           <<< Why, oh, why, didn't I take the blue pill? >>>
 /`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
 \    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
 |_*_|   608-265-5761                    Department of Computer Sciences