
Re: [Condor-users] jobs are only running at condor_master machine




On 8/29/05, Nick LeRoy <nleroy@xxxxxxxxxxx> wrote:
> On Monday 29 August 2005 1:27 pm, Narunjan Kumar wrote:
> > Hello
> Hello,
>
> > I have set up a Condor pool of two machines.
> > The 1st is the condor master.
> > The 2nd is a slave node.

> FYI, in Condor land we don't use the term "master" to refer to a machine; it
> causes confusion with the condor_master. I believe that the terms you are
> looking for are "central manager" (which you do use below), "execute machine"
> and "submit machine".

Yes, you are right.

> > When I submit jobs through the condor master, they run, but only on the
> > condor master machine.
> > Jobs do not go to any other machine, even when the machines are idle.
>
> In other words, jobs submitted from the Central Manager (which is apparently
> also functioning as a submit host, and, possibly, execute host) do run. Is
> that correct?

Yes, correct. But jobs only run on the host where they were submitted, i.e. the CM. Even if I submit 10 jobs, they run on the CM one by one, and the 2nd host (**.26.146.226) remains idle the whole time.


> > When I submit jobs from the 2nd machine, they remain idle in the
> > queue and never run, even on the same machine.
>
> Jobs submitted from the other submit hosts do not run. Is that correct?

Yes, this is right.




> > in either case i have found same error message in
>
> > ---------- Started Negotiation Cycle ----------
> > 8/29 20:16:45 Phase 1: Obtaining ads from collector ...
> > 8/29 20:16:45 Getting all public ads ...
> > 8/29 20:16:45 Sorting 7 ads ...
> > 8/29 20:16:45 Getting startd private ads ...
> > 8/29 20:16:45 Got ads: 7 public and 2 private
> > 8/29 20:16:45 Public ads include 1 submitter, 2 startd
> > 8/29 20:16:45 Phase 2: Performing accounting ...
> > 8/29 20:16:45 Phase 3: Sorting submitter ads by priority ...
> > 8/29 20:16:45 Phase 4.1: Negotiating with schedds ...
> > 8/29 20:16:45 Negotiating with condor@xxxxxxxxxxxxxxxxxxxxxxx at
> > <**.26.146.226:1173>
> > 8/29 20:17:15 select returns 0, connect failed
> > 8/29 20:17:15 Will keep trying for 30 seconds...
> > 8/29 20:17:16 Connect failed for 30 seconds; returning FALSE
> > 8/29 20:17:16 Failed to connect to <**.26.146.226:1173>
> > 8/29 20:17:16 Error: Ignoring schedd for this cycle
> > 8/29 20:17:16 ---------- Finished Negotiation Cycle ----------
>
> It would be very useful to see what's in the SchedLog on **.26.146.226 (which
> I assume to be the second host).

Here it is (condor_schedd log):

8/29 11:57:25 ******************************************************
8/29 11:57:25 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
8/29 11:57:25 ** /home/condor/condor/sbin/condor_schedd
8/29 11:57:25 ** $CondorVersion: 6.6.10 Jun 13 2005 $
8/29 11:57:25 ** $CondorPlatform: I386-LINUX_RH9 $
8/29 11:57:25 ** PID = 10524
8/29 11:57:25 ******************************************************
8/29 11:57:25 Using config file: /home/condor/condor_config
8/29 11:57:25 Using local config files: /home/condor/condor/etc/grid6.local
8/29 11:57:25 DaemonCore: Command Socket at <**.26.146.226:1173>
8/29 11:57:25 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxx
-------------
8/29 11:59:15 DaemonCore: Command received via TCP from host <**.26.146.226:1181>
8/29 11:59:15 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
8/29 11:59:26 DaemonCore: Command received via UDP from host <**.26.146.226:1030>
8/29 11:59:26 DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
8/29 11:59:26 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxx
8/29 11:59:26 Called reschedule_negotiator()
8/29 12:04:26 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxx
8/29 12:09:26 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxx
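
The negotiator reports that it cannot connect to the schedd command socket shown above. A minimal sketch for testing plain TCP reachability of that socket from the central manager follows. It is an illustration only, not a Condor tool: `can_connect` is a hypothetical helper name, the host is a placeholder because the address is masked in the logs, and port 1173 is taken from the "Command Socket" line and may differ after the schedd restarts.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example call (placeholder address; substitute the second host's real one):
#   can_connect("SECOND_HOST_ADDRESS", 1173)
```

If this returns False when run on the central manager but True when run on the second host itself, something between the two machines (for example a packet filter) is probably blocking the port. That is only one possible explanation for the "connect failed" lines.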




> > What is the problem here?
> > Why is the central manager unable to connect to the other machines
> > in the pool?
> > If I run condor_status, it shows both computers in the list.
>
> Do jobs *run* on the second host? When you submit 2 jobs from the CM and run
> 'condor_status' do they both switch to "claimed/busy"? Is there anything in
> any of the logs on the second host about "permission denied" or similar? If
> so, you should review "3.7 Security In Condor" of the Condor manual.

Only the CM (1st host) executes the jobs and switches to "claimed/busy"; the 2nd host remains idle.
I have checked the log files and have not found any "permission denied" problem yet.

An authentication problem occurs when I try to submit the job from the 2nd host to the CM using the following command:

condor_submit Count.submit -n grid6.my.domain.com

8/29 21:26:17 SCHEDD: authentication failed
8/29 21:26:17 AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS
8/29 21:26:44 AUTHENTICATE: no available authentication methods succeeded, failing!
8/29 21:26:44 SCHEDD: authentication failed
8/29 21:26:44 AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS
8/29 21:27:29 DaemonCore: Command received via UDP from host <**.26.146.224:33449>
8/29 21:27:29 DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
8/29 21:27:30 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxx
8/29 21:27:30 Called reschedule_negotiator()
8/29 21:27:30 DaemonCore: Command received via TCP from host <**.26.146.224:33447>
8/29 21:27:30 DaemonCore: received command 416 (NEGOTIATE), calling handler (negotiate)
8/29 21:27:30 Negotiating for owner: condor@xxxxxxxxxxxxxxxxxxx
8/29 21:27:30 Checking consistency running and runnable jobs
8/29 21:27:30 Tables are consistent
8/29 21:27:30 Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
8/29 21:27:32 Started shadow for job 39.0 on "<**.26.146.224:33371>", (shadow pid = 9351)
8/29 21:27:34 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxx
8/29 21:28:54 Shadow pid 9351 for job 39.0 exited with status 100
8/29 21:28:54 match (<**.26.146.224:33371>#**2451276) out of jobs (cluster id 39); relinquishing
8/29 21:28:54 Sent RELEASE_CLAIM to startd on <**.26.146.224:33371>
8/29 21:28:54 Match record (<**.26.146.224:33371>, 39, -1) deleted
8/29 21:28:54 DaemonCore: Command received via TCP from host <**.26.146.224:33464>
8/29 21:28:54 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
8/29 21:28:54 Got VACATE_SERVICE from <**.26.146.224:33464>
8/29 21:32:34 Sent owner (0 jobs) ad to central manager
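
To summarize which authentication methods a log like the one above reports as failed, a small sketch along these lines may help. It assumes only the line format visible in this excerpt ("AUTHENTICATE:<code>:Failed to authenticate using <METHOD>"); `failed_auth_methods` is a hypothetical helper name, not part of Condor.

```python
import re

# Matches lines such as "AUTHENTICATE:1004:Failed to authenticate using GSI".
# The "with any method" summary line is deliberately not matched.
FAILED_METHOD = re.compile(r"AUTHENTICATE:\d+:Failed to authenticate using (\w+)")


def failed_auth_methods(log_text):
    """Return the distinct methods reported as failed, in first-seen order."""
    seen = []
    for match in FAILED_METHOD.finditer(log_text):
        method = match.group(1)
        if method not in seen:
            seen.append(method)
    return seen


# Sample copied from the log excerpt above.
sample = """\
8/29 21:26:17 AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS
"""
print(failed_auth_methods(sample))  # ['GSI', 'KERBEROS', 'FS']
```

For the excerpt above this lists GSI, KERBEROS, and FS, i.e. every method that was attempted for the remote submit failed.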



Note:
I am sharing /home/condor/condor/ through NFS among all hosts.
I don't have any shared password file.

> I think that we'll need answers to some of these questions before we can
> proceed much further...
>
> Hope this helps,
>
Thanks
> -Nick
Narunjan
> --
> <<< Why, oh, why, didn't I take the blue pill? >>>
> /`-_ Nicholas R. LeRoy The Condor Project
> { }/ http://www.cs.wisc.edu/~nleroy http://www.cs.wisc.edu/condor
> \ / nleroy@xxxxxxxxxxx The University of Wisconsin
> |_*_| 608-265-5761 Department of Computer Sciences