
Re: [Condor-users] jobs are only running at condor_master machine



One more thing that I did not put in my last mail:
I also tried
/home/condor/condor/bin/condor_q  -analyze

 Run analysis summary.  Of 2 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match, but are serving users with a better priority in the pool
      2 match, match, but reject the job for unknown reasons
      0 match, but will not currently preempt their existing job
      0 are available to run your job
It might help you to solve the problem.

Thanks in advance
Narunjan

On 8/29/05, Narunjan Kumar <naranjan@xxxxxxxxx> wrote:

On 8/29/05, Nick LeRoy <nleroy@xxxxxxxxxxx> wrote:
> On Monday 29 August 2005 1:27 pm, Narunjan Kumar wrote:
> > Hello
> Hello,
>
> > I have set up a Condor pool of two machines.
> > The 1st is the condor master.
> > The 2nd is a slave node.

> FYI, in Condor land we don't use the term "master" to refer to a machine; it
> causes confusion with the condor_master. I believe that the terms you are
> looking for are "central manager" (which you do use below), "execute machine"
> and "submit machine".

Yes, you are right.


> > When I submit jobs through the condor master, they run, but only on the
> > condor master machine.
> > Jobs do not go to any other machine even when the machines are idle.
>
> In other words, jobs submitted from the Central Manager (which is apparently
> also functioning as a submit host, and, possibly, execute host) do run. Is
> that correct?

Yes, correct. (But jobs only run on the same host where they were submitted,
i.e. the CM. Even if I submit 10 jobs, they run on the CM one by one, and the 2nd host **.26.146.226 remains idle all the time.)



> > When I submit jobs from the 2nd machine, they remain idle in the
> > queue and never run, even on the same machine.
>
> Jobs submitted from the other submit hosts do not run. Is that correct?

Yes, this is right.





> > In either case I have found the same error message in
>
> > ---------- Started Negotiation Cycle ----------
> > 8/29 20:16:45 Phase 1: Obtaining ads from collector ...
> > 8/29 20:16:45 Getting all public ads ...
> > 8/29 20:16:45 Sorting 7 ads ...
> > 8/29 20:16:45 Getting startd private ads ...
> > 8/29 20:16:45 Got ads: 7 public and 2 private
> > 8/29 20:16:45 Public ads include 1 submitter, 2 startd
> > 8/29 20:16:45 Phase 2: Performing accounting ...
> > 8/29 20:16:45 Phase 3: Sorting submitter ads by priority ...
> > 8/29 20:16:45 Phase 4.1: Negotiating with schedds ...
> > 8/29 20:16:45 Negotiating with condor@xxxxxxxxxxxxxxxxxxxxxxx at <**.26.146.226:1173>
> > 8/29 20:17:15 select returns 0, connect failed
> > 8/29 20:17:15 Will keep trying for 30 seconds...
> > 8/29 20:17:16 Connect failed for 30 seconds; returning FALSE
> > 8/29 20:17:16 Failed to connect to <**.26.146.226:1173>
> > 8/29 20:17:16 Error: Ignoring schedd for this cycle
> > 8/29 20:17:16 ---------- Finished Negotiation Cycle ----------
>
> It would be very useful to see what's in the SchedLog on **.26.146.226 (which
> I assume to be the second host).


Here it is:

condor_schedd

8/29 11:57:25 ******************************************************
8/29 11:57:25 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
8/29 11:57:25 ** /home/condor/condor/sbin/condor_schedd
8/29 11:57:25 ** $CondorVersion: 6.6.10 Jun 13 2005 $
8/29 11:57:25 ** $CondorPlatform: I386-LINUX_RH9 $
8/29 11:57:25 ** PID = 10524
8/29 11:57:25 ******************************************************
8/29 11:57:25 Using config file: /home/condor/condor_config
8/29 11:57:25 Using local config files: /home/condor/condor/etc/grid6.local
8/29 11:57:25 DaemonCore: Command Socket at <**.26.146.226:1173>
8/29 11:57:25 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxx

-------------

8/29 11:59:15 DaemonCore: Command received via TCP from host <**.26.146.226:1181>
8/29 11:59:15 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
8/29 11:59:26 DaemonCore: Command received via UDP from host <**.26.146.226:1030>
8/29 11:59:26 DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
8/29 11:59:26 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxx
8/29 11:59:26 Called reschedule_negotiator()
8/29 12:04:26 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxx
8/29 12:09:26 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxx
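
The SchedLog shows the schedd listening on <**.26.146.226:1173>, which is the same address the negotiator failed to connect to. One way to narrow this down is to test, from the central manager, whether a plain TCP connection to that port succeeds at all. What follows is only a minimal Python sketch: the function name can_connect and the 5-second timeout are illustrative choices and not part of Condor, and the host and port are taken from the command line because the address in the logs is masked.

```python
import socket
import sys


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python check_port.py <schedd-host> <schedd-port>
    host, port = sys.argv[1], int(sys.argv[2])
    status = "reachable" if can_connect(host, port) else "NOT reachable"
    print(f"{host}:{port} is {status}")
```

If the port turns out not to be reachable from the central manager while the schedd is running, a firewall or packet filter between the two hosts is one possible explanation, though this check alone cannot confirm it.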





> > What is the problem here?
> > Why is the central manager unable to connect to the other machines
> > in the pool?
> > If I run condor_status, it shows both computers in the list.
>
> Do jobs *run* on the second host? When you submit 2 jobs from the CM and run
> 'condor_status' do they both switch to "claimed/busy"? Is there anything in
> any of the logs on the second host about "permission denied" or similar? If
> so, you should review "3.7 Security In Condor" of the Condor manual.


Only the CM (1st host) is executing the jobs, and it switches to "claimed/busy"; the 2nd host remains idle.

I have checked the log files and have not found any "permission denied" problem yet.


An authentication problem occurs when I try to submit the job from the 2nd host to the CM using the following command:

condor_submit Count.submit -n grid6.my.domain.com


8/29 21:26:17 SCHEDD: authentication failed
8/29 21:26:17 AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS
8/29 21:26:44 AUTHENTICATE: no available authentication methods succeeded, failing!
8/29 21:26:44 SCHEDD: authentication failed
8/29 21:26:44 AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS
8/29 21:27:29 DaemonCore: Command received via UDP from host <**.26.146.224:33449>
8/29 21:27:29 DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
8/29 21:27:30 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxx
8/29 21:27:30 Called reschedule_negotiator()
8/29 21:27:30 DaemonCore: Command received via TCP from host <**.26.146.224:33447>
8/29 21:27:30 DaemonCore: received command 416 (NEGOTIATE), calling handler (negotiate)
8/29 21:27:30 Negotiating for owner: condor@xxxxxxxxxxxxxxxxxxx
8/29 21:27:30 Checking consistency running and runnable jobs
8/29 21:27:30 Tables are consistent
8/29 21:27:30 Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
8/29 21:27:32 Started shadow for job 39.0 on "<**.26.146.224:33371>", (shadow pid = 9351)
8/29 21:27:34 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxx
8/29 21:28:54 Shadow pid 9351 for job 39.0 exited with status 100
8/29 21:28:54 match (<**.26.146.224:33371>#**2451276) out of jobs (cluster id 39); relinquishing
8/29 21:28:54 Sent RELEASE_CLAIM to startd on <**.26.146.224:33371>
8/29 21:28:54 Match record (<**.26.146.224:33371>, 39, -1) deleted
8/29 21:28:54 DaemonCore: Command received via TCP from host <**.26.146.224:33464>
8/29 21:28:54 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
8/29 21:28:54 Got VACATE_SERVICE from <**.26.146.224:33464>
8/29 21:32:34 Sent owner (0 jobs) ad to central manager
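
In a longer log the repeated AUTHENTICATE stanzas are easy to lose among the other messages. As a rough aid, the following Python sketch tallies which authentication methods are reported as failed. It assumes the exact "AUTHENTICATE:1004:Failed to authenticate using <METHOD>" wording seen above; the function name failed_auth_methods and the command-line usage are illustrative and not part of Condor.

```python
import re
import sys
from collections import Counter

# Matches lines such as "AUTHENTICATE:1004:Failed to authenticate using GSI".
AUTH_FAILURE = re.compile(r"AUTHENTICATE:1004:Failed to authenticate using (\w+)")


def failed_auth_methods(lines):
    """Count how often each authentication method appears in a failure line."""
    counts = Counter()
    for line in lines:
        match = AUTH_FAILURE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: python auth_failures.py /path/to/SchedLog
    with open(sys.argv[1], errors="replace") as log:
        for method, count in failed_auth_methods(log).most_common():
            print(f"{method}: {count} failure(s)")
```

A tally like this only summarizes what the log already says; it does not explain why a given method failed.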




Note:
I am sharing /home/condor/condor/ through NFS among all hosts.

I don't have any shared passwd file.

> I think that we'll need answers to some of these questions before we can
> proceed much further...
>
> Hope this helps,
>
Thanks
> -Nick
Narunjan
> --
> <<< Why, oh, why, didn't I take the blue pill? >>>
> /`-_ Nicholas R. LeRoy The Condor Project
> { }/ http://www.cs.wisc.edu/~nleroy http://www.cs.wisc.edu/condor
> \ / nleroy@xxxxxxxxxxx The University of Wisconsin
> |_*_| 608-265-5761 Department of Computer Sciences
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>