Re: [Condor-users] jobs are only running at condor_master machine

On 8/29/05, Narunjan Kumar <naranjan@xxxxxxxxx> wrote:

On 8/29/05, Nick LeRoy <nleroy@xxxxxxxxxxx> wrote:
> On Monday 29 August 2005 1:27 pm, Narunjan Kumar wrote:
> > Hello
> Hello,
>
> > I have set up a Condor pool of two machines.
> > 1st is condor master
> > 2nd is slave node.
>
> FYI, in Condor land we don't use the term "master" to refer to a machine; it
> causes confusion with the condor_master.  I believe that the terms you are
> looking for are "central manager" (which you do use below), "execute machine"
> and "submit machine".


Yes, you are right.


> > When I submit jobs through the condor master, they run, but only on the
> > condor master machine.
> > Jobs do not go to any other machine even when the machines are idle.
>
> In other words, jobs submitted from the Central Manager (which is apparently
> also functioning as a submit host, and, possibly, execute host) do run.  Is
> that correct?

Yes, correct. (But jobs only run on the same host where they were submitted,
i.e. the CM. Even if I submit 10 jobs, they run on the CM one by one, and the
2nd host, **.26.146.226, remains idle all the time.)



> > When I submit jobs from the 2nd machine, they remain idle in the
> > queue and never run, even on that same machine.
>
> Jobs submitted from the other submit hosts do not run.  Is that correct?

Yes, this is right.





> > In either case I have found the same error message in
> >
> > ---------- Started Negotiation Cycle ----------
> > 8/29 20:16:45 Phase 1:  Obtaining ads from collector ...
> > 8/29 20:16:45   Getting all public ads ...
> > 8/29 20:16:45   Sorting 7 ads ...
> > 8/29 20:16:45   Getting startd private ads ...
> > 8/29 20:16:45 Got ads: 7 public and 2 private
> > 8/29 20:16:45 Public ads include 1 submitter, 2 startd
> > 8/29 20:16:45 Phase 2:  Performing accounting ...
> > 8/29 20:16:45 Phase 3:  Sorting submitter ads by priority ...
> > 8/29 20:16:45 Phase 4.1:  Negotiating with schedds ...
> > 8/29 20:16:45   Negotiating with condor@xxxxxxxxxxxxxxxxxxxxxxx at <**.26.146.226:1173>
> > 8/29 20:17:15 select returns 0, connect failed
> > 8/29 20:17:15 Will keep trying for 30 seconds...
> > 8/29 20:17:16 Connect failed for 30 seconds; returning FALSE
> > 8/29 20:17:16     Failed to connect to <**.26.146.226:1173>
> > 8/29 20:17:16   Error: Ignoring schedd for this cycle
> > 8/29 20:17:16 ---------- Finished Negotiation Cycle ----------
>
> It would be very useful to see what's in the SchedLog on **.26.146.226 (which
> I assume to be the second host).


Here it is:

condor_schedd

8/29 11:57:25 ******************************************************
8/29 11:57:25 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
8/29 11:57:25 ** /home/condor/condor/sbin/condor_schedd
8/29 11:57:25 ** $CondorVersion: 6.6.10 Jun 13 2005 $
8/29 11:57:25 ** $CondorPlatform: I386-LINUX_RH9 $
8/29 11:57:25 ** PID = 10524
8/29 11:57:25 ******************************************************
8/29 11:57:25 Using config file: /home/condor/condor_config
8/29 11:57:25 Using local config files: /home/condor/condor/etc/grid6.local
8/29 11:57:25 DaemonCore: Command Socket at <**.26.146.226:1173>
8/29 11:57:25 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxx
-------------
8/29 11:59:15 DaemonCore: Command received via TCP from host <**.26.146.226:1181>
8/29 11:59:15 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
8/29 11:59:26 DaemonCore: Command received via UDP from host <**.26.146.226:1030>
8/29 11:59:26 DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
8/29 11:59:26 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxx
8/29 11:59:26 Called reschedule_negotiator()
8/29 12:04:26 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxx
8/29 12:09:26 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxx
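
The negotiator log reports a failed connect to <**.26.146.226:1173>, which is the same command socket address the schedd prints at startup. One quick way to see whether that port is reachable from the central manager is a plain TCP probe. The following is only a minimal sketch in Python, not a Condor tool: `can_connect` and `parse_sinful` are hypothetical helper names, and the 5 second timeout is an arbitrary choice.

```python
import socket


def parse_sinful(sinful):
    """Split an address string of the form '<host:port>' into (host, port)."""
    host, _, port = sinful.strip("<>").rpartition(":")
    return host, int(port)


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts alike.
        return False
```

Run on the central manager, something like `can_connect(*parse_sinful("<192.0.2.10:1173>"))` would do the check, with the placeholder address replaced by the one from the "Command Socket at" line in the SchedLog. A `False` result only says that the TCP connection could not be opened; it does not say why.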





> > What is the problem here?
> > Why is the central manager unable to connect to the other machines
> > in the pool?
> > If I run condor_status, it shows both computers in the list.
>
> Do jobs *run* on the second host?  When you submit 2 jobs from the CM and run
> 'condor_status' do they both switch to "claimed/busy"?  Is there anything in
> any of the logs on the second host about "permission denied" or similar?  If
> so, you should review "3.7 Security In Condor" of the Condor manual.


Only the CM (1st host) is executing the jobs, and it switches to "claimed/busy"; the 2nd host remains idle.

I have checked the log files and have not found any "permission denied" problem yet.
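
A manual check like this is easy to get wrong when several daemon logs are involved, so here is a minimal sketch of the same search done by script. It assumes the logs are plain text files; the function names are made up for illustration, and the phrases it looks for are only the ones that come up in this thread.

```python
import re

# Phrases taken from this thread; extend the pattern as needed.
SUSPECT = re.compile(
    r"permission denied|authentication failed|failed to authenticate",
    re.IGNORECASE,
)


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines that mention a permission or authentication failure."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if SUSPECT.search(line)
    ]


def scan_file(path):
    """Apply find_suspect_lines to one log file, tolerating odd bytes in the log."""
    with open(path, errors="replace") as handle:
        return find_suspect_lines(handle)
```

`scan_file` would be called once per log file on the second host; the log directory is whatever `condor_config_val LOG` reports there.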



An authentication problem occurs when I try to submit the job from the 2nd host to the CM using the following command:

condor_submit Count.submit -n grid6.my.domain.com


8/29 21:26:17 SCHEDD: authentication failed
8/29 21:26:17 AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS

8/29 21:26:44 AUTHENTICATE: no available authentication methods succeeded, failing!
8/29 21:26:44 SCHEDD: authentication failed
8/29 21:26:44 AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5002:Failed to authenticate because the remote (client) side was not able to acquire its credentials.
AUTHENTICATE:1004:Failed to authenticate using KERBEROS
AUTHENTICATE:1004:Failed to authenticate using FS

8/29 21:27:29 DaemonCore: Command received via UDP from host <**.26.146.224:33449>
8/29 21:27:29 DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
8/29 21:27:30 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxx
8/29 21:27:30 Called reschedule_negotiator()
8/29 21:27:30 DaemonCore: Command received via TCP from host <**.26.146.224:33447>
8/29 21:27:30 DaemonCore: received command 416 (NEGOTIATE), calling handler (negotiate)
8/29 21:27:30 Negotiating for owner: condor@xxxxxxxxxxxxxxxxxxx
8/29 21:27:30 Checking consistency running and runnable jobs
8/29 21:27:30 Tables are consistent
8/29 21:27:30 Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
8/29 21:27:32 Started shadow for job 39.0 on "<**.26.146.224:33371>", (shadow pid = 9351)
8/29 21:27:34 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxx
8/29 21:28:54 Shadow pid 9351 for job 39.0 exited with status 100
8/29 21:28:54 match (<**.26.146.224:33371>#**2451276) out of jobs (cluster id 39); relinquishing
8/29 21:28:54 Sent RELEASE_CLAIM to startd on <**.26.146.224:33371>
8/29 21:28:54 Match record (<**.26.146.224:33371>, 39, -1) deleted
8/29 21:28:54 DaemonCore: Command received via TCP from host <**.26.146.224:33464>
8/29 21:28:54 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
8/29 21:28:54 Got VACATE_SERVICE from <**.26.146.224:33464>
8/29 21:32:34 Sent owner (0 jobs) ad to central manager
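
The authentication part of this log names each method that was tried (GSI, KERBEROS, FS) on its own AUTHENTICATE:1004 line. To summarize which methods fail and how often across a longer log, a small tally can help. This is a minimal sketch that assumes the message format stays exactly as pasted above; `failed_methods` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines such as "AUTHENTICATE:1004:Failed to authenticate using FS".
METHOD_FAILURE = re.compile(r"AUTHENTICATE:1004:Failed to authenticate using (\w+)")


def failed_methods(log_text):
    """Count how many times each authentication method is reported as failed."""
    return Counter(METHOD_FAILURE.findall(log_text))
```

Called on the text of the SchedLog, it returns a `Counter` keyed by method name, which makes it easy to see whether any configured method is missing from the failures altogether.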




Note:
I am sharing /home/condor/condor/ through NFS among all hosts.
I don't have any shared password file.

> I think that we'll need answers to some of these questions before we can
> proceed much further...
>
> Hope this helps,
>
Thanks
> -Nick
Narunjan
> --
>            <<< Why, oh, why, didn't I take the blue pill? >>>
>  /`-_    Nicholas R. LeRoy               The Condor Project
> {     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
>  \    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
>  |_*_|   608-265-5761                    Department of Computer Sciences
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>