
Re: [condor-users] Network in Linux-Cluster and MPI



First, and unrelated to your specific problem: you should configure your
Condor hosts with fully qualified domain names, not just short names.
Judging by your condor_status output, that does not seem to be the case.
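
For instance, if your hosts resolve only to short names, you can tell
Condor which domain to append. A minimal sketch for the global config
file ("uni-jena.de" is a placeholder; use your real DNS domain):

  # global condor_config: append this domain to short host names
  DEFAULT_DOMAIN_NAME = uni-jena.de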
Now, remove 0.0.0.0 from all, and I repeat, _all_, Condor configuration
files. As you can see in your logs, Condor has no idea who 0.0.0.0 is or
how to contact it. Condor cannot act as a router for you (quite
unfortunately, I'd say, but that is what you have).
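
On each host, NETWORK_INTERFACE should name an address that actually
belongs to that host. A sketch (141.35.14.22 is the schedd address taken
from your logs; substitute each machine's own IP):

  # condor_config on the submit host, for example
  NETWORK_INTERFACE = 141.35.14.22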
As I mentioned, you should configure your host "ipc654" as a router; see
the Linux HOWTOs on this (www.tldp.org). You are not using private
addresses, so all of your computers should already have valid, routable
IPs. Then check plain communication from Linux to SUN and back (ping).
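
A minimal sketch of that routing step on a Linux gateway (the HOWTOs
cover the details; the host names here are just the ones from your pool):

  # on ipc654, as root: forward packets between the two interfaces
  echo 1 > /proc/sys/net/ipv4/ip_forward

  # then verify both directions, e.g.
  ping isun01     # from a cluster node behind ipc654
  ping anne       # from a SUN host, once its routes point at ipc654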
And finally, reconfigure Condor on this gateway to be a simple worker
host (running the startd alone), listening on one of its interfaces. Set
up the matchmaker host somewhere, either on the SUN side or the Linux
side; it will receive all the updates automatically, since your router
will now allow this.
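
Concretely, the gateway's local config could look something like this (a
sketch, assuming you put the matchmaker on isun01; adjust the names to
your setup):

  # local config on ipc654: run only the master and the startd
  DAEMON_LIST = MASTER, STARTD
  # the matchmaker (central manager) host
  CONDOR_HOST = isun01.uni-jena.de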

By the way, if you had a mixed cluster, one half on a private network
(all those reserved IPs) and the other on a public one, you would be in
trouble.

Mark
 
On Mon, 2003-10-27 at 19:06, Degi Baatartsogt wrote:
> On Mon, 27 Oct 2003 marks@xxxxxxxxxxxxxxxxxxxxxxx wrote:
> 
> > I think that if all your cluster computers are connected to both networks, it
> > would be enough to use Condor with one of them.
> 
> Our cluster computers are connected only to the host computer ipc654,
> and only "ipc654" is connected to the outside. So only "ipc654" can
> contact the Condor host "isun01".
> 
> > You should put the IP of the interface that is connected to the network
> > shared by all the computers. For instance, if you have 192.168.10.* for
> > all your machines, you should put, say, 192.168.10.1 for the first one,
> > and so on.
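> >
> > In config-file form, that is (a sketch, one line per machine, each with
> > its own IP):
> >
> >   NETWORK_INTERFACE = 192.168.10.1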
> 
> I reconfigured it as you said here, and now I can see that they can
> communicate with each other. With the command condor_status I get the
> following information:
> Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
> 
> anne          LINUX       INTEL  Owner      Idle       0.000   501  0+00:10:11
> bine          LINUX       INTEL  Owner      Idle       0.000   501  0+00:10:11
> carmen        LINUX       INTEL  Owner      Idle       0.000   501  0+00:10:11
> dana          LINUX       INTEL  Owner      Idle       0.000   501  0+00:10:10
> emma          LINUX       INTEL  Owner      Idle       0.000   501  0+00:10:11
> franzi        LINUX       INTEL  Owner      Idle       0.060   501  0+00:10:11
> grace         LINUX       INTEL  Owner      Idle       0.000   501  0+00:10:11
> vm1@xxxxxxxxx LINUX       INTEL  Owner      Idle       0.070   503  0+00:15:09
> vm2@xxxxxxxxx LINUX       INTEL  Unclaimed  Idle       0.000   503  0+00:15:05
> vm1@xxxxxxxxx SOLARIS28   SUN4u  Owner      Idle       0.000   512  0+00:40:07
> vm2@xxxxxxxxx SOLARIS28   SUN4u  Unclaimed  Idle       0.000   512  0+00:40:05
> vm3@xxxxxxxxx SOLARIS28   SUN4u  Unclaimed  Idle       0.000   512  0+00:40:06
> vm4@xxxxxxxxx SOLARIS28   SUN4u  Unclaimed  Idle       0.000   512  0+00:40:07
> vm5@xxxxxxxxx SOLARIS28   SUN4u  Unclaimed  Idle       0.000   512  0+00:40:08
> vm6@xxxxxxxxx SOLARIS28   SUN4u  Unclaimed  Idle       0.000   512  0+00:40:09
> isun25        SOLARIS28   SUN4u  Unclaimed  Idle       0.086    64  0+00:49:54
> isun26        SOLARIS28   SUN4u  Unclaimed  Idle       0.008    64  0+00:50:04
> isun28        SOLARIS28   SUN4u  Unclaimed  Idle       0.000    64  0+01:50:05
> isun35        SOLARIS28   SUN4u  Unclaimed  Idle       0.000   128  0+03:40:05
> isun09        SOLARIS28   SUN4x  Unclaimed  Idle       0.008    64  0+00:49:02
> isun22        SOLARIS28   SUN4x  Unclaimed  Idle       0.016    64  0+01:35:04
> isun23        SOLARIS28   SUN4x  Unclaimed  Idle       0.004    64  0+01:50:04
> 
>                      Machines Owner Claimed Unclaimed Matched Preempting
> 
>          INTEL/LINUX        9     8       0         1       0          0
>      SUN4u/SOLARIS28       10     1       0         9       0          0
>      SUN4x/SOLARIS28        3     0       0         3       0          0
> 
>                Total       22     9       0        13       0          0
> 
> Now I am trying to execute jobs, but the jobs run only on the machine
> where they were submitted, not on a remote machine. Do you know what the
> problem is? The following is the submit file, submitted on "isun01" for a
> remote machine. I have both executables on "isun01".
> 
> -----------------------------------------------------------
> ################
> #
> # Condor submit file for simple test job example
> #
> ################
> 
> Universe        = vanilla
> Executable      = hello.$$(OpSys).$$(Arch)
> 
> Requirements    =  (Arch == "INTEL" && OpSys == "LINUX")
> 
> transfer_files = ALWAYS
> 
> input           = /dev/null
> output          = het.out
> error           = het.error
> log             = het.log
> 
> Queue
> -----------------------------------------------------------
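> 
> (A quick sanity check for why a job stays on the submit machine is
> condor_q's analyzer -- a sketch, run on the submit host "isun01":
> 
>    condor_q -analyze 78
> 
> which summarizes, for job 78, how many machines in the pool match or
> reject its Requirements.)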
> 
> Log files on "isun01" after executing job 78.0 on "isun01"
> -------------------------------------------------------------
> 
> ==> condor/hosts/isun01/log/NegotiatorLog <==
> 10/27 17:55:46 Connect failed for 10 seconds; returning FALSE
> 10/27 17:55:46     Failed to connect to <0.0.0.0:33493>
> 10/27 17:55:46   Error: Ignoring schedd for this cycle
> 10/27 17:55:46   Negotiating with baatarts@xxxxxxxxxxxxxxx at <141.35.14.22:55627>
> 10/27 17:55:46     Request 00078.00000:
> 10/27 17:55:46       Matched 78.0 baatarts@xxxxxxxxxxxxxxx <141.35.14.22:55627> preempting none <0.0.0.0:33497>
> 10/27 17:55:46       Successfully matched with dana
> 10/27 17:55:46     Got NO_MORE_JOBS;  done negotiating
> 10/27 17:55:46 ---------- Finished Negotiation Cycle ----------
> 
> ==> condor/hosts/isun01/log/SchedLog <==
> 10/27 17:55:46 Activity on stashed negotiator socket
> 10/27 17:55:46 Negotiating for owner: baatarts@xxxxxxxxxxxxxxx
> 10/27 17:55:46 Checking consistency running and runnable jobs
> 10/27 17:55:46 Tables are consistent
> 10/27 17:55:46 Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
> 10/27 17:55:46 Sent ad to central manager for baatarts@xxxxxxxxxxxxxxx
> 10/27 17:55:46 Can't connect to <0.0.0.0:33497>:0, errno = 146
> 10/27 17:55:46 Will keep trying for 10 seconds...
> 10/27 17:55:56 Connect failed for 10 seconds; returning FALSE
> 10/27 17:55:56 Couldn't send REQUEST_CLAIM to startd at <0.0.0.0:33497>
> 10/27 17:55:56 Sent RELEASE_CLAIM to startd on <0.0.0.0:33497>
> 10/27 17:55:56 Match record (<0.0.0.0:33497>, 78, 0) deleted
> 
> ==> condor/hosts/isun01/log/MatchLog <==
> 10/27 17:55:46       Matched 78.0 baatarts@xxxxxxxxxxxxxxx
> <141.35.14.22:55627> preempting none <0.0.0.0:33497>
> 
> ==> condor/hosts/isun01/log/CollectorLog <==
> 10/27 17:55:54 (Sent 59 ads in response to query)
> 10/27 17:55:54 DaemonCore: PERMISSION DENIED to unknown user from host
> <141.35.14.189:34481> for command 10 (QUERY_STARTD_PVT_ADS)
> 
> 
> 
> Log file on "ipc654" after executing job 78.0 on "isun01"
> -------------------------------------------------------------
> 
> 10/27 17:50:54 ---------- Started Negotiation Cycle ----------
> 10/27 17:50:54 Phase 1:  Obtaining ads from collector ...
> 10/27 17:50:54   Getting all public ads ...
> 10/27 17:50:54   Sorting 56 ads ...
> 10/27 17:50:54   Getting startd private ads ...
> 10/27 17:50:54 Couldn't fetch ads: communication error
> 10/27 17:50:54 Aborting negotiation cycle
> 
> 
> > If you have two NON-interconnected networks of SUN and LINUX computers,
> > you should set up a gateway as a router, which would forward packets from
> > SUN to Linux and back in a transparent manner (from the application's
> > point of view), and afterwards set up Condor on that network, as
> > specified above.
> > Mark
> >
> > Quoting Degi Baatartsogt <baatarts@xxxxxxxxxxxxxxxxx>:
> >
> > >
> > > Hi Mark,
> > >
> > > My problem is that we have a Linux cluster (Beowulf) here, so our Linux
> > > host has two interfaces. That's why I'm trying to use NETWORK_INTERFACE.
> > > I'm not sure what kind of address I should use, but I have tried all the
> > > possibilities. As I understand it, we can't solve this problem until we
> > > get the source code. Is that right?
> > >
> > > On 23 Oct 2003, Mark Silberstein wrote:
> > >
> > > > Well, I would not mix these two things. Why do you use the 0.0.0.0
> > > > setting for NETWORK_INTERFACE? If your Linux and SUN pools are
> > > > connected via the network in any way, you should not need to configure
> > > > Condor to listen on more than one NW interface. Can you be more
> > > > specific about your network topology, so that we can understand this?
> > > > I expect that you would get the same communication problem for
> > > > whatever job you run, since ALL Condor communications will fail with
> > > > the NETWORK_INTERFACE parameter set to 0.0.0.0.
> > > >
> > > >
> > > > On Sun, 2003-10-19 at 17:10, Degi Baatartsogt wrote:
> > > > > Hi Mark,
> > > > >
> > > > > thank you for your response!
> > > > >
> > > > > > Sorry, from our experience this won't work. Condor can't really
> > > > > > listen on more than one NW interface; at least we did not succeed.
> > > > > > If someone from the team knows the answer, please share it with us!
> > > > > > Mark
> > > > >
> > > > > Does it mean that MPI Condor jobs won't work on the cluster? I also
> > > > > get the same communication problem if I submit an MPI (MPICH) job on
> > > > > Condor in our cluster.
> > > > >
> > > > > Degi
> > > > >
> > > > > > On Wed, 2003-10-15 at 14:58, Degi Baatartsogt wrote:
> > > > > > > Hello everybody,
> > > > > > >
> > > > > > > I'm trying to use flocking between the Sun pool and the Linux
> > > > > > > pool. For that I changed the flocking parameters in both
> > > > > > > directions and set NETWORK_INTERFACE to 0.0.0.0 in the global
> > > > > > > config file. Now I get the following messages in the log files.
> > > > > > > Does anybody know what I should do?
> > > > > > >
> > > > > > > Thank you
> > > > > > > Ms Baatartsogt
> > > > > > >
> > > > > > > ==> SchedLog <==
> > > > > > > 10/15 12:37:59 DaemonCore: Command received via UDP from host <127.0.0.1:yyyyy>
> > > > > > > 10/15 12:37:59 DaemonCore: received command 421 (RESCHEDULE), calling
> > > > > > >                handler (reschedule_negotiator)
> > > > > > > 10/15 12:37:59 Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxx
> > > > > > > 10/15 12:37:59 Called reschedule_negotiator()
> > > > > > > 10/15 12:37:59 DaemonCore: PERMISSION DENIED to unknown user from host
> > > > > > >                <127.0.0.1:xxxxx> for command 416 (NEGOTIATE)
> > > > > > >
> > > > > > > ==> CollectorLog <==
> > > > > > > 10/15 12:38:05 DC_AUTHENTICATE: attempt to open invalid session ipc654:15713:1066213385:334, failing.
> > > > > > > 10/15 12:38:12 DC_AUTHENTICATE: attempt to open invalid session ipc654:15713:1066213692:349, failing.
> > > > > > > 10/15 12:38:17 DC_AUTHENTICATE: attempt to open invalid session ipc654:15713:106
> > > > > > > ...
> >
> 
> --------------------------------------
> | Baatartsogt, O                       |
> | University of Jena, Germany          |
>  --------------------------------------
> 

Condor Support Information:
http://www.cs.wisc.edu/condor/condor-support/
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>