Re: [Condor-users] newbie negotiator error



Anyone got any suggestions?  I'm stumped.  I'm running condor 7.2.4,
which looks old, but is considered stable.  The bugs fixed in 7.2.5
don't seem to apply to my situation.

My central manager machine is listening on these ports:
tcp        0      0 0.0.0.0:60781           0.0.0.0:*               LISTEN      5604/condor_schedd
tcp        0      0 0.0.0.0:9618            0.0.0.0:*               LISTEN      5602/condor_collect
tcp        0      0 0.0.0.0:40254           0.0.0.0:*               LISTEN      5603/condor_startd
tcp        0      0 0.0.0.0:34757           0.0.0.0:*               LISTEN      5604/condor_schedd
tcp        0      0 0.0.0.0:39368           0.0.0.0:*               LISTEN      5605/condor_negotia
tcp        0      0 0.0.0.0:49417           0.0.0.0:*               LISTEN      5601/condor_master
tcp        0      0 192.168.157.10:59020    192.168.157.10:34757    ESTABLISHED 5605/condor_negotia
tcp        0      0 192.168.157.10:34757    192.168.157.10:59020    ESTABLISHED 5604/condor_schedd
udp        0      0 0.0.0.0:39368           0.0.0.0:*                           5605/condor_negotia
udp        0      0 0.0.0.0:40254           0.0.0.0:*                           5603/condor_startd
udp        0      0 0.0.0.0:49417           0.0.0.0:*                           5601/condor_master
udp        0      0 0.0.0.0:9618            0.0.0.0:*                           5602/condor_collect
udp        0      0 0.0.0.0:34757           0.0.0.0:*                           5604/condor_schedd
udp        0      0 0.0.0.0:60781           0.0.0.0:*                           5604/condor_schedd
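
One way to check whether these listening ports are actually reachable
from the other machine is a small TCP connect test. The following is a
minimal sketch, assuming Python is available on condor-exec; the
function name can_connect and the 3-second timeout are illustrative
choices, not anything provided by Condor.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False
```

For example, can_connect("192.168.157.10", 9618) run on condor-exec
would test the collector port listed above. A True result only shows
that the port accepts TCP connections; it says nothing about whether
Condor's own authorization settings permit the request.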

Thanks,

Tom

On Sat, Nov 19, 2011 at 11:59 PM, Tom Melendez <tom@xxxxxxxxxxxx> wrote:
> Hi Folks,
>
> I'm a newbie to condor (as in, "today") and I've done the "personal
> condor" tutorial without issue.  I've also reviewed some of the slides
> on the site.  I'm now trying to span my job across multiple machines,
> but can see from the job log that only one is executing it.
>
> On the "other machine", I see this in the error log:
> 11/19 23:46:27 ---------- Started Negotiation Cycle ----------
> 11/19 23:46:27 Phase 1:  Obtaining ads from collector ...
> 11/19 23:46:27   Getting all public ads ...
> 11/19 23:46:27   Sorting 9 ads ...
> 11/19 23:46:27   Getting startd private ads ...
> 11/19 23:46:27 condor_read(): recv() returned -1, errno = 104, assuming failure reading 5 bytes from unknown source.
> 11/19 23:46:27 IO: Failed to read packet header
> 11/19 23:46:27 Couldn't fetch ads: communication error
> 11/19 23:46:27 Aborting negotiation cycle
>
> Just a little about my setup to give you some context:
>
> I have two machines (technically, these are VMs):
> condor-server: this is the central manager and has submit, manager and
> execute abilities.
> condor-exec: this has execute and submit abilities
> - both running Ubuntu 10.04, I installed condor via the packages and
> use the start/stop scripts to execute it
> - both machines are on the same subnet and I have entries in
> /etc/hosts that point to each other with FQDNs.
> - I did not set the NO_DNS option, I did set the DEFAULT_DOMAIN_NAME
> option, but I don't think I need it due to the host settings above
> - I tried using the NETWORK_INTERFACE option with the IP of the
> condor-exec VM with no luck
> - both machines are running all of the same daemons.  This is contrary
> to some of the docs I've seen online (seems like
> - the allow read and allow write options in the condor_config on both
> machines are set to *
> - the condor_host var in condor-exec points to the hostname of the condor-server
> - condor_status can see both machines (the slots are all "unclaimed")
> - condor_q on condor-server shows the jobs, condor_q on condor-exec does not
> - I have no file I/O.  At this point, I'm just using the simple.c
> example from here:
> http://research.cs.wisc.edu/condor/tutorials/cw2005-condor/submit_first.html
>
> Any ideas or suggestions greatly appreciated.
>
> Thanks,
>
> Tom
>
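
The "errno = 104" in the quoted negotiator log can be decoded with
Python's standard errno module. On Linux, 104 is ECONNRESET, meaning
the remote end closed or reset the connection while the negotiator was
reading. This is only a small sketch for looking up the number; the
helper name describe_errno is made up for illustration, and errno
numbering differs between operating systems.

```python
import errno
import os


def describe_errno(code):
    """Map a numeric errno to its symbolic name and the platform's message."""
    # errno.errorcode maps numbers to names such as "ECONNRESET"
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the C library's message for this number
    return name, os.strerror(code)
```

Calling describe_errno(104) on the Ubuntu machines described above
should report ECONNRESET, which suggests the collector accepted the
connection and then dropped it, rather than the port being unreachable.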