Re: [Condor-users] newbie negotiator error
- Date: Sun, 20 Nov 2011 15:42:08 -0800
- From: Tom Melendez <tom@xxxxxxxxxxxx>
- Subject: Re: [Condor-users] newbie negotiator error
Anyone got any suggestions? I'm stumped. I'm running condor 7.2.4,
which looks old, but is considered stable. The bugs fixed in 7.2.5
don't seem to apply to my situation.
My central manager machine is listening on these ports:
tcp 0 0 0.0.0.0:60781 0.0.0.0:*
tcp 0 0 0.0.0.0:9618 0.0.0.0:*
tcp 0 0 0.0.0.0:40254 0.0.0.0:*
tcp 0 0 0.0.0.0:34757 0.0.0.0:*
tcp 0 0 0.0.0.0:39368 0.0.0.0:*
tcp 0 0 0.0.0.0:49417 0.0.0.0:*
tcp 0 0 192.168.157.10:59020 192.168.157.10:34757
tcp 0 0 192.168.157.10:34757 192.168.157.10:59020
udp 0 0 0.0.0.0:39368 0.0.0.0:*
udp 0 0 0.0.0.0:40254 0.0.0.0:*
udp 0 0 0.0.0.0:49417 0.0.0.0:*
udp 0 0 0.0.0.0:9618 0.0.0.0:*
udp 0 0 0.0.0.0:34757 0.0.0.0:*
udp 0 0 0.0.0.0:60781 0.0.0.0:*
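A listening socket on the central manager does not by itself show that the other machine can reach it. The following is a minimal sketch, not part of Condor, of a TCP reachability check that could be run from condor-exec. The function name `can_connect` and the timeout value are made up for illustration. The address and port in the usage comment are taken from the netstat output above (9618 is the default collector port).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    Hypothetical helper for illustration only; it tests plain TCP
    reachability and knows nothing about the Condor wire protocol.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False


# Example (run on condor-exec; address and port copied from the listing above):
#   can_connect("192.168.157.10", 9618)
```

If this returns False for the collector port, a firewall or interface binding problem is a more likely culprit than the Condor configuration itself.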
On Sat, Nov 19, 2011 at 11:59 PM, Tom Melendez <tom@xxxxxxxxxxxx> wrote:
> Hi Folks,
> I'm a newbie to condor (as in, "today") and I've done the "personal
> condor" tutorial without issue. I've also reviewed some of the slides
> on the site. I'm now trying to span my job across multiple machines,
> but can see from the job log that only one is executing it.
> On the "other machine", I see this in the error log:
> 11/19 23:46:27 ---------- Started Negotiation Cycle ----------
> 11/19 23:46:27 Phase 1: Obtaining ads from collector ...
> 11/19 23:46:27 Getting all public ads ...
> 11/19 23:46:27 Sorting 9 ads ...
> 11/19 23:46:27 Getting startd private ads ...
> 11/19 23:46:27 condor_read(): recv() returned -1, errno = 104,
> assuming failure reading 5 bytes from unknown source.
> 11/19 23:46:27 IO: Failed to read packet header
> 11/19 23:46:27 Couldn't fetch ads: communication error
> 11/19 23:46:27 Aborting negotiation cycle
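For reference when reading the log above: on Linux, errno 104 is ECONNRESET, meaning the remote end reset the connection while the negotiator was reading. The small lookup below is only a sketch for decoding such numbers. The table is hard-coded from the Linux errno values (so it gives the same answer on any platform), and the name `describe_linux_errno` is hypothetical.

```python
# Linux errno values for a few common network errors (hard-coded on purpose,
# because the numbers differ between operating systems).
LINUX_ERRNO = {
    104: ("ECONNRESET", "Connection reset by peer"),
    110: ("ETIMEDOUT", "Connection timed out"),
    111: ("ECONNREFUSED", "Connection refused"),
    113: ("EHOSTUNREACH", "No route to host"),
}


def describe_linux_errno(code):
    """Return a readable description of a Linux errno seen in a daemon log."""
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in this small table"))
    return f"errno {code}: {name} ({text})"


print(describe_linux_errno(104))
```

A reset (rather than a refusal or timeout) suggests the connection was opened and then dropped by the peer, which is worth keeping in mind when checking the log on the other side.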
> Just a little about my setup to give you some context:
> I have two machines (technically, these are VMs):
> condor-server: this is the central manager and has submit, manager and
> execute abilities.
> condor-exec: this has execute and submit abilities
> - both running Ubuntu 10.04, I installed condor via the packages and
> use the start/stop scripts to execute it
> - both machines are on the same subnet and I have entries in
> /etc/hosts that point to each other with FQDNs.
> - I did not set the NO_DNS option. I did set the DEFAULT_DOMAIN_NAME
> option, but I don't think I need it given the /etc/hosts entries above
> - I tried using the NETWORK_INTERFACE option with the IP of the
> condor-exec VM with no luck
> - both machines are running all of the same daemons. This is contrary
> to some of the docs I've seen online (seems like
> - the allow read and allow write options in the condor_config on both
> machines are set to *
> - the CONDOR_HOST variable on condor-exec points to the hostname of condor-server
> - condor_status can see both machines (the slots are all "unclaimed")
> - condor_q on condor-server shows the jobs, condor_q on condor-exec does not
> - I have no file I/O. At this point, I'm just using the simple.c
> example from here:
> Any ideas or suggestions greatly appreciated.