
Re: [Condor-users] newbie negotiator error



Tom,

You should only need one negotiator daemon in your condor pool, not one on each machine.
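
As a rough sketch of what that means for the two-machine setup described below, the execute/submit machine would leave the central-manager daemons out of its DAEMON_LIST. Which local config file holds this setting is an assumption here; the packaged layout may differ.

```
# condor-exec (execute + submit only): no COLLECTOR or NEGOTIATOR
DAEMON_LIST = MASTER, SCHEDD, STARTD

# condor-server (central manager that also submits and executes) would use:
# DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD
```

Running condor_config_val DAEMON_LIST on each machine shows the value actually in effect.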

It seems likely that your security settings do not allow the failing negotiator to access the collector (see the ALLOW_NEGOTIATOR configuration setting). You should be able to confirm that by looking in CollectorLog.
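
For reference, a minimal sketch of the relevant setting, assuming the negotiator runs on the same host as the collector (the usual arrangement). The value shown is an illustration, not taken from Tom's configuration.

```
# Only the negotiator on the central manager may contact the collector
# with NEGOTIATOR-level access
ALLOW_NEGOTIATOR = $(CONDOR_HOST)
```

With a setting like this, a second negotiator started on another machine would be refused by the collector, which would be consistent with the connection reset (errno 104) in the log below.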

--Dan

On 11/20/11 1:59 AM, Tom Melendez wrote:
Hi Folks,

I'm a newbie to condor (as in, "today") and I've done the "personal
condor" tutorial without issue.  I've also reviewed some of the slides
on the site.  I'm now trying to span my job across multiple machines,
but can see from the job log that only one is executing it.

On the "other machine", I see this in the error log:
11/19 23:46:27 ---------- Started Negotiation Cycle ----------
11/19 23:46:27 Phase 1:  Obtaining ads from collector ...
11/19 23:46:27   Getting all public ads ...
11/19 23:46:27   Sorting 9 ads ...
11/19 23:46:27   Getting startd private ads ...
11/19 23:46:27 condor_read(): recv() returned -1, errno = 104,
assuming failure reading 5 bytes from unknown source.
11/19 23:46:27 IO: Failed to read packet header
11/19 23:46:27 Couldn't fetch ads: communication error
11/19 23:46:27 Aborting negotiation cycle

Just a little about my setup to give you some context:

I have two machines (technically, these are VMs):
condor-server: this is the central manager and has submit, manager and
execute abilities.
condor-exec: this has execute and submit abilities
- both running Ubuntu 10.04, I installed condor via the packages and
use the start/stop scripts to execute it
- both machines are on the same subnet and I have entries in
/etc/hosts that point to each other with FQDNs.
- I did not set the NO_DNS option, I did set the DEFAULT_DOMAIN_NAME
option, but I don't think I need it due to the host settings above
- I tried using the NETWORK_INTERFACE option with the IP of the
condor-exec VM with no luck
- both machines are running all of the same daemons.  This is contrary
to some of the docs I've seen online (seems like
- the allow read and allow write options in the condor_config on both
machines are set to *
- the CONDOR_HOST variable on condor-exec points to the hostname of condor-server
- condor_status can see both machines (the slots are all "unclaimed")
- condor_q on condor-server shows the jobs, condor_q on condor-exec does not
- I have no file I/O.  At this point, I'm just using the simple.c
example from here:
http://research.cs.wisc.edu/condor/tutorials/cw2005-condor/submit_first.html

Any ideas or suggestions are greatly appreciated.

Thanks,

Tom