
[Condor-users] issues with negotiator and configuration



Hello, 

I've installed Condor on three Ubuntu-based machines using the latest release from the Debian repository, via apt-get install condor after updating my local sources. Everything installed without problems. That install uses the personal Condor configuration, which must be customized.

I have one machine that is the host (central manager) and two machines that are submit/execute. I customized the local config file; the version for the submit machine is included below. I used ufw to open the ports needed.

I don't think the connection to the negotiator is working correctly. When I execute condor_q, it reports the queue quickly. When I execute condor_q -analyze, it takes a very long time and then reports that it could not connect to the negotiator. I've also included sample output from what I think are the relevant log files. I've searched on this issue using Google and other sources. Others have reported something similar, but I wasn't able to find a clear resolution. I'm wondering if anyone has some insights on this.

setti@MDR-U1-VM:~$ condor_q -analyze
Error: Could not connect to negotiator (MDR-BELL4164-U1)

-----------------------------------------

##  What machine is your central manager?

#CONDOR_HOST = $(FULL_HOSTNAME)
CONDOR_HOST = 130.184.159.107

ALLOW_ADMINISTRATOR = $(CONDOR_HOST),$(IP_ADDRESS),$(FULL_HOSTNAME)

## Pool's short description

COLLECTOR_NAME = UA-INEG-CONDOR at $(CONDOR_HOST)

CONDOR_ADMIN = rossetti@xxxxxxxx

ALLOW_READ = $(CONDOR_HOST), $(FULL_HOSTNAME), $(IP_ADDRESS), 130.184.*.*,130.184.159.107, 130.184.158.107,130.184.159.6
ALLOW_WRITE = $(CONDOR_HOST), $(FULL_HOSTNAME), $(IP_ADDRESS), 130.184.*.*,130.184.159.107, 130.184.158.107,130.184.159.6

HIGHPORT = 9700
LOWPORT = 9600

NETWORK_INTERFACE = 130.184.158.107

##  When is this machine willing to start a job? 

START = TRUE


##  When to suspend a job?

SUSPEND = FALSE


##  When to nicely stop a job?
##  (as opposed to killing it instantaneously)

PREEMPT = FALSE


##  When to instantaneously kill a preempting job
##  (e.g. if a job is in the pre-empting stage for too long)

KILL = FALSE

##  This macro determines what daemons the condor_master will start and keep its watchful eyes on.
##  The list is a comma or space separated list of subsystem names

DAEMON_LIST = MASTER, SCHEDD, STARTD

------------------------------------------------------
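
As a cross-check of the port settings above, here is a minimal sketch that pulls the <ip:port> addresses out of log text and flags any port outside LOWPORT..HIGHPORT. The function names (ports_in_log, outside_range) are hypothetical, and the range is the one from this machine's config; whether the central manager uses the same range is an assumption.

```python
import re

# Values copied from the config above (this machine only; the central
# manager's range is assumed, not confirmed, to be the same).
LOWPORT = 9600
HIGHPORT = 9700

# Matches addresses written as <ip:port> or <ip:port?params> in the logs.
ADDRESS_RE = re.compile(r"<(\d{1,3}(?:\.\d{1,3}){3}):(\d+)(?:\?[^>]*)?>")


def ports_in_log(text):
    """Return the set of (ip, port) pairs found in a chunk of log text."""
    return {(ip, int(port)) for ip, port in ADDRESS_RE.findall(text)}


def outside_range(pairs, low=LOWPORT, high=HIGHPORT):
    """Return, sorted, the pairs whose port is not within low..high inclusive."""
    return sorted(p for p in pairs if not low <= p[1] <= high)
```

Feeding it the log excerpts from this message (for example outside_range(ports_in_log(log_text)), where log_text is the pasted log) should show whether any daemon is listening on a port that the ufw rules for this range would not cover.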


08/17/11 20:43:42 Communicating with shadow <130.184.158.107:9609?noUDP>
08/17/11 20:43:42 Submitting machine is "uaf38185.ddns.uark.edu"


08/17/11 20:43:42 ERROR: the submitting host claims to be in our UidDomain (MDR-U1-VM), yet its hostname (uaf38185.ddns.uark.edu) does not match.  If the above hostname is actually an IP address, Condor could not perform a reverse DNS lookup to convert the IP back into a name.  To solve this problem, you can either correctly configure DNS to allow the reverse lookup, or you can enable TRUST_UID_DOMAIN in your condor configuration.

08/17/11 20:43:42 IPVERIFY: unable to resolve IP address of MDR-U1-VMs
08/17/11 20:43:42 IPVERIFY: unable to resolve IP address of MDR-U1-VMs
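
The name-resolution complaints above can be checked outside Condor. This is a minimal sketch using only Python's standard socket module; the helper names (forward_lookup, reverse_lookup) are hypothetical, and it only shows what the operating system's resolver returns, which is assumed to be what Condor sees as well.

```python
import socket


def forward_lookup(name):
    """Return the IPv4 address for name, or None if it does not resolve."""
    try:
        return socket.gethostbyname(name)
    except OSError:  # covers socket.gaierror
        return None


def reverse_lookup(ip):
    """Return the primary hostname for ip, or None if reverse lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None
```

On the machine that wrote these log lines, forward_lookup("MDR-U1-VM") and reverse_lookup("130.184.158.107") would show whether the short name resolves at all and which hostname the IP maps back to.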

08/17/11 20:23:42 (pid:1988) Sent ad to central manager for rossetti@MDR-U1-VM
08/17/11 20:23:42 (pid:1988) Sent ad to 1 collectors for rossetti@MDR-U1-VM
08/17/11 20:24:03 (pid:1988) attempt to connect to <130.184.159.107:9646> failed: timed out after 20 seconds.
08/17/11 20:24:03 (pid:1988) Failed to send RESCHEDULE to negotiator MDR-BELL4164-U1: SECMAN:2004:Failed to create security session to <130.184.159.107:9646> with TCP.
|SECMAN:2003:TCP connection to <130.184.159.107:9646> failed.

08/17/11 20:43:42 (pid:1988) Sent ad to central manager for rossetti@MDR-U1-VM
08/17/11 20:43:42 (pid:1988) Sent ad to 1 collectors for rossetti@MDR-U1-VM
08/17/11 20:43:42 (pid:1988) Haven't heard from negotiator, trying to claim local startd @ <130.184.158.107:9675>
08/17/11 20:43:42 (pid:1988) Checking consistency running and runnable jobs
08/17/11 20:43:42 (pid:1988) Tables are consistent
08/17/11 20:43:42 (pid:1988) Rebuilt prioritized runnable job list in 0.000s.
08/17/11 20:43:42 (pid:1988) Claiming local startd slot 1 at <130.184.158.107:9675>
08/17/11 20:43:42 (pid:1988) Negotiator gone, trying to use our local startd
08/17/11 20:43:42 (pid:1988) Completed REQUEST_CLAIM to startd MDR-U1-VM <130.184.158.107:9675> for rossetti
08/17/11 20:43:42 (pid:1988) Starting add_shadow_birthdate(1.0)
08/17/11 20:43:42 (pid:1988) Started shadow for job 1.0 on MDR-U1-VM <130.184.158.107:9675> for rossetti, (shadow pid = 6690)
08/17/11 20:43:43 (pid:1988) Shadow pid 6690 for job 1.0 reports job exit reason 100.
08/17/11 20:43:43 (pid:1988) Checking consistency running and runnable jobs
08/17/11 20:43:43 (pid:1988) Tables are consistent
08/17/11 20:43:43 (pid:1988) Rebuilt prioritized runnable job list in 0.000s.  (Expedited rebuild because no match was found)
08/17/11 20:43:43 (pid:1988) Completed RELEASE_CLAIM to startd at <130.184.158.107:9675>
08/17/11 20:43:43 (pid:1988) Match record (MDR-U1-VM <130.184.158.107:9675> for rossetti, 1.0) deleted
08/17/11 20:44:03 (pid:1988) attempt to connect to <130.184.159.107:9646> failed: timed out after 20 seconds.
08/17/11 20:44:03 (pid:1988) Failed to send RESCHEDULE to negotiator MDR-BELL4164-U1: SECMAN:2004:Failed to create security session to <130.184.159.107:9646> with TCP.
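
To separate a network or firewall problem from a Condor configuration problem, the failed connection in the lines above can be tried directly. This is a minimal sketch with a hypothetical helper name (can_connect); the 20-second default simply mirrors the timeout reported in the log.

```python
import socket


def can_connect(host, port, timeout=20.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running can_connect("130.184.159.107", 9646) from the submit machine would show whether the negotiator address from the log is reachable over TCP at all, independent of Condor's security session setup.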

-----------------------------------------------------
Manuel D. Rossetti, Ph.D., P.E.
Professor and Associate Department Head
University of Arkansas 
Department of Industrial Engineering
4207 Bell Engineering Center
Fayetteville, AR 72701
Phone: (479) 575-6756
Fax: (479) 575-8431
email: rossetti@xxxxxxxx
www: www.uark.edu/~rossetti