
Re: [Condor-users] Cannot add execute node to pool



Thanks for the tips on the firewalls, Tim.  I should have looked into
this more closely, especially since the logs showed that the exec node's
slots were attempting to join over UDP.

The sys-admin for these machines told me that the firewall had been
punched for both UDP and TCP on a known Condor port range, but it was
actually only punched for TCP.
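
For anyone hitting the same symptom on a stock CentOS iptables setup,
opening the range for UDP as well as TCP looks roughly like the
following (just a sketch; the exact chain and port range depend on your
site):

  # allow the Condor port range for both protocols, not just TCP
  iptables -I INPUT -p tcp --dport 9600:9700 -j ACCEPT
  iptables -I INPUT -p udp --dport 9600:9700 -j ACCEPT
  # persist across reboots (CentOS 5 style)
  service iptables save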

I wrote a little Python UDP client/server to test this (with the server
bound to "ALL") and saw that messages weren't getting through.  I've
attached it at the end of this message in case it is helpful to anyone
else.



-Mike



On Thu, Jun 9, 2011 at 1:09 PM, Michael Grauer
<michael.grauer@xxxxxxxxxxx> wrote:
> Thanks Tim.
>
> My firewalls have been punched between these two machines for ports
> 9600-9700, and I verified this (at least for TCP) with telnet.
>
> I verified that my host info is set properly on the exec node
> (hector, .228), and it seems to be correctly listing my central manager
> node (dione, .185); see below.
>
> I've set STARTD_DEBUG = FULL_DEBUG on the exec node, and am attaching
> the output.
>
> This output looks similar to the output from another setup I have, two
> Ubuntu machines running Condor 7.4.4, where the exec node connects
> correctly.  In that setup the CollectorLog shows lines like
>
> StartdAd     : Inserting ** "< slot3@$$exec node's name$$, 10.171.2.232 >"
>
> For the exec node that won't join, the CollectorLog shows nothing at
> all, neither a success nor a failure.
>
>
> condor_config_val -dump | grep HOST
> ALLOW_ADMINISTRATOR = $(CONDOR_HOST)
> ALLOW_NEGOTIATOR = $(CONDOR_HOST)
> ALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
> ALLOW_OWNER = $(FULL_HOSTNAME), $(ALLOW_ADMINISTRATOR)
> COLLECTOR_HOST = dione.ia.unc.edu
> COLLECTOR_HOST_STRING = "$(COLLECTOR_HOST)"
> CONDOR_ADMIN = root@$(FULL_HOSTNAME)
> CONDOR_HOST = dione.ia.unc.edu
> FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
> FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
> FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
> FULL_HOSTNAME = hector.ia.unc.edu
> HOSTNAME = hector
> STARTD_ATTRS = COLLECTOR_HOST_STRING
> TCP_FORWARDING_HOST =
>
>
>
> On Thu, Jun 9, 2011 at 9:38 AM, Timothy St. Clair <tstclair@xxxxxxxxxx> wrote:
>> You might want to verify that your _HOST information is set properly
>> on your exec node, and that your firewalls have been punched
>> appropriately.
>>
>> condor_config_val -dump | grep HOST
>>
>> verify CONDOR_HOST and COLLECTOR_HOST are correct.
>>
>> If all else fails, set:
>> STARTD_DEBUG = D_FULLDEBUG and repost
>>
>> Cheers,
>> Tim
>>
>> On Wed, 2011-06-08 at 16:47 -0400, Michael Grauer wrote:
>>> I'm running Condor 7.6.1 on two different CentOS 5.6 machines: one (2
>>> CPUs, call it Twoproc) is the CONDOR_HOST and, for now, also a submit
>>> and execute node; the other (16 CPUs, call it Sixteenproc) is the one
>>> I would like to add to this pool as an execute node, but I can't get
>>> its slots added.
>>>
>>> Sixteenproc has MASTER and STARTD daemons running, and when I call
>>> condor_status on it, it returns the 2 slots from Twoproc, so it seems
>>> that Sixteenproc can correctly connect to Twoproc.
>>>
>>> I can't seem to find any evidence in the logs on either machine that
>>> Sixteenproc is trying to get its slots added to Twoproc's pool.
>>>
>>>
>>>
>>> Any advice on how to debug this?  It would be much appreciated.
>>>
>>> I'm appending the StartLog output from Sixteenproc in case that helps.
>>>
>>>
>>>
>>> Thanks,
>>> Mike
>>>
>>>
>>>
>>>
>>> 06/08/11 15:31:55 Setting maximum accepts per cycle 4.
>>> 06/08/11 15:31:55 ******************************************************
>>> 06/08/11 15:31:55 ** condor_startd (CONDOR_STARTD) STARTING UP
>>> 06/08/11 15:31:55 ** /usr/sbin/condor_startd
>>> 06/08/11 15:31:55 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
>>> 06/08/11 15:31:55 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
>>> 06/08/11 15:31:55 ** $CondorVersion: 7.6.1 May 31 2011 BuildID: 339001 $
>>> 06/08/11 15:31:55 ** $CondorPlatform: x86_64_rhap_5 $
>>> 06/08/11 15:31:55 ** PID = 2293
>>> 06/08/11 15:31:55 ** Log last touched time unavailable (No such file
>>> or directory)
>>> 06/08/11 15:31:55 ******************************************************
>>> 06/08/11 15:31:55 Using config source: /etc/condor/condor_config
>>> 06/08/11 15:31:55 Using local config sources:
>>> 06/08/11 15:31:55    /etc/condor/condor_config.local
>>> 06/08/11 15:31:55 DaemonCore: command socket at <SixteenProc'sIP:9608>
>>> 06/08/11 15:31:55 DaemonCore: private command socket at <SixteenProc'sIP:9608>
>>> 06/08/11 15:31:55 Setting maximum accepts per cycle 4.
>>> 06/08/11 15:32:01 VM-gahp server reported an internal error
>>> 06/08/11 15:32:01 VM universe will be tested to check if it is available
>>> 06/08/11 15:32:01 History file rotation is enabled.
>>> 06/08/11 15:32:01   Maximum history file size is: 20971520 bytes
>>> 06/08/11 15:32:01   Number of rotated history files is: 2
>>> 06/08/11 15:32:01 slot1: New machine resource allocated
>>> 06/08/11 15:32:01 slot2: New machine resource allocated
>>> 06/08/11 15:32:01 slot3: New machine resource allocated
>>> 06/08/11 15:32:01 slot4: New machine resource allocated
>>> 06/08/11 15:32:01 slot5: New machine resource allocated
>>> 06/08/11 15:32:01 slot6: New machine resource allocated
>>> 06/08/11 15:32:01 slot7: New machine resource allocated
>>> 06/08/11 15:32:01 slot8: New machine resource allocated
>>> 06/08/11 15:32:01 slot9: New machine resource allocated
>>> 06/08/11 15:32:01 slot10: New machine resource allocated
>>> 06/08/11 15:32:01 slot11: New machine resource allocated
>>> 06/08/11 15:32:01 slot12: New machine resource allocated
>>> 06/08/11 15:32:01 slot13: New machine resource allocated
>>> 06/08/11 15:32:01 slot14: New machine resource allocated
>>> 06/08/11 15:32:01 slot15: New machine resource allocated
>>> 06/08/11 15:32:01 slot16: New machine resource allocated
>>> 06/08/11 15:32:01 CronJobList: Adding job 'mips'
>>> 06/08/11 15:32:01 CronJobList: Adding job 'kflops'
>>> 06/08/11 15:32:01 CronJob: Initializing job 'mips'
>>> (/usr/libexec/condor/condor_mips)
>>> 06/08/11 15:32:01 CronJob: Initializing job 'kflops'
>>> (/usr/libexec/condor/condor_kflops)
>>> 06/08/11 15:32:01 slot1: State change: IS_OWNER is false
>>> 06/08/11 15:32:01 slot1: Changing state: Owner -> Unclaimed
>>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>>> 06/08/11 15:32:01 slot1: Changing activity: Idle -> Benchmarking
>>> 06/08/11 15:32:01 BenchMgr:StartBenchmarks()
>>> 06/08/11 15:32:01 slot2: State change: IS_OWNER is false
>>> 06/08/11 15:32:01 slot2: Changing state: Owner -> Unclaimed
>>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>>> 06/08/11 15:32:01 slot2: Changing activity: Idle -> Benchmarking
>>> 06/08/11 15:32:01 slot2: Changing activity: Benchmarking -> Idle
>>> 06/08/11 15:32:01 slot3: State change: IS_OWNER is false
>>> 06/08/11 15:32:01 slot3: Changing state: Owner -> Unclaimed
>>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>>> 06/08/11 15:32:01 slot3: Changing activity: Idle -> Benchmarking
>>> 06/08/11 15:32:01 slot3: Changing activity: Benchmarking -> Idle
>>> 06/08/11 15:32:01 slot4: State change: IS_OWNER is false
>>> 06/08/11 15:32:01 slot4: Changing state: Owner -> Unclaimed
>>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>>> 06/08/11 15:32:01 slot4: Changing activity: Idle -> Benchmarking
>>> 06/08/11 15:32:01 slot4: Changing activity: Benchmarking -> Idle
>>> 06/08/11 15:32:01 slot5: State change: IS_OWNER is false
>>> 06/08/11 15:32:01 slot5: Changing state: Owner -> Unclaimed
>>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>>> 06/08/11 15:32:01 slot5: Changing activity: Idle -> Benchmarking
>>> 06/08/11 15:32:01 slot5: Changing activity: Benchmarking -> Idle
>>> 06/08/11 15:32:01 slot6: State change: IS_OWNER is false
>>> 06/08/11 15:32:01 slot6: Changing state: Owner -> Unclaimed
>>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>>> 06/08/11 15:32:01 slot6: Changing activity: Idle -> Benchmarking
>>> 06/08/11 15:32:01 slot6: Changing activity: Benchmarking -> Idle
>>> 06/08/11 15:32:01 slot7: State change: IS_OWNER is false
>>> 06/08/11 15:32:01 slot7: Changing state: Owner -> Unclaimed
>>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>>> 06/08/11 15:32:01 slot7: Changing activity: Idle -> Benchmarking
>>> 06/08/11 15:32:01 slot7: Changing activity: Benchmarking -> Idle
>>> 06/08/11 15:32:01 slot8: State change: IS_OWNER is false
>>> 06/08/11 15:32:01 slot8: Changing state: Owner -> Unclaimed
>>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>>> 06/08/11 15:32:01 slot8: Changing activity: Idle -> Benchmarking
>>> 06/08/11 15:32:01 slot8: Changing activity: Benchmarking -> Idle
>>> 06/08/11 15:32:01 slot9: State change: IS_OWNER is false
>>> 06/08/11 15:32:01 slot9: Changing state: Owner -> Unclaimed
>>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>>> 06/08/11 15:32:01 slot9: Changing activity: Idle -> Benchmarking
>>> 06/08/11 15:32:01 slot9: Changing activity: Benchmarking -> Idle
>>> 06/08/11 15:32:01 slot10: State change: IS_OWNER is false
>>> 06/08/11 15:32:01 slot10: Changing state: Owner -> Unclaimed
>>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>>> 06/08/11 15:32:01 slot10: Changing activity: Idle -> Benchmarking
>>> 06/08/11 15:32:01 slot10: Changing activity: Benchmarking -> Idle
>>> 06/08/11 15:32:01 slot11: State change: IS_OWNER is false
>>> 06/08/11 15:32:01 slot11: Changing state: Owner -> Unclaimed
>>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>>> 06/08/11 15:32:01 slot11: Changing activity: Idle -> Benchmarking
>>> 06/08/11 15:32:01 slot11: Changing activity: Benchmarking -> Idle
>>> 06/08/11 15:32:01 slot12: State change: IS_OWNER is false
>>> 06/08/11 15:32:01 slot12: Changing state: Owner -> Unclaimed
>>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>>> 06/08/11 15:32:01 slot12: Changing activity: Idle -> Benchmarking
>>> 06/08/11 15:32:01 slot12: Changing activity: Benchmarking -> Idle
>>> 06/08/11 15:32:01 slot13: State change: IS_OWNER is false
>>> 06/08/11 15:32:01 slot13: Changing state: Owner -> Unclaimed
>>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>>> 06/08/11 15:32:01 slot13: Changing activity: Idle -> Benchmarking
>>> 06/08/11 15:32:01 slot13: Changing activity: Benchmarking -> Idle
>>> 06/08/11 15:32:01 slot14: State change: IS_OWNER is false
>>> 06/08/11 15:32:01 slot14: Changing state: Owner -> Unclaimed
>>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>>> 06/08/11 15:32:01 slot14: Changing activity: Idle -> Benchmarking
>>> 06/08/11 15:32:01 slot14: Changing activity: Benchmarking -> Idle
>>> 06/08/11 15:32:01 slot15: State change: IS_OWNER is false
>>> 06/08/11 15:32:01 slot15: Changing state: Owner -> Unclaimed
>>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>>> 06/08/11 15:32:01 slot15: Changing activity: Idle -> Benchmarking
>>> 06/08/11 15:32:01 slot15: Changing activity: Benchmarking -> Idle
>>> 06/08/11 15:32:01 slot16: State change: IS_OWNER is false
>>> 06/08/11 15:32:01 slot16: Changing state: Owner -> Unclaimed
>>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>>> 06/08/11 15:32:01 slot16: Changing activity: Idle -> Benchmarking
>>> 06/08/11 15:32:01 slot16: Changing activity: Benchmarking -> Idle
>>> 06/08/11 15:32:23 State change: benchmarks completed
>>> 06/08/11 15:32:23 slot1: Changing activity: Benchmarking -> Idle
>
# utest.py -- minimal UDP client/server for checking whether UDP traffic
# actually makes it through the firewall between two machines.
import socket
import sys


def server(host, port):
  # Bind a UDP socket and print every datagram that arrives.
  # "ALL" means listen on all interfaces (empty string for bind()).
  if host == 'ALL':
    host = ''
  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  sock.bind((host, port))
  print "Server listening on (", host, ",", port, ")"

  while True:
    data, addr = sock.recvfrom(1024)
    print "from [", addr, "] received message:", data


def client(host, port):
  # Send a single UDP datagram to host:port.
  print "client sending MSG to (", host, ",", port, ")"
  MESSAGE = "Hello, World!"
  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  sock.sendto(MESSAGE, (host, port))


if __name__ == "__main__":
  usage = "usage:\n\npython utest.py s|c host port\n\nwhere s is for server, c is for client\nhost is either an ip or ALL for all ips\n\n"
  if len(sys.argv) < 4:
    print usage
    sys.exit(1)
  else:
    host = sys.argv[2]
    port = int(sys.argv[3])
    if sys.argv[1] == 's':
      server(host, port)
    elif sys.argv[1] == 'c':
      client(host, port)
    else:
      print usage
      sys.exit(1)
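
For reference, the invocation looks like this (any port from the punched
9600-9700 range will do; 9620 below is just an example):

  # on the exec node (hector): listen on all interfaces
  python utest.py s ALL 9620

  # on the central manager (dione): send a test datagram
  python utest.py c hector.ia.unc.edu 9620

If the server never prints the "Hello, World!" message, UDP isn't
getting through.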