
Re: [Condor-users] Cannot add execute node to pool



Thanks Tim.

My firewalls have been punched between these two machines for ports
9600-9700, and I verified this (at least for TCP) with telnet.
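The telnet check can also be scripted. Here is a minimal Python sketch, assuming the same 9600-9700 range; open_tcp_ports is a hypothetical helper name, not part of Condor. It only exercises TCP, so a pass says nothing about whether UDP traffic gets through, and a port shows up only if a daemon is actually listening on it.

```python
import socket

def open_tcp_ports(host, ports, timeout=1.0):
    """Return the ports on `host` that accept a TCP connection.

    A port is reported only when something is listening on it and the
    firewall lets the connection through; refused or timed-out attempts
    are silently skipped.
    """
    reachable = []
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append(port)
        except OSError:
            pass
    return reachable
```

Run from the exec node, something like open_tcp_ports("dione.ia.unc.edu", range(9600, 9701)) should list the ports where the central manager's daemons are reachable.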

I verified that my host info is set properly on the exec node
(hector, .228), and it seems to be correctly listing my central manager
node (dione, .185); see below.

I've set STARTD_DEBUG = D_FULLDEBUG on the exec node, and am attaching
the output.

This output looks similar to the output on another setup I have, with two
Ubuntu machines running Condor 7.4.4, where the exec node connects
correctly.  In that setup I see in the CollectorLog:

StartdAd     : Inserting ** "< slot3@$$exec node's name$$, 10.171.2.232 >"

In the case of my exec node that won't join, I don't see anything in
the CollectorLog saying success or failure.
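
To make the "nothing in the CollectorLog" check repeatable, here is a rough Python sketch that filters a CollectorLog for StartdAd insert lines mentioning a given node. It assumes the line format shown in the working setup above ("StartdAd ... Inserting ..."); startd_ad_inserts is a hypothetical helper name, and the log path is whatever your central manager uses.

```python
def startd_ad_inserts(log_lines, node):
    """Return log lines that record a StartdAd insert mentioning `node`.

    Assumes the format seen in the working pool: a line containing both
    "StartdAd" and "Inserting", with the node's name inside the ad key.
    """
    return [
        line.rstrip("\n")
        for line in log_lines
        if "StartdAd" in line and "Inserting" in line and node in line
    ]
```

Passing an open file handle for the CollectorLog as log_lines and "hector" as node should return an empty list for as long as the exec node's ads are not reaching the collector.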








condor_config_val -dump | grep HOST
ALLOW_ADMINISTRATOR = $(CONDOR_HOST)
ALLOW_NEGOTIATOR = $(CONDOR_HOST)
ALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
ALLOW_OWNER = $(FULL_HOSTNAME), $(ALLOW_ADMINISTRATOR)
COLLECTOR_HOST = dione.ia.unc.edu
COLLECTOR_HOST_STRING = "$(COLLECTOR_HOST)"
CONDOR_ADMIN = root@$(FULL_HOSTNAME)
CONDOR_HOST = dione.ia.unc.edu
FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
FULL_HOSTNAME = hector.ia.unc.edu
HOSTNAME = hector
STARTD_ATTRS = COLLECTOR_HOST_STRING
TCP_FORWARDING_HOST =
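
The comparison of CONDOR_HOST and COLLECTOR_HOST can be done mechanically as well. Below is a small Python sketch that parses the "NAME = value" lines above and checks both settings against the central manager's name. The function names are hypothetical, and the comparison is on literal strings: it does not expand $(...) macros, so it would report a mismatch if COLLECTOR_HOST were shown as $(CONDOR_HOST).

```python
def parse_config_dump(text):
    """Parse 'NAME = value' lines into a dict; values are kept unexpanded."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines that have no '=' at all
            settings[name.strip()] = value.strip()
    return settings

def hosts_match(settings, central_manager):
    """True if CONDOR_HOST and COLLECTOR_HOST both equal `central_manager`."""
    return all(
        settings.get(key) == central_manager
        for key in ("CONDOR_HOST", "COLLECTOR_HOST")
    )
```

For the dump above, hosts_match(parse_config_dump(dump_text), "dione.ia.unc.edu") should come back True, where dump_text holds the captured output.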



On Thu, Jun 9, 2011 at 9:38 AM, Timothy St. Clair <tstclair@xxxxxxxxxx> wrote:
> You might want to verify that your _HOST information is set properly on
> your exec node, and that your firewalls have been punched
> appropriately.
>
> condor_config_val -dump | grep HOST
>
> verify CONDOR_HOST and COLLECTOR_HOST are correct.
>
> If all else fails set:
> STARTD_DEBUG = D_FULLDEBUG and repost
>
> Cheers,
> Tim
>
> On Wed, 2011-06-08 at 16:47 -0400, Michael Grauer wrote:
>> I'm running Condor 7.6.1 on two different CentOS 5.6 machines, one (2
>> cpus, call it Twoproc) being the CONDOR_HOST (and for now is also a
>> submit and execute node), and the other (16 cpus, call it Sixteenproc)
>> I would like to add to this grid as an execute node, but can't get the
>> slots added.
>>
>> Sixteenproc has MASTER and STARTD daemons running, and when I call
>> condor_status on it, it returns the 2 slots from Twoproc, so it seems
>> that Sixteenproc can correctly connect to Twoproc.
>>
>> I can't seem to find any evidence in the logs on either machine that
>> Sixteenproc is trying to get its slots added to Twoproc's grid.
>>
>>
>>
>> Any advice on how to debug this?  It would be much appreciated.
>>
>> I'm appending the StartLog output from Sixteenproc in case that helps.
>>
>>
>>
>> Thanks,
>> Mike
>>
>>
>>
>>
>> 06/08/11 15:31:55 Setting maximum accepts per cycle 4.
>> 06/08/11 15:31:55 ******************************************************
>> 06/08/11 15:31:55 ** condor_startd (CONDOR_STARTD) STARTING UP
>> 06/08/11 15:31:55 ** /usr/sbin/condor_startd
>> 06/08/11 15:31:55 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
>> 06/08/11 15:31:55 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
>> 06/08/11 15:31:55 ** $CondorVersion: 7.6.1 May 31 2011 BuildID: 339001 $
>> 06/08/11 15:31:55 ** $CondorPlatform: x86_64_rhap_5 $
>> 06/08/11 15:31:55 ** PID = 2293
>> 06/08/11 15:31:55 ** Log last touched time unavailable (No such file or directory)
>> 06/08/11 15:31:55 ******************************************************
>> 06/08/11 15:31:55 Using config source: /etc/condor/condor_config
>> 06/08/11 15:31:55 Using local config sources:
>> 06/08/11 15:31:55    /etc/condor/condor_config.local
>> 06/08/11 15:31:55 DaemonCore: command socket at <SixteenProc'sIP:9608>
>> 06/08/11 15:31:55 DaemonCore: private command socket at <SixteenProc'sIP:9608>
>> 06/08/11 15:31:55 Setting maximum accepts per cycle 4.
>> 06/08/11 15:32:01 VM-gahp server reported an internal error
>> 06/08/11 15:32:01 VM universe will be tested to check if it is available
>> 06/08/11 15:32:01 History file rotation is enabled.
>> 06/08/11 15:32:01   Maximum history file size is: 20971520 bytes
>> 06/08/11 15:32:01   Number of rotated history files is: 2
>> 06/08/11 15:32:01 slot1: New machine resource allocated
>> 06/08/11 15:32:01 slot2: New machine resource allocated
>> 06/08/11 15:32:01 slot3: New machine resource allocated
>> 06/08/11 15:32:01 slot4: New machine resource allocated
>> 06/08/11 15:32:01 slot5: New machine resource allocated
>> 06/08/11 15:32:01 slot6: New machine resource allocated
>> 06/08/11 15:32:01 slot7: New machine resource allocated
>> 06/08/11 15:32:01 slot8: New machine resource allocated
>> 06/08/11 15:32:01 slot9: New machine resource allocated
>> 06/08/11 15:32:01 slot10: New machine resource allocated
>> 06/08/11 15:32:01 slot11: New machine resource allocated
>> 06/08/11 15:32:01 slot12: New machine resource allocated
>> 06/08/11 15:32:01 slot13: New machine resource allocated
>> 06/08/11 15:32:01 slot14: New machine resource allocated
>> 06/08/11 15:32:01 slot15: New machine resource allocated
>> 06/08/11 15:32:01 slot16: New machine resource allocated
>> 06/08/11 15:32:01 CronJobList: Adding job 'mips'
>> 06/08/11 15:32:01 CronJobList: Adding job 'kflops'
>> 06/08/11 15:32:01 CronJob: Initializing job 'mips' (/usr/libexec/condor/condor_mips)
>> 06/08/11 15:32:01 CronJob: Initializing job 'kflops' (/usr/libexec/condor/condor_kflops)
>> 06/08/11 15:32:01 slot1: State change: IS_OWNER is false
>> 06/08/11 15:32:01 slot1: Changing state: Owner -> Unclaimed
>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>> 06/08/11 15:32:01 slot1: Changing activity: Idle -> Benchmarking
>> 06/08/11 15:32:01 BenchMgr:StartBenchmarks()
>> 06/08/11 15:32:01 slot2: State change: IS_OWNER is false
>> 06/08/11 15:32:01 slot2: Changing state: Owner -> Unclaimed
>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>> 06/08/11 15:32:01 slot2: Changing activity: Idle -> Benchmarking
>> 06/08/11 15:32:01 slot2: Changing activity: Benchmarking -> Idle
>> 06/08/11 15:32:01 slot3: State change: IS_OWNER is false
>> 06/08/11 15:32:01 slot3: Changing state: Owner -> Unclaimed
>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>> 06/08/11 15:32:01 slot3: Changing activity: Idle -> Benchmarking
>> 06/08/11 15:32:01 slot3: Changing activity: Benchmarking -> Idle
>> 06/08/11 15:32:01 slot4: State change: IS_OWNER is false
>> 06/08/11 15:32:01 slot4: Changing state: Owner -> Unclaimed
>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>> 06/08/11 15:32:01 slot4: Changing activity: Idle -> Benchmarking
>> 06/08/11 15:32:01 slot4: Changing activity: Benchmarking -> Idle
>> 06/08/11 15:32:01 slot5: State change: IS_OWNER is false
>> 06/08/11 15:32:01 slot5: Changing state: Owner -> Unclaimed
>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>> 06/08/11 15:32:01 slot5: Changing activity: Idle -> Benchmarking
>> 06/08/11 15:32:01 slot5: Changing activity: Benchmarking -> Idle
>> 06/08/11 15:32:01 slot6: State change: IS_OWNER is false
>> 06/08/11 15:32:01 slot6: Changing state: Owner -> Unclaimed
>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>> 06/08/11 15:32:01 slot6: Changing activity: Idle -> Benchmarking
>> 06/08/11 15:32:01 slot6: Changing activity: Benchmarking -> Idle
>> 06/08/11 15:32:01 slot7: State change: IS_OWNER is false
>> 06/08/11 15:32:01 slot7: Changing state: Owner -> Unclaimed
>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>> 06/08/11 15:32:01 slot7: Changing activity: Idle -> Benchmarking
>> 06/08/11 15:32:01 slot7: Changing activity: Benchmarking -> Idle
>> 06/08/11 15:32:01 slot8: State change: IS_OWNER is false
>> 06/08/11 15:32:01 slot8: Changing state: Owner -> Unclaimed
>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>> 06/08/11 15:32:01 slot8: Changing activity: Idle -> Benchmarking
>> 06/08/11 15:32:01 slot8: Changing activity: Benchmarking -> Idle
>> 06/08/11 15:32:01 slot9: State change: IS_OWNER is false
>> 06/08/11 15:32:01 slot9: Changing state: Owner -> Unclaimed
>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>> 06/08/11 15:32:01 slot9: Changing activity: Idle -> Benchmarking
>> 06/08/11 15:32:01 slot9: Changing activity: Benchmarking -> Idle
>> 06/08/11 15:32:01 slot10: State change: IS_OWNER is false
>> 06/08/11 15:32:01 slot10: Changing state: Owner -> Unclaimed
>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>> 06/08/11 15:32:01 slot10: Changing activity: Idle -> Benchmarking
>> 06/08/11 15:32:01 slot10: Changing activity: Benchmarking -> Idle
>> 06/08/11 15:32:01 slot11: State change: IS_OWNER is false
>> 06/08/11 15:32:01 slot11: Changing state: Owner -> Unclaimed
>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>> 06/08/11 15:32:01 slot11: Changing activity: Idle -> Benchmarking
>> 06/08/11 15:32:01 slot11: Changing activity: Benchmarking -> Idle
>> 06/08/11 15:32:01 slot12: State change: IS_OWNER is false
>> 06/08/11 15:32:01 slot12: Changing state: Owner -> Unclaimed
>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>> 06/08/11 15:32:01 slot12: Changing activity: Idle -> Benchmarking
>> 06/08/11 15:32:01 slot12: Changing activity: Benchmarking -> Idle
>> 06/08/11 15:32:01 slot13: State change: IS_OWNER is false
>> 06/08/11 15:32:01 slot13: Changing state: Owner -> Unclaimed
>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>> 06/08/11 15:32:01 slot13: Changing activity: Idle -> Benchmarking
>> 06/08/11 15:32:01 slot13: Changing activity: Benchmarking -> Idle
>> 06/08/11 15:32:01 slot14: State change: IS_OWNER is false
>> 06/08/11 15:32:01 slot14: Changing state: Owner -> Unclaimed
>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>> 06/08/11 15:32:01 slot14: Changing activity: Idle -> Benchmarking
>> 06/08/11 15:32:01 slot14: Changing activity: Benchmarking -> Idle
>> 06/08/11 15:32:01 slot15: State change: IS_OWNER is false
>> 06/08/11 15:32:01 slot15: Changing state: Owner -> Unclaimed
>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>> 06/08/11 15:32:01 slot15: Changing activity: Idle -> Benchmarking
>> 06/08/11 15:32:01 slot15: Changing activity: Benchmarking -> Idle
>> 06/08/11 15:32:01 slot16: State change: IS_OWNER is false
>> 06/08/11 15:32:01 slot16: Changing state: Owner -> Unclaimed
>> 06/08/11 15:32:01 State change: RunBenchmarks is TRUE
>> 06/08/11 15:32:01 slot16: Changing activity: Idle -> Benchmarking
>> 06/08/11 15:32:01 slot16: Changing activity: Benchmarking -> Idle
>> 06/08/11 15:32:23 State change: benchmarks completed
>> 06/08/11 15:32:23 slot1: Changing activity: Benchmarking -> Idle
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/condor-users/

Attachment: StartLog.gz
Description: GNU Zip compressed data