
Re: [Condor-users] 'Can't find address of local schedd' appeared after restarting the cluster



Hi,

I have also removed the 'InstanceLock' file on each machine and restarted
the condor service, but that did not help.
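In case it helps anyone else hitting the same error: condor_submit locates the local schedd through an address file that the schedd writes on startup, so a stale or missing file after an unclean shutdown can produce exactly this message. A minimal check (the file path is an assumption based on the 7.5.0 install prefix in the logs below, not something I have verified on this cluster):

```shell
# Ask Condor where the schedd publishes its address.
condor_config_val SCHEDD_ADDRESS_FILE

# If the file exists, it should contain the schedd's <IP:port> sinful
# string; compare its timestamp to the last schedd restart.
# (Path is a guess from the install prefix shown in the logs.)
cat /opt/condor-7.5.0/var/log/.schedd_address

# Cross-check: does the collector know about any schedd at all?
# Note that condor_submit talks to the schedd on the *local* machine
# unless told otherwise, so the schedd ad's Machine attribute matters.
condor_status -schedd
```

These commands need a live Condor pool to run against, so treat them as a diagnostic sketch rather than a verified recipe.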

Cheers,
Gang
> Dear Condor expert:
>
> Due to a power cut we shut down our cluster two days ago. Today I brought
> the cluster back up and hit the following error when submitting Condor
> jobs (I didn't change any Condor configuration):
>
> [valtical09] /data5/qing/condor_test > condor_submit
> data11_177986_Egamma.txt.6.job
> ERROR: Can't find address of local schedd
>
> 1. condor_status works on the WN:
>
> [valtical09] /data5/qing/condor_test > condor_status | grep valtical09
> slot1@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 1.000 3012 0+00:19:43
> slot2@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:05
> slot3@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:06
> slot4@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:07
> slot5@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:08
> slot6@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:09
> slot7@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:10
> slot8@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:03
>
> 2. The condor manager is working fine:
>
> [root@valtical00 /]# service condor status
> Condor is running (pid 18441)
> [root@valtical00 /]# ps -ef | grep condor
> condor 18441 1 0 22:56 ? 00:00:01
> /opt/condor-7.5.0/usr/sbin/condor_master -pidfile
> /opt/condor-7.5.0/var/run/condor/condor.pid
> condor 18442 18441 0 22:56 ? 00:00:00 condor_collector -f
> condor 18443 18441 0 22:56 ? 00:00:00 condor_negotiator -f
> condor 18444 18441 0 22:56 ? 00:00:00 condor_schedd -f
> condor 18445 18441 0 22:56 ? 00:00:00 condor_startd -f
> root 18446 18444 0 22:56 ? 00:00:00 condor_procd -A
> /opt/condor-7.5.0/var/run/condor/procd_pipe.SCHEDD -R 10000000 -S 60 -C 102
> root 19200 16803 0 23:24 pts/2 00:00:00 grep condor
>
>
> 3. The condor daemons on the WN are also working fine:
>
> [valtical09] /data5/qing/condor_test > service condor status
> Condor is running (pid 10677)
> [valtical09] /data5/qing/condor_test > ps -ef | grep condor
> condor 10677 1 0 23:04 ? 00:00:00
> /opt/condor-7.5.0/usr/sbin/condor_master -pidfile
> /opt/condor-7.5.0/var/run/condor/condor.pid
> condor 10678 10677 0 23:04 ? 00:00:00 condor_startd -f
> qing 10820 10360 0 23:25 pts/5 00:00:00 grep condor
>
> 4. All worker nodes are allowed to read from and write to the manager:
>
> [root@valtical00 spool]# condor_config_val -verbose HOSTALLOW_READ
> HOSTALLOW_READ: *.cern.ch
> Defined in '/opt/condor-7.5.0/etc/condor/condor_config.local', line 43.
>
> [root@valtical00 condor]# condor_config_val -verbose HOSTALLOW_WRITE
> HOSTALLOW_WRITE: *.cern.ch
> Defined in '/opt/condor-7.5.0/etc/condor/condor_config.local', line 44
>
> 5. The disks are not full:
>
> [root@valtical00 spool]# df -l
> Filesystem 1K-blocks Used Available Use% Mounted on
> /dev/mapper/VolGroup00-LogVol00
> 447080904 40500792 383503248 10% /
> /dev/sda1 101086 26536 69331 28% /boot
> tmpfs 12337996 0 12337996 0% /dev/shm
> /dev/sdb1 1892333360 1762428688 32229052 99% /localdisk
> /dev/sdc1 1892333360 1762081812 32575928 99% /localdisk2
> /dev/sdd1 1922858352 1256654720 568528032 69% /localdisk3
> /dev/sde1 1922858352 1242946828 582235924 69% /localdisk4
> /dev/sdg1 1922858352 1792627332 32555420 99% /localdisk5
> /dev/sdf1 1922858352 189594508 1635588244 11% /work
> /dev/sdh1 1922858352 1757815520 67367232 97% /data5
> AFS 9000000 0 9000000 0% /afs
>
> 6. Some lines from the CollectorLog:
>
> 12/11/11 23:30:27 (Sending 49 ads in response to query)
> 12/11/11 23:31:24 NegotiatorAd : Inserting ** "< valtical00.cern.ch >"
> 12/11/11 23:31:27 (Sending 59 ads in response to query)
> 12/11/11 23:31:27 Got QUERY_STARTD_PVT_ADS
> 12/11/11 23:31:27 (Sending 49 ads in response to query)
> 12/11/11 23:32:27 (Sending 59 ads in response to query)
> 12/11/11 23:32:27 Got QUERY_STARTD_PVT_ADS
> 12/11/11 23:32:27 (Sending 49 ads in response to query)
> 12/11/11 23:33:27 (Sending 59 ads in response to query)
> 12/11/11 23:33:27 Got QUERY_STARTD_PVT_ADS
> 12/11/11 23:33:27 (Sending 49 ads in response to query)
> 12/11/11 23:34:27 (Sending 59 ads in response to query)
> 12/11/11 23:34:27 Got QUERY_STARTD_PVT_ADS
> 12/11/11 23:34:27 (Sending 49 ads in response to query)
> 12/11/11 23:35:27 (Sending 59 ads in response to query)
> 12/11/11 23:35:27 Got QUERY_STARTD_PVT_ADS
> 12/11/11 23:35:27 (Sending 49 ads in response to query)
>
> 7. Some lines from the MasterLog:
>
> 12/11/11 22:56:24 Setting maximum accepts per cycle 4.
> 12/11/11 22:56:24 ******************************************************
> 12/11/11 22:56:24 ** condor_master (CONDOR_MASTER) STARTING UP
> 12/11/11 22:56:24 ** /opt/condor-7.5.0/usr/sbin/condor_master
> 12/11/11 22:56:24 ** SubsystemInfo: name=MASTER type=MASTER(2)
> class=DAEMON(1)
> 12/11/11 22:56:24 ** Configuration: subsystem:MASTER local:<NONE>
> class:DAEMON
> 12/11/11 22:56:24 ** $CondorVersion: 7.6.4 Oct 20 2011 BuildID: 379441 $
> 12/11/11 22:56:24 ** $CondorPlatform: x86_64_rhap_5 $
> 12/11/11 22:56:24 ** PID = 18441
> 12/11/11 22:56:24 ** Log last touched time unavailable (No such file or
> directory)
> 12/11/11 22:56:24 ******************************************************
> 12/11/11 22:56:24 Using config source:
> /opt/condor-7.5.0/etc/condor/condor_config
> 12/11/11 22:56:24 Using local config sources:
> 12/11/11 22:56:24 /opt/condor-7.5.0/etc/condor/condor_config.local
> 12/11/11 22:56:24 DaemonCore: command socket at <137.138.40.140:59895>
> 12/11/11 22:56:24 DaemonCore: private command socket at
> <137.138.40.140:59895>
> 12/11/11 22:56:24 Setting maximum accepts per cycle 4.
> 12/11/11 22:56:24 Started DaemonCore process
> "/opt/condor-7.5.0/usr/sbin/condor_collector", pid and pgroup = 18442
> 12/11/11 22:56:24 Started DaemonCore process
> "/opt/condor-7.5.0/usr/sbin/condor_negotiator", pid and pgroup = 18443
> 12/11/11 22:56:24 Started DaemonCore process
> "/opt/condor-7.5.0/usr/sbin/condor_schedd", pid and pgroup = 18444
> 12/11/11 22:56:24 Started DaemonCore process
> "/opt/condor-7.5.0/usr/sbin/condor_startd", pid and pgroup = 18445
>
> 8. Some lines from the NegotiatorLog:
>
> 12/11/11 23:37:27 Phase 1: Obtaining ads from collector ...
> 12/11/11 23:37:27 Getting all public ads ...
> 12/11/11 23:37:28 Sorting 59 ads ...
> 12/11/11 23:37:28 Getting startd private ads ...
> 12/11/11 23:37:28 Got ads: 59 public and 49 private
> 12/11/11 23:37:28 Public ads include 0 submitter, 49 startd
> 12/11/11 23:37:28 Phase 2: Performing accounting ...
> 12/11/11 23:37:28 Phase 3: Sorting submitter ads by priority ...
> 12/11/11 23:37:28 Phase 4.1: Negotiating with schedds ...
> 12/11/11 23:37:28 negotiateWithGroup resources used scheddAds length 0
> 12/11/11 23:37:28 ---------- Finished Negotiation Cycle ----------
> 12/11/11 23:38:28 ---------- Started Negotiation Cycle ----------
> 12/11/11 23:38:28 Phase 1: Obtaining ads from collector ...
> 12/11/11 23:38:28 Getting all public ads ...
> 12/11/11 23:38:28 Sorting 59 ads ...
> 12/11/11 23:38:28 Getting startd private ads ...
> 12/11/11 23:38:28 Got ads: 59 public and 49 private
> 12/11/11 23:38:28 Public ads include 0 submitter, 49 startd
> 12/11/11 23:38:28 Phase 2: Performing accounting ...
> 12/11/11 23:38:28 Phase 3: Sorting submitter ads by priority ...
> 12/11/11 23:38:28 Phase 4.1: Negotiating with schedds ...
> 12/11/11 23:38:28 negotiateWithGroup resources used scheddAds length 0
> 12/11/11 23:38:28 ---------- Finished Negotiation Cycle ----------
>
> 9. Some lines from the SchedLog:
>
> 12/11/11 22:56:24 (pid:18444) Setting maximum accepts per cycle 4.
> 12/11/11 22:56:24 (pid:18444)
> ******************************************************
> 12/11/11 22:56:24 (pid:18444) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
> 12/11/11 22:56:24 (pid:18444) ** /opt/condor-7.5.0/usr/sbin/condor_schedd
> 12/11/11 22:56:24 (pid:18444) ** SubsystemInfo: name=SCHEDD
> type=SCHEDD(5) class=DAEMON(1)
> 12/11/11 22:56:24 (pid:18444) ** Configuration: subsystem:SCHEDD
> local:<NONE> class:DAEMON
> 12/11/11 22:56:24 (pid:18444) ** $CondorVersion: 7.6.4 Oct 20 2011
> BuildID: 379441 $
> 12/11/11 22:56:24 (pid:18444) ** $CondorPlatform: x86_64_rhap_5 $
> 12/11/11 22:56:24 (pid:18444) ** PID = 18444
> 12/11/11 22:56:24 (pid:18444) ** Log last touched time unavailable (No
> such file or directory)
> 12/11/11 22:56:24 (pid:18444)
> ******************************************************
> 12/11/11 22:56:24 (pid:18444) Using config source:
> /opt/condor-7.5.0/etc/condor/condor_config
> 12/11/11 22:56:24 (pid:18444) Using local config sources:
> 12/11/11 22:56:24 (pid:18444)
> /opt/condor-7.5.0/etc/condor/condor_config.local
> 12/11/11 22:56:24 (pid:18444) DaemonCore: command socket at
> <137.138.40.140:35736>
> 12/11/11 22:56:24 (pid:18444) DaemonCore: private command socket at
> <137.138.40.140:35736>
> 12/11/11 22:56:24 (pid:18444) Setting maximum accepts per cycle 4.
> 12/11/11 22:56:24 (pid:18444) History file rotation is enabled.
> 12/11/11 22:56:24 (pid:18444) Maximum history file size is: 20971520 bytes
> 12/11/11 22:56:24 (pid:18444) Number of rotated history files is: 2
> 12/11/11 22:56:29 (pid:18444) TransferQueueManager stats: active up=0/10
> down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> 12/11/11 23:01:29 (pid:18444) TransferQueueManager stats: active up=0/10
> down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> 12/11/11 23:06:30 (pid:18444) TransferQueueManager stats: active up=0/10
> down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> 12/11/11 23:11:31 (pid:18444) TransferQueueManager stats: active up=0/10
> down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> 12/11/11 23:16:32 (pid:18444) TransferQueueManager stats: active up=0/10
> down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> 12/11/11 23:21:33 (pid:18444) TransferQueueManager stats: active up=0/10
> down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> 12/11/11 23:26:34 (pid:18444) TransferQueueManager stats: active up=0/10
> down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> 12/11/11 23:31:35 (pid:18444) TransferQueueManager stats: active up=0/10
> down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> 12/11/11 23:36:36 (pid:18444) TransferQueueManager stats: active up=0/10
> down=0/10; waiting up=0 down=0; wait time up=0s down=0s
>
> 10. Some lines from the StartLog:
>
> 12/11/11 22:56:24 Setting maximum accepts per cycle 4.
> 12/11/11 22:56:24 ******************************************************
> 12/11/11 22:56:24 ** condor_startd (CONDOR_STARTD) STARTING UP
> 12/11/11 22:56:24 ** /opt/condor-7.5.0/usr/sbin/condor_startd
> 12/11/11 22:56:24 ** SubsystemInfo: name=STARTD type=STARTD(7)
> class=DAEMON(1)
> 12/11/11 22:56:24 ** Configuration: subsystem:STARTD local:<NONE>
> class:DAEMON
> 12/11/11 22:56:24 ** $CondorVersion: 7.6.4 Oct 20 2011 BuildID: 379441 $
> 12/11/11 22:56:24 ** $CondorPlatform: x86_64_rhap_5 $
> 12/11/11 22:56:24 ** PID = 18445
> 12/11/11 22:56:24 ** Log last touched time unavailable (No such file or
> directory)
> 12/11/11 22:56:24 ******************************************************
> 12/11/11 22:56:24 Using config source:
> /opt/condor-7.5.0/etc/condor/condor_config
> 12/11/11 22:56:24 Using local config sources:
> 12/11/11 22:56:24 /opt/condor-7.5.0/etc/condor/condor_config.local
> 12/11/11 22:56:24 DaemonCore: command socket at <137.138.40.140:47336>
> 12/11/11 22:56:24 DaemonCore: private command socket at
> <137.138.40.140:47336>
> 12/11/11 22:56:24 Setting maximum accepts per cycle 4.
> 12/11/11 22:56:29 VM-gahp server reported an internal error
> 12/11/11 22:56:29 VM universe will be tested to check if it is available
> 12/11/11 22:56:29 History file rotation is enabled.
> 12/11/11 22:56:29 Maximum history file size is: 20971520 bytes
> 12/11/11 22:56:29 Number of rotated history files is: 2
> 12/11/11 22:56:29 New machine resource allocated
> 12/11/11 22:56:29 CronJobList: Adding job 'mips'
> 12/11/11 22:56:29 CronJobList: Adding job 'kflops'
> 12/11/11 22:56:29 CronJob: Initializing job 'mips'
> (/opt/condor-7.5.0/usr/libexec/condor/condor_mips)
> 12/11/11 22:56:29 CronJob: Initializing job 'kflops'
> (/opt/condor-7.5.0/usr/libexec/condor/condor_kflops)
> 12/11/11 22:56:29 State change: IS_OWNER is false
> 12/11/11 22:56:29 Changing state: Owner -> Unclaimed
> 12/11/11 22:56:29 State change: RunBenchmarks is TRUE
> 12/11/11 22:56:29 Changing activity: Idle -> Benchmarking
> 12/11/11 22:56:29 BenchMgr:StartBenchmarks()
> 12/11/11 22:56:53 State change: benchmarks completed
> 12/11/11 22:56:53 Changing activity: Benchmarking -> Idle
>
> Any idea where the problem is?
>
> Cheers,
> Gang