Re: [Condor-users] 'Can't find address of local schedd' appeared after restarting the cluster



Dear Condor expert:

The problem is fixed.

Cheers, Gang
> Hi,
>
> I have also removed the 'InstanceLock' file on each machine and restarted the
> condor service, but this did not help.
>
> Cheers, Gang
>
>> Dear Condor expert:
>>
>> Due to a power cut we shut down our cluster 2 days ago. Today I brought the
>> cluster back up and now get the following error when submitting Condor
>> jobs (I didn't change any Condor configuration):
>>
>> [valtical09] /data5/qing/condor_test > condor_submit data11_177986_Egamma.txt.6.job
>> ERROR: Can't find address of local schedd
>>
>> 1. condor_status works on the worker node (WN):
>>
>> [valtical09] /data5/qing/condor_test > condor_status | grep valtical09
>> slot1@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 1.000 3012 0+00:19:43
>> slot2@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:05
>> slot3@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:06
>> slot4@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:07
>> slot5@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:08
>> slot6@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:09
>> slot7@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:10
>> slot8@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:03
>>
>> 2. The Condor manager is working fine:
>>
>> [root@valtical00 /]# service condor status
>> Condor is running (pid 18441)
>> [root@valtical00 /]# ps -ef | grep condor
>> condor 18441 1 0 22:56 ? 00:00:01 /opt/condor-7.5.0/usr/sbin/condor_master -pidfile /opt/condor-7.5.0/var/run/condor/condor.pid
>> condor 18442 18441 0 22:56 ? 00:00:00 condor_collector -f
>> condor 18443 18441 0 22:56 ? 00:00:00 condor_negotiator -f
>> condor 18444 18441 0 22:56 ? 00:00:00 condor_schedd -f
>> condor 18445 18441 0 22:56 ? 00:00:00 condor_startd -f
>> root 18446 18444 0 22:56 ? 00:00:00 condor_procd -A /opt/condor-7.5.0/var/run/condor/procd_pipe.SCHEDD -R 10000000 -S 60 -C 102
>> root 19200 16803 0 23:24 pts/2 00:00:00 grep condor
>>
>>
>> 3. The condor daemon on the WN is also working fine:
>>
>> [valtical09] /data5/qing/condor_test > service condor status
>> Condor is running (pid 10677)
>> [valtical09] /data5/qing/condor_test > ps -ef | grep condor
>> condor 10677 1 0 23:04 ? 00:00:00 /opt/condor-7.5.0/usr/sbin/condor_master -pidfile /opt/condor-7.5.0/var/run/condor/condor.pid
>> condor 10678 10677 0 23:04 ? 00:00:00 condor_startd -f
>> qing 10820 10360 0 23:25 pts/5 00:00:00 grep condor
>>
>> 4. All worker nodes are allowed to read from and write to the manager:
>>
>> [root@valtical00 spool]# condor_config_val -verbose HOSTALLOW_READ
>> HOSTALLOW_READ: *.cern.ch
>> Defined in '/opt/condor-7.5.0/etc/condor/condor_config.local', line 43.
>>
>> [root@valtical00 condor]# condor_config_val -verbose HOSTALLOW_WRITE
>> HOSTALLOW_WRITE: *.cern.ch
>> Defined in '/opt/condor-7.5.0/etc/condor/condor_config.local', line 44
>>
>> 5. The disks are not full:
>>
>> [root@valtical00 spool]# df -l
>> Filesystem                       1K-blocks       Used  Available Use% Mounted on
>> /dev/mapper/VolGroup00-LogVol00  447080904   40500792  383503248  10% /
>> /dev/sda1                           101086      26536      69331  28% /boot
>> tmpfs                             12337996          0   12337996   0% /dev/shm
>> /dev/sdb1                       1892333360 1762428688   32229052  99% /localdisk
>> /dev/sdc1                       1892333360 1762081812   32575928  99% /localdisk2
>> /dev/sdd1                       1922858352 1256654720  568528032  69% /localdisk3
>> /dev/sde1                       1922858352 1242946828  582235924  69% /localdisk4
>> /dev/sdg1                       1922858352 1792627332   32555420  99% /localdisk5
>> /dev/sdf1                       1922858352  189594508 1635588244  11% /work
>> /dev/sdh1                       1922858352 1757815520   67367232  97% /data5
>> AFS                                9000000          0    9000000   0% /afs
>>
>> 6. Some lines from CollectorLog:
>>
>> 12/11/11 23:30:27 (Sending 49 ads in response to query)
>> 12/11/11 23:31:24 NegotiatorAd : Inserting ** "< valtical00.cern.ch >"
>> 12/11/11 23:31:27 (Sending 59 ads in response to query)
>> 12/11/11 23:31:27 Got QUERY_STARTD_PVT_ADS
>> 12/11/11 23:31:27 (Sending 49 ads in response to query)
>> 12/11/11 23:32:27 (Sending 59 ads in response to query)
>> 12/11/11 23:32:27 Got QUERY_STARTD_PVT_ADS
>> 12/11/11 23:32:27 (Sending 49 ads in response to query)
>> 12/11/11 23:33:27 (Sending 59 ads in response to query)
>> 12/11/11 23:33:27 Got QUERY_STARTD_PVT_ADS
>> 12/11/11 23:33:27 (Sending 49 ads in response to query)
>> 12/11/11 23:34:27 (Sending 59 ads in response to query)
>> 12/11/11 23:34:27 Got QUERY_STARTD_PVT_ADS
>> 12/11/11 23:34:27 (Sending 49 ads in response to query)
>> 12/11/11 23:35:27 (Sending 59 ads in response to query)
>> 12/11/11 23:35:27 Got QUERY_STARTD_PVT_ADS
>> 12/11/11 23:35:27 (Sending 49 ads in response to query)
>>
>> 7. Some lines from MasterLog:
>>
>> 12/11/11 22:56:24 Setting maximum accepts per cycle 4.
>> 12/11/11 22:56:24 ******************************************************
>> 12/11/11 22:56:24 ** condor_master (CONDOR_MASTER) STARTING UP
>> 12/11/11 22:56:24 ** /opt/condor-7.5.0/usr/sbin/condor_master
>> 12/11/11 22:56:24 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
>> 12/11/11 22:56:24 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
>> 12/11/11 22:56:24 ** $CondorVersion: 7.6.4 Oct 20 2011 BuildID: 379441 $
>> 12/11/11 22:56:24 ** $CondorPlatform: x86_64_rhap_5 $
>> 12/11/11 22:56:24 ** PID = 18441
>> 12/11/11 22:56:24 ** Log last touched time unavailable (No such file or directory)
>> 12/11/11 22:56:24 ******************************************************
>> 12/11/11 22:56:24 Using config source: /opt/condor-7.5.0/etc/condor/condor_config
>> 12/11/11 22:56:24 Using local config sources:
>> 12/11/11 22:56:24 /opt/condor-7.5.0/etc/condor/condor_config.local
>> 12/11/11 22:56:24 DaemonCore: command socket at <137.138.40.140:59895>
>> 12/11/11 22:56:24 DaemonCore: private command socket at <137.138.40.140:59895>
>> 12/11/11 22:56:24 Setting maximum accepts per cycle 4.
>> 12/11/11 22:56:24 Started DaemonCore process "/opt/condor-7.5.0/usr/sbin/condor_collector", pid and pgroup = 18442
>> 12/11/11 22:56:24 Started DaemonCore process "/opt/condor-7.5.0/usr/sbin/condor_negotiator", pid and pgroup = 18443
>> 12/11/11 22:56:24 Started DaemonCore process "/opt/condor-7.5.0/usr/sbin/condor_schedd", pid and pgroup = 18444
>> 12/11/11 22:56:24 Started DaemonCore process "/opt/condor-7.5.0/usr/sbin/condor_startd", pid and pgroup = 18445
>>
>> 8. Some info in NegotiatorLog:
>>
>> 12/11/11 23:37:27 Phase 1: Obtaining ads from collector ...
>> 12/11/11 23:37:27 Getting all public ads ...
>> 12/11/11 23:37:28 Sorting 59 ads ...
>> 12/11/11 23:37:28 Getting startd private ads ...
>> 12/11/11 23:37:28 Got ads: 59 public and 49 private
>> 12/11/11 23:37:28 Public ads include 0 submitter, 49 startd
>> 12/11/11 23:37:28 Phase 2: Performing accounting ...
>> 12/11/11 23:37:28 Phase 3: Sorting submitter ads by priority ...
>> 12/11/11 23:37:28 Phase 4.1: Negotiating with schedds ...
>> 12/11/11 23:37:28 negotiateWithGroup resources used scheddAds length 0
>> 12/11/11 23:37:28 ---------- Finished Negotiation Cycle ----------
>> 12/11/11 23:38:28 ---------- Started Negotiation Cycle ----------
>> 12/11/11 23:38:28 Phase 1: Obtaining ads from collector ...
>> 12/11/11 23:38:28 Getting all public ads ...
>> 12/11/11 23:38:28 Sorting 59 ads ...
>> 12/11/11 23:38:28 Getting startd private ads ...
>> 12/11/11 23:38:28 Got ads: 59 public and 49 private
>> 12/11/11 23:38:28 Public ads include 0 submitter, 49 startd
>> 12/11/11 23:38:28 Phase 2: Performing accounting ...
>> 12/11/11 23:38:28 Phase 3: Sorting submitter ads by priority ...
>> 12/11/11 23:38:28 Phase 4.1: Negotiating with schedds ...
>> 12/11/11 23:38:28 negotiateWithGroup resources used scheddAds length 0
>> 12/11/11 23:38:28 ---------- Finished Negotiation Cycle ----------
>>
>> 9. Some info in SchedLog:
>>
>> 12/11/11 22:56:24 (pid:18444) Setting maximum accepts per cycle 4.
>> 12/11/11 22:56:24 (pid:18444) ******************************************************
>> 12/11/11 22:56:24 (pid:18444) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
>> 12/11/11 22:56:24 (pid:18444) ** /opt/condor-7.5.0/usr/sbin/condor_schedd
>> 12/11/11 22:56:24 (pid:18444) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
>> 12/11/11 22:56:24 (pid:18444) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
>> 12/11/11 22:56:24 (pid:18444) ** $CondorVersion: 7.6.4 Oct 20 2011 BuildID: 379441 $
>> 12/11/11 22:56:24 (pid:18444) ** $CondorPlatform: x86_64_rhap_5 $
>> 12/11/11 22:56:24 (pid:18444) ** PID = 18444
>> 12/11/11 22:56:24 (pid:18444) ** Log last touched time unavailable (No such file or directory)
>> 12/11/11 22:56:24 (pid:18444) ******************************************************
>> 12/11/11 22:56:24 (pid:18444) Using config source: /opt/condor-7.5.0/etc/condor/condor_config
>> 12/11/11 22:56:24 (pid:18444) Using local config sources:
>> 12/11/11 22:56:24 (pid:18444) /opt/condor-7.5.0/etc/condor/condor_config.local
>> 12/11/11 22:56:24 (pid:18444) DaemonCore: command socket at <137.138.40.140:35736>
>> 12/11/11 22:56:24 (pid:18444) DaemonCore: private command socket at <137.138.40.140:35736>
>> 12/11/11 22:56:24 (pid:18444) Setting maximum accepts per cycle 4.
>> 12/11/11 22:56:24 (pid:18444) History file rotation is enabled.
>> 12/11/11 22:56:24 (pid:18444) Maximum history file size is: 20971520 bytes
>> 12/11/11 22:56:24 (pid:18444) Number of rotated history files is: 2
>> 12/11/11 22:56:29 (pid:18444) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
>> 12/11/11 23:01:29 (pid:18444) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
>> 12/11/11 23:06:30 (pid:18444) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
>> 12/11/11 23:11:31 (pid:18444) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
>> 12/11/11 23:16:32 (pid:18444) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
>> 12/11/11 23:21:33 (pid:18444) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
>> 12/11/11 23:26:34 (pid:18444) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
>> 12/11/11 23:31:35 (pid:18444) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
>> 12/11/11 23:36:36 (pid:18444) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
>>
>> 10. Info in StartLog:
>>
>> 12/11/11 22:56:24 Setting maximum accepts per cycle 4.
>> 12/11/11 22:56:24 ******************************************************
>> 12/11/11 22:56:24 ** condor_startd (CONDOR_STARTD) STARTING UP
>> 12/11/11 22:56:24 ** /opt/condor-7.5.0/usr/sbin/condor_startd
>> 12/11/11 22:56:24 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
>> 12/11/11 22:56:24 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
>> 12/11/11 22:56:24 ** $CondorVersion: 7.6.4 Oct 20 2011 BuildID: 379441 $
>> 12/11/11 22:56:24 ** $CondorPlatform: x86_64_rhap_5 $
>> 12/11/11 22:56:24 ** PID = 18445
>> 12/11/11 22:56:24 ** Log last touched time unavailable (No such file or directory)
>> 12/11/11 22:56:24 ******************************************************
>> 12/11/11 22:56:24 Using config source: /opt/condor-7.5.0/etc/condor/condor_config
>> 12/11/11 22:56:24 Using local config sources:
>> 12/11/11 22:56:24 /opt/condor-7.5.0/etc/condor/condor_config.local
>> 12/11/11 22:56:24 DaemonCore: command socket at <137.138.40.140:47336>
>> 12/11/11 22:56:24 DaemonCore: private command socket at <137.138.40.140:47336>
>> 12/11/11 22:56:24 Setting maximum accepts per cycle 4.
>> 12/11/11 22:56:29 VM-gahp server reported an internal error
>> 12/11/11 22:56:29 VM universe will be tested to check if it is available
>> 12/11/11 22:56:29 History file rotation is enabled.
>> 12/11/11 22:56:29 Maximum history file size is: 20971520 bytes
>> 12/11/11 22:56:29 Number of rotated history files is: 2
>> 12/11/11 22:56:29 New machine resource allocated
>> 12/11/11 22:56:29 CronJobList: Adding job 'mips'
>> 12/11/11 22:56:29 CronJobList: Adding job 'kflops'
>> 12/11/11 22:56:29 CronJob: Initializing job 'mips' (/opt/condor-7.5.0/usr/libexec/condor/condor_mips)
>> 12/11/11 22:56:29 CronJob: Initializing job 'kflops' (/opt/condor-7.5.0/usr/libexec/condor/condor_kflops)
>> 12/11/11 22:56:29 State change: IS_OWNER is false
>> 12/11/11 22:56:29 Changing state: Owner -> Unclaimed
>> 12/11/11 22:56:29 State change: RunBenchmarks is TRUE
>> 12/11/11 22:56:29 Changing activity: Idle -> Benchmarking
>> 12/11/11 22:56:29 BenchMgr:StartBenchmarks()
>> 12/11/11 22:56:53 State change: benchmarks completed
>> 12/11/11 22:56:53 Changing activity: Benchmarking -> Idle
>>
>> Any idea where the problem is?
>>
>> Cheers, Gang
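
A note on item 3 of the quoted report: the process list on the worker node valtical09 shows only condor_master and condor_startd, with no condor_schedd, and condor_submit contacts a schedd on the local machine unless told otherwise. The thread does not say what the actual fix was, so this is only one possible reading. A minimal sketch of that check follows. The has_daemon helper is hypothetical (it is not a Condor tool), and the sample text is the ps output copied from item 3.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Condor): reads `ps -ef` style output on
# stdin and prints "yes" if the named daemon appears in it, "no" otherwise.
has_daemon() {
    local daemon="$1"
    # Drop the line for the grep command itself, then match the daemon
    # name as a whole word anywhere in the remaining lines.
    if grep -v 'grep' | grep -qw -- "$daemon"; then
        echo yes
    else
        echo no
    fi
}

# Process list from the worker node, copied from item 3 of the report
# (wrapped lines rejoined).
wn_ps='condor 10677 1 0 23:04 ? 00:00:00 /opt/condor-7.5.0/usr/sbin/condor_master -pidfile /opt/condor-7.5.0/var/run/condor/condor.pid
condor 10678 10677 0 23:04 ? 00:00:00 condor_startd -f
qing 10820 10360 0 23:25 pts/5 00:00:00 grep condor'

# Report which of the daemons relevant to job submission are present.
for d in condor_master condor_startd condor_schedd; do
    printf '%s: %s\n' "$d" "$(printf '%s\n' "$wn_ps" | has_daemon "$d")"
done
```

On a live node the helper could be fed from `ps -ef` directly. If the schedd is meant to run on the submitting machine, `condor_config_val DAEMON_LIST` shows which daemons the master is configured to start there. If it is meant to run only on the manager, the -name and -remote options of condor_submit can target a schedd on another machine.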