[Condor-users] 'Can't find address of local schedd' appeared after restarting the cluster



Dear Condor experts,

Due to a power cut we shut down our cluster two days ago. Today I brought
the cluster back up and now get the following error when submitting Condor
jobs (I did not change any Condor configuration):

[valtical09] /data5/qing/condor_test > condor_submit data11_177986_Egamma.txt.6.job
ERROR: Can't find address of local schedd

1. condor_status works on the worker node (WN):

[valtical09] /data5/qing/condor_test > condor_status | grep valtical09
slot1@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 1.000 3012 0+00:19:43
slot2@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:05
slot3@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:06
slot4@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:07
slot5@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:08
slot6@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:09
slot7@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:10
slot8@xxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 3012 0+00:20:03

2. The Condor manager is working fine:

[root@valtical00 /]# service condor status
Condor is running (pid 18441)
[root@valtical00 /]# ps -ef | grep condor
condor 18441 1 0 22:56 ? 00:00:01 /opt/condor-7.5.0/usr/sbin/condor_master -pidfile /opt/condor-7.5.0/var/run/condor/condor.pid
condor 18442 18441 0 22:56 ? 00:00:00 condor_collector -f
condor 18443 18441 0 22:56 ? 00:00:00 condor_negotiator -f
condor 18444 18441 0 22:56 ? 00:00:00 condor_schedd -f
condor 18445 18441 0 22:56 ? 00:00:00 condor_startd -f
root 18446 18444 0 22:56 ? 00:00:00 condor_procd -A /opt/condor-7.5.0/var/run/condor/procd_pipe.SCHEDD -R 10000000 -S 60 -C 102
root 19200 16803 0 23:24 pts/2 00:00:00 grep condor


3. The condor daemon on the WN is also working fine:

[valtical09] /data5/qing/condor_test > service condor status
Condor is running (pid 10677)
[valtical09] /data5/qing/condor_test > ps -ef | grep condor
condor 10677 1 0 23:04 ? 00:00:00 /opt/condor-7.5.0/usr/sbin/condor_master -pidfile /opt/condor-7.5.0/var/run/condor/condor.pid
condor 10678 10677 0 23:04 ? 00:00:00 condor_startd -f
qing 10820 10360 0 23:25 pts/5 00:00:00 grep condor
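
The two ps listings above can be compared mechanically. What follows is a minimal sketch, assuming the `ps -ef` text is pasted into strings (abbreviated here to the lines shown above); `condor_daemons` is a hypothetical helper, not a Condor tool.

```python
import re

def condor_daemons(ps_text):
    """Return the sorted, de-duplicated condor_* daemon names in ps -ef text."""
    return sorted(set(re.findall(r"\b(condor_[a-z]+)\b", ps_text)))

# Output copied from the manager (valtical00), wrapped lines rejoined.
manager_ps = """\
condor 18441 1 0 22:56 ? 00:00:01 /opt/condor-7.5.0/usr/sbin/condor_master -pidfile /opt/condor-7.5.0/var/run/condor/condor.pid
condor 18442 18441 0 22:56 ? 00:00:00 condor_collector -f
condor 18443 18441 0 22:56 ? 00:00:00 condor_negotiator -f
condor 18444 18441 0 22:56 ? 00:00:00 condor_schedd -f
condor 18445 18441 0 22:56 ? 00:00:00 condor_startd -f
root 18446 18444 0 22:56 ? 00:00:00 condor_procd -A /opt/condor-7.5.0/var/run/condor/procd_pipe.SCHEDD -R 10000000 -S 60 -C 102
"""

# Output copied from the worker node (valtical09).
worker_ps = """\
condor 10677 1 0 23:04 ? 00:00:00 /opt/condor-7.5.0/usr/sbin/condor_master -pidfile /opt/condor-7.5.0/var/run/condor/condor.pid
condor 10678 10677 0 23:04 ? 00:00:00 condor_startd -f
"""

manager = condor_daemons(manager_ps)
worker = condor_daemons(worker_ps)
# Daemons present on the manager but absent on the worker node.
missing_on_worker = sorted(set(manager) - set(worker))
print(worker)
print(missing_on_worker)
```

On the listings above, the worker node shows only condor_master and condor_startd, so no condor_schedd is running on valtical09 itself.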

4. All worker nodes are allowed to read from and write to the manager:

[root@valtical00 spool]# condor_config_val -verbose HOSTALLOW_READ
HOSTALLOW_READ: *.cern.ch
Defined in '/opt/condor-7.5.0/etc/condor/condor_config.local', line 43.

[root@valtical00 condor]# condor_config_val -verbose HOSTALLOW_WRITE
HOSTALLOW_WRITE: *.cern.ch
Defined in '/opt/condor-7.5.0/etc/condor/condor_config.local', line 44

5. The root disk is not full, although several data disks are at 97-99%:

[root@valtical00 spool]# df -l
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
447080904 40500792 383503248 10% /
/dev/sda1 101086 26536 69331 28% /boot
tmpfs 12337996 0 12337996 0% /dev/shm
/dev/sdb1 1892333360 1762428688 32229052 99% /localdisk
/dev/sdc1 1892333360 1762081812 32575928 99% /localdisk2
/dev/sdd1 1922858352 1256654720 568528032 69% /localdisk3
/dev/sde1 1922858352 1242946828 582235924 69% /localdisk4
/dev/sdg1 1922858352 1792627332 32555420 99% /localdisk5
/dev/sdf1 1922858352 189594508 1635588244 11% /work
/dev/sdh1 1922858352 1757815520 67367232 97% /data5
AFS 9000000 0 9000000 0% /afs
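
To make the nearly full filesystems in the df output above easier to spot, here is a minimal sketch that maps each mount point to its Use% value. It assumes the df text is pasted into a string; `usage_by_mount` and the 97% threshold are hypothetical choices for illustration.

```python
import re

def usage_by_mount(df_text):
    """Map mount point -> Use% for each df row, tolerating wrapped device names."""
    usage = {}
    for line in df_text.splitlines():
        # A data row ends with "<digits>% <mount point>"; the header does not.
        m = re.search(r"(\d+)%\s+(/\S*)\s*$", line)
        if m:
            usage[m.group(2)] = int(m.group(1))
    return usage

df_text = """\
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
447080904 40500792 383503248 10% /
/dev/sda1 101086 26536 69331 28% /boot
tmpfs 12337996 0 12337996 0% /dev/shm
/dev/sdb1 1892333360 1762428688 32229052 99% /localdisk
/dev/sdc1 1892333360 1762081812 32575928 99% /localdisk2
/dev/sdd1 1922858352 1256654720 568528032 69% /localdisk3
/dev/sde1 1922858352 1242946828 582235924 69% /localdisk4
/dev/sdg1 1922858352 1792627332 32555420 99% /localdisk5
/dev/sdf1 1922858352 189594508 1635588244 11% /work
/dev/sdh1 1922858352 1757815520 67367232 97% /data5
AFS 9000000 0 9000000 0% /afs
"""

usage = usage_by_mount(df_text)
# Mount points at or above the (hypothetical) 97% threshold.
nearly_full = sorted(m for m, pct in usage.items() if pct >= 97)
print(usage["/"])
print(nearly_full)
```

With the numbers above, / is at 10%, while /data5 (where the job is submitted from) and three of the /localdisk volumes are at 97% or more.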

6. Excerpt from CollectorLog:

12/11/11 23:30:27 (Sending 49 ads in response to query)
12/11/11 23:31:24 NegotiatorAd : Inserting ** "< valtical00.cern.ch >"
12/11/11 23:31:27 (Sending 59 ads in response to query)
12/11/11 23:31:27 Got QUERY_STARTD_PVT_ADS
12/11/11 23:31:27 (Sending 49 ads in response to query)
12/11/11 23:32:27 (Sending 59 ads in response to query)
12/11/11 23:32:27 Got QUERY_STARTD_PVT_ADS
12/11/11 23:32:27 (Sending 49 ads in response to query)
12/11/11 23:33:27 (Sending 59 ads in response to query)
12/11/11 23:33:27 Got QUERY_STARTD_PVT_ADS
12/11/11 23:33:27 (Sending 49 ads in response to query)
12/11/11 23:34:27 (Sending 59 ads in response to query)
12/11/11 23:34:27 Got QUERY_STARTD_PVT_ADS
12/11/11 23:34:27 (Sending 49 ads in response to query)
12/11/11 23:35:27 (Sending 59 ads in response to query)
12/11/11 23:35:27 Got QUERY_STARTD_PVT_ADS
12/11/11 23:35:27 (Sending 49 ads in response to query)

7. Excerpt from MasterLog:

12/11/11 22:56:24 Setting maximum accepts per cycle 4.
12/11/11 22:56:24 ******************************************************
12/11/11 22:56:24 ** condor_master (CONDOR_MASTER) STARTING UP
12/11/11 22:56:24 ** /opt/condor-7.5.0/usr/sbin/condor_master
12/11/11 22:56:24 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
12/11/11 22:56:24 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
12/11/11 22:56:24 ** $CondorVersion: 7.6.4 Oct 20 2011 BuildID: 379441 $
12/11/11 22:56:24 ** $CondorPlatform: x86_64_rhap_5 $
12/11/11 22:56:24 ** PID = 18441
12/11/11 22:56:24 ** Log last touched time unavailable (No such file or directory)
12/11/11 22:56:24 ******************************************************
12/11/11 22:56:24 Using config source: /opt/condor-7.5.0/etc/condor/condor_config
12/11/11 22:56:24 Using local config sources:
12/11/11 22:56:24 /opt/condor-7.5.0/etc/condor/condor_config.local
12/11/11 22:56:24 DaemonCore: command socket at <137.138.40.140:59895>
12/11/11 22:56:24 DaemonCore: private command socket at <137.138.40.140:59895>
12/11/11 22:56:24 Setting maximum accepts per cycle 4.
12/11/11 22:56:24 Started DaemonCore process "/opt/condor-7.5.0/usr/sbin/condor_collector", pid and pgroup = 18442
12/11/11 22:56:24 Started DaemonCore process "/opt/condor-7.5.0/usr/sbin/condor_negotiator", pid and pgroup = 18443
12/11/11 22:56:24 Started DaemonCore process "/opt/condor-7.5.0/usr/sbin/condor_schedd", pid and pgroup = 18444
12/11/11 22:56:24 Started DaemonCore process "/opt/condor-7.5.0/usr/sbin/condor_startd", pid and pgroup = 18445
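
The startup banner above carries two version strings: the install path says condor-7.5.0, while $CondorVersion reports 7.6.4. Here is a minimal sketch for extracting both so they can be compared across hosts. `versions` is a hypothetical helper, and a mismatch may only mean the directory was not renamed after an upgrade.

```python
import re

def versions(log_text):
    """Return (path_version, binary_version) found in a daemon startup banner."""
    path = re.search(r"/opt/condor-(\d+\.\d+\.\d+)/", log_text)
    binary = re.search(r"\$CondorVersion: (\d+\.\d+\.\d+) ", log_text)
    return (path.group(1) if path else None,
            binary.group(1) if binary else None)

# Two banner lines copied from the MasterLog excerpt.
banner = """\
12/11/11 22:56:24 ** /opt/condor-7.5.0/usr/sbin/condor_master
12/11/11 22:56:24 ** $CondorVersion: 7.6.4 Oct 20 2011 BuildID: 379441 $
"""

path_version, binary_version = versions(banner)
print(path_version, binary_version)
```

Running the same check against the logs on a worker node would show whether every host starts the same binary version.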

8. Excerpt from NegotiatorLog:

12/11/11 23:37:27 Phase 1: Obtaining ads from collector ...
12/11/11 23:37:27 Getting all public ads ...
12/11/11 23:37:28 Sorting 59 ads ...
12/11/11 23:37:28 Getting startd private ads ...
12/11/11 23:37:28 Got ads: 59 public and 49 private
12/11/11 23:37:28 Public ads include 0 submitter, 49 startd
12/11/11 23:37:28 Phase 2: Performing accounting ...
12/11/11 23:37:28 Phase 3: Sorting submitter ads by priority ...
12/11/11 23:37:28 Phase 4.1: Negotiating with schedds ...
12/11/11 23:37:28 negotiateWithGroup resources used scheddAds length 0
12/11/11 23:37:28 ---------- Finished Negotiation Cycle ----------
12/11/11 23:38:28 ---------- Started Negotiation Cycle ----------
12/11/11 23:38:28 Phase 1: Obtaining ads from collector ...
12/11/11 23:38:28 Getting all public ads ...
12/11/11 23:38:28 Sorting 59 ads ...
12/11/11 23:38:28 Getting startd private ads ...
12/11/11 23:38:28 Got ads: 59 public and 49 private
12/11/11 23:38:28 Public ads include 0 submitter, 49 startd
12/11/11 23:38:28 Phase 2: Performing accounting ...
12/11/11 23:38:28 Phase 3: Sorting submitter ads by priority ...
12/11/11 23:38:28 Phase 4.1: Negotiating with schedds ...
12/11/11 23:38:28 negotiateWithGroup resources used scheddAds length 0
12/11/11 23:38:28 ---------- Finished Negotiation Cycle ----------
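
The line to watch in each negotiation cycle above is "Public ads include N submitter, M startd". The following minimal sketch pulls those counts out of a NegotiatorLog excerpt, assuming the log text is pasted into a string; `ad_counts` is a hypothetical helper.

```python
import re

AD_COUNTS = re.compile(r"Public ads include (\d+) submitter, (\d+) startd")

def ad_counts(log_text):
    """Return one (submitter, startd) tuple per matching log line."""
    return [(int(sub), int(startd)) for sub, startd in AD_COUNTS.findall(log_text)]

# Lines copied from the two negotiation cycles shown above.
negotiator_log = """\
12/11/11 23:37:28 Got ads: 59 public and 49 private
12/11/11 23:37:28 Public ads include 0 submitter, 49 startd
12/11/11 23:38:28 Got ads: 59 public and 49 private
12/11/11 23:38:28 Public ads include 0 submitter, 49 startd
"""

counts = ad_counts(negotiator_log)
print(counts)
```

Both cycles report 49 startd ads and 0 submitter ads, which would be expected as long as no job has been queued successfully.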

9. Excerpt from SchedLog:

12/11/11 22:56:24 (pid:18444) Setting maximum accepts per cycle 4.
12/11/11 22:56:24 (pid:18444) ******************************************************
12/11/11 22:56:24 (pid:18444) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
12/11/11 22:56:24 (pid:18444) ** /opt/condor-7.5.0/usr/sbin/condor_schedd
12/11/11 22:56:24 (pid:18444) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
12/11/11 22:56:24 (pid:18444) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
12/11/11 22:56:24 (pid:18444) ** $CondorVersion: 7.6.4 Oct 20 2011 BuildID: 379441 $
12/11/11 22:56:24 (pid:18444) ** $CondorPlatform: x86_64_rhap_5 $
12/11/11 22:56:24 (pid:18444) ** PID = 18444
12/11/11 22:56:24 (pid:18444) ** Log last touched time unavailable (No such file or directory)
12/11/11 22:56:24 (pid:18444) ******************************************************
12/11/11 22:56:24 (pid:18444) Using config source: /opt/condor-7.5.0/etc/condor/condor_config
12/11/11 22:56:24 (pid:18444) Using local config sources:
12/11/11 22:56:24 (pid:18444) /opt/condor-7.5.0/etc/condor/condor_config.local
12/11/11 22:56:24 (pid:18444) DaemonCore: command socket at <137.138.40.140:35736>
12/11/11 22:56:24 (pid:18444) DaemonCore: private command socket at <137.138.40.140:35736>
12/11/11 22:56:24 (pid:18444) Setting maximum accepts per cycle 4.
12/11/11 22:56:24 (pid:18444) History file rotation is enabled.
12/11/11 22:56:24 (pid:18444) Maximum history file size is: 20971520 bytes
12/11/11 22:56:24 (pid:18444) Number of rotated history files is: 2
12/11/11 22:56:29 (pid:18444) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
12/11/11 23:01:29 (pid:18444) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
12/11/11 23:06:30 (pid:18444) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
12/11/11 23:11:31 (pid:18444) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
12/11/11 23:16:32 (pid:18444) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
12/11/11 23:21:33 (pid:18444) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
12/11/11 23:26:34 (pid:18444) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
12/11/11 23:31:35 (pid:18444) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
12/11/11 23:36:36 (pid:18444) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s

10. Excerpt from StartLog:

12/11/11 22:56:24 Setting maximum accepts per cycle 4.
12/11/11 22:56:24 ******************************************************
12/11/11 22:56:24 ** condor_startd (CONDOR_STARTD) STARTING UP
12/11/11 22:56:24 ** /opt/condor-7.5.0/usr/sbin/condor_startd
12/11/11 22:56:24 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
12/11/11 22:56:24 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
12/11/11 22:56:24 ** $CondorVersion: 7.6.4 Oct 20 2011 BuildID: 379441 $
12/11/11 22:56:24 ** $CondorPlatform: x86_64_rhap_5 $
12/11/11 22:56:24 ** PID = 18445
12/11/11 22:56:24 ** Log last touched time unavailable (No such file or directory)
12/11/11 22:56:24 ******************************************************
12/11/11 22:56:24 Using config source: /opt/condor-7.5.0/etc/condor/condor_config
12/11/11 22:56:24 Using local config sources:
12/11/11 22:56:24 /opt/condor-7.5.0/etc/condor/condor_config.local
12/11/11 22:56:24 DaemonCore: command socket at <137.138.40.140:47336>
12/11/11 22:56:24 DaemonCore: private command socket at <137.138.40.140:47336>
12/11/11 22:56:24 Setting maximum accepts per cycle 4.
12/11/11 22:56:29 VM-gahp server reported an internal error
12/11/11 22:56:29 VM universe will be tested to check if it is available
12/11/11 22:56:29 History file rotation is enabled.
12/11/11 22:56:29 Maximum history file size is: 20971520 bytes
12/11/11 22:56:29 Number of rotated history files is: 2
12/11/11 22:56:29 New machine resource allocated
12/11/11 22:56:29 CronJobList: Adding job 'mips'
12/11/11 22:56:29 CronJobList: Adding job 'kflops'
12/11/11 22:56:29 CronJob: Initializing job 'mips' (/opt/condor-7.5.0/usr/libexec/condor/condor_mips)
12/11/11 22:56:29 CronJob: Initializing job 'kflops' (/opt/condor-7.5.0/usr/libexec/condor/condor_kflops)
12/11/11 22:56:29 State change: IS_OWNER is false
12/11/11 22:56:29 Changing state: Owner -> Unclaimed
12/11/11 22:56:29 State change: RunBenchmarks is TRUE
12/11/11 22:56:29 Changing activity: Idle -> Benchmarking
12/11/11 22:56:29 BenchMgr:StartBenchmarks()
12/11/11 22:56:53 State change: benchmarks completed
12/11/11 22:56:53 Changing activity: Benchmarking -> Idle

Any idea where the problem is?

Cheers,
Gang