
[Condor-users] Condor Status Won't Work on Other Nodes Other than the Central Manager



Hi all,

I hope someone has run into this kind of problem before and solved it. :)

I have installed Condor 6.7.20 from VDT 1.3.11 on two SuSE 9.3 machines and
one SuSE 10.0 Linux machine, following the installation instructions at
http://condor.optena.com/display/CONDOR.

condor_status works fine on the central manager (the physdg03 machine) and
gives the following results:

######################################################################
condor@physdg03:~/local/log> condor_master

condor@physdg03:~/local/log> ps -ef |grep condor_
condor   12045     1  0 23:32 ?        00:00:00 condor_master
condor   12046 12045  0 23:32 ?        00:00:00 condor_collector -f
condor   12047 12045  0 23:32 ?        00:00:00 condor_negotiator -f
condor   12048 12045  0 23:32 ?        00:00:04 condor_startd -f
condor   12051 12045  0 23:32 ?        00:00:00 condor_schedd -f
condor   12182 11767  0 23:58 pts/0    00:00:00 grep condor_

condor@physdg03:~/local/log> condor_q

-- Submitter: physdg03.msuiit.edu.ph : <203.177.109.173:1159> : physdg03.msuiit.edu.ph
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

condor@physdg03:~/local/log> condor_status

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

physdg03.msui LINUX       INTEL  Unclaimed  Idle       0.310   250  0+00:00:04

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX     1     0       0         1       0          0        0

               Total     1     0       0         1       0          0        0
########################################################################

The following are the log files on physdg03 (the central manager) that
contain a failure or warning. I did not include the other log files because
they contain no errors and are probably not much help.

COLLECTORLOG
################################################################
8/22 23:32:43 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
8/22 23:32:43 ** /home/condor/vdt/condor/sbin/condor_collector
8/22 23:32:43 ** $CondorVersion: 6.7.20 Jun 21 2006 $
8/22 23:32:43 ** $CondorPlatform: I386-LINUX_RH9 $
8/22 23:32:43 ** PID = 12046
8/22 23:32:43 ** Log last touched time unavailable (No such file or directory)
8/22 23:32:43 ******************************************************
8/22 23:32:43 Using config source: /home/condor/vdt/condor/etc/condor_config
8/22 23:32:43 Using local config sources:
8/22 23:32:43    /home/condor/local/condor_config.local
8/22 23:32:43 DaemonCore: Command Socket at <203.177.109.173:9618>
8/22 23:32:43 In ViewServer::Init()
8/22 23:32:43 In CollectorDaemon::Init()
8/22 23:32:43 In ViewServer::Config()
8/22 23:32:43 In CollectorDaemon::Config()
8/22 23:32:43 enable: Creating stats hash table
8/22 23:32:43 WARNING:  No master ad for < physdg03.msuiit.edu.ph >
8/22 23:32:43 ScheddAd     : Inserting ** "< physdg03.msuiit.edu.ph , 203.177.109.173 >"
8/22 23:32:43 stats: Inserting new hashent for 'Schedd':'physdg03.msuiit.edu.ph':'203.177.109.173'
8/22 23:32:43 (Sending 1 ads in response to query)
8/22 23:32:43 Got QUERY_STARTD_PVT_ADS
8/22 23:32:43 (Sending 0 ads in response to query)
8/22 23:32:43 NegotiatorAd  : Inserting ** "< physdg03.msuiit.edu.ph >"
8/22 23:32:43 stats: Inserting new hashent for 'Negotiator':'physdg03.msuiit.edu.ph':'203.177.109.173'
8/22 23:32:48 ** Master < physdg03.msuiit.edu.ph > rejuvenated from recently down
8/22 23:32:48 stats: Inserting new hashent for 'Master':'physdg03.msuiit.edu.ph':'203.177.109.173'
8/22 23:32:59 StartdAd     : Inserting ** "< physdg03.msuiit.edu.ph , 203.177.109.173 >"
8/22 23:32:59 stats: Inserting new hashent for 'Start':'physdg03.msuiit.edu.ph':'203.177.109.173'
8/22 23:32:59 StartdPvtAd  : Inserting ** "< physdg03.msuiit.edu.ph , 203.177.109.173 >"
8/22 23:32:59 stats: Inserting new hashent for 'StartdPvt':'physdg03.msuiit.edu.ph':'203.177.109.173'
8/22 23:35:04 Got QUERY_STARTD_ADS
8/22 23:35:04 (Sending 1 ads in response to query)
8/22 23:37:43 (Sending 4 ads in response to query)
8/22 23:37:43 Got QUERY_STARTD_PVT_ADS
8/22 23:37:43 (Sending 1 ads in response to query)
8/22 23:37:43 NegotiatorAd  : Inserting ** "< physdg03.msuiit.edu.ph >"
#######################################################################


SCHEDLOG (central manager)
###############################################################
8/22 23:32:43 (pid:12051) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
8/22 23:32:43 (pid:12051) ** /home/condor/vdt/condor/sbin/condor_schedd
8/22 23:32:43 (pid:12051) ** $CondorVersion: 6.7.20 Jun 21 2006 $
8/22 23:32:43 (pid:12051) ** $CondorPlatform: I386-LINUX_RH9 $
8/22 23:32:43 (pid:12051) ** PID = 12051
8/22 23:32:43 (pid:12051) ** Log last touched time unavailable (No such file or directory)
8/22 23:32:43 (pid:12051) ******************************************************
8/22 23:32:43 (pid:12051) Using config source: /home/condor/vdt/condor/etc/condor_config
8/22 23:32:43 (pid:12051) Using local config sources:
8/22 23:32:43 (pid:12051)    /home/condor/local/condor_config.local
8/22 23:32:43 (pid:12051) DaemonCore: Command Socket at <203.177.109.173:1159>
8/22 23:32:43 (pid:12051) History file rotation is enabled.
8/22 23:32:43 (pid:12051)   Maximum history file size is: 20971520 bytes
8/22 23:32:43 (pid:12051)   Number of rotated history files is: 2
8/22 23:35:00 (pid:12051) IO: Failed to read packet header
######################################################################


On the other machines (physdg02 and physdg01), I ran condor_init and
condor_master and then tried some tests (condor_q and condor_status). I got
the following on physdg01 (physdg02 behaves the same way):

############################################################
condor@physdg01:~/local/log> condor_init
/home/condor/condor_config already exists.
/home/condor/condor_config already exists.
/home/condor/condor_config already exists.
/home/condor/local/log already exists.
/home/condor/local/spool already exists.
/home/condor/local/execute already exists.
/home/condor/local/condor_config.local already exists.
Condor has been initialized, but not started.

condor@physdg01:~/local/log> condor_master

condor@physdg01:~/local/log> ps -ef |grep condor_
condor   13273     1  0 23:45 ?        00:00:00 condor_master
condor   13274 13273 24 23:45 ?        00:00:02 condor_startd -f
condor   13275 13273  0 23:45 ?        00:00:00 condor_schedd -f
condor   13291 12957  0 23:45 pts/2    00:00:00 grep condor_

condor@physdg01:~/local/log> condor_q

-- Submitter: physdg01.msuiit.edu.ph : <203.177.109.170:1403> : physdg01.msuiit.edu.ph
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

condor@physdg01:~/local/log> condor_status

condor@physdg01:~/local/log>
###############################################################

...condor_status prints NOTHING. I expected it to list both machines,
physdg03 and physdg01.

These are the log files for this node (physdg01):

MASTERLOG
##############################################################
8/22 23:45:07 ** condor_master (CONDOR_MASTER) STARTING UP
8/22 23:45:07 ** /home/condor/vdt/condor/sbin/condor_master
8/22 23:45:07 ** $CondorVersion: 6.7.20 Jun 21 2006 $
8/22 23:45:07 ** $CondorPlatform: I386-LINUX_RH9 $
8/22 23:45:07 ** PID = 13273
8/22 23:45:07 ** Log last touched 8/22 23:32:06
8/22 23:45:07 ******************************************************
8/22 23:45:07 Using config source: /home/condor/vdt/condor/etc/condor_config
8/22 23:45:07 Using local config sources:
8/22 23:45:07    /home/condor/local/condor_config.local
8/22 23:45:07 DaemonCore: Command Socket at <203.177.109.170:1401>
8/22 23:45:07 Started DaemonCore process "/home/condor/vdt/condor/sbin/condor_startd", pid and pgroup = 13274
8/22 23:45:07 Started DaemonCore process "/home/condor/vdt/condor/sbin/condor_schedd", pid and pgroup = 13275
8/22 23:45:33 attempt to connect to <203.177.109.170:1407> timed out
8/22 23:45:33 ERROR: SECMAN:2003:TCP connection to <203.177.109.170:1407> failed

8/22 23:45:33 Failed to start non-blocking update to <203.177.109.170:1268>.
8/22 23:50:33 attempt to connect to <203.177.109.170:1414> timed out
8/22 23:50:33 ERROR: SECMAN:2003:TCP connection to <203.177.109.170:1414> failed

8/22 23:50:33 Failed to start non-blocking update to <203.177.109.170:1270>.
######################################################################

SCHEDLOG
#####################################################################
8/22 16:00:23 (pid:9866) ******************************************************
8/22 16:00:23 (pid:9866) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
8/22 16:00:23 (pid:9866) ** /home/condor/vdt/condor/sbin/condor_schedd
8/22 16:00:23 (pid:9866) ** $CondorVersion: 6.7.20 Jun 21 2006 $
8/22 16:00:23 (pid:9866) ** $CondorPlatform: I386-LINUX_RH9 $
8/22 16:00:23 (pid:9866) ** PID = 9866
8/22 16:00:23 (pid:9866) ** Log last touched time unavailable (No such file or directory)
8/22 16:00:23 (pid:9866) ******************************************************
8/22 16:00:23 (pid:9866) Using config source: /home/condor/vdt/condor/etc/condor_config
8/22 16:00:23 (pid:9866) Using local config sources:
8/22 16:00:23 (pid:9866)    /home/condor/local/condor_config.local
8/22 16:00:23 (pid:9866) DaemonCore: Command Socket at <203.177.109.170:1349>
8/22 16:00:23 (pid:9866) History file rotation is enabled.
8/22 16:00:23 (pid:9866)   Maximum history file size is: 20971520 bytes
8/22 16:00:23 (pid:9866)   Number of rotated history files is: 2
8/22 16:00:45 (pid:9866) attempt to connect to <203.177.109.170:1351> timed out
8/22 16:00:45 (pid:9866) ERROR: SECMAN:2003:TCP connection to <203.177.109.170:1351> failed

8/22 16:00:45 (pid:9866) Failed to start non-blocking update to <203.177.109.170:1236>.
########################################################################

STARTLOG
########################################################################
8/22 23:45:07 ******************************************************
8/22 23:45:07 ** condor_startd (CONDOR_STARTD) STARTING UP
8/22 23:45:07 ** /home/condor/vdt/condor/sbin/condor_startd
8/22 23:45:07 ** $CondorVersion: 6.7.20 Jun 21 2006 $
8/22 23:45:07 ** $CondorPlatform: I386-LINUX_RH9 $
8/22 23:45:07 ** PID = 13274
8/22 23:45:07 ** Log last touched 8/22 23:32:21
8/22 23:45:07 ******************************************************
8/22 23:45:07 Using config source: /home/condor/vdt/condor/etc/condor_config
8/22 23:45:07 Using local config sources:
8/22 23:45:07    /home/condor/local/condor_config.local
8/22 23:45:07 DaemonCore: Command Socket at <203.177.109.170:1402>
8/22 23:45:14 New machine resource allocated
8/22 23:45:14 About to run initial benchmarks.
8/22 23:45:19 Completed initial benchmarks.
8/22 23:45:19 State change: IS_OWNER is false
8/22 23:45:19 Changing state: Owner -> Unclaimed
8/22 23:45:44 attempt to connect to <203.177.109.170:1409> timed out
8/22 23:45:44 ERROR: SECMAN:2003:TCP connection to <203.177.109.170:1409> failed

8/22 23:45:44 Failed to start non-blocking update to <203.177.109.170:1269>.
8/22 23:50:44 attempt to connect to <203.177.109.170:1415> timed out
8/22 23:50:44 ERROR: SECMAN:2003:TCP connection to <203.177.109.170:1415> failed

8/22 23:50:44 Failed to start non-blocking update to <203.177.109.170:1272>.
8/22 23:55:44 attempt to connect to <203.177.109.170:1418> timed out
8/22 23:55:44 ERROR: SECMAN:2003:TCP connection to <203.177.109.170:1418> failed

8/22 23:55:44 Failed to start non-blocking update to <203.177.109.170:1278>.
########################################################################
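
To make the failures in the logs above easier to scan, here is a minimal
sketch that tallies which addresses the timed-out connection attempts were
aimed at. It is my own throwaway helper, not part of Condor: the function
name timed_out_hosts and the sample text are illustrative assumptions, and
the pattern only covers the "attempt to connect to <ip:port> timed out"
lines shown above (it tolerates the line wrapping added by the mailer).

```python
import re
from collections import Counter

# Matches log lines such as:
#   8/22 23:45:33 attempt to connect to <203.177.109.170:1407> timed out
# \s+ also allows "timed out" to be wrapped onto the next line.
TIMEOUT_RE = re.compile(
    r"attempt to connect to <(\d+\.\d+\.\d+\.\d+):(\d+)>\s+timed out"
)


def timed_out_hosts(log_text):
    """Count timed-out connection attempts per destination IP address."""
    return Counter(m.group(1) for m in TIMEOUT_RE.finditer(log_text))


# A few lines copied from the physdg01 logs, used here as sample input.
sample = """\
8/22 23:45:07 DaemonCore: Command Socket at <203.177.109.170:1401>
8/22 23:45:33 attempt to connect to <203.177.109.170:1407> timed out
8/22 23:50:33 attempt to connect to <203.177.109.170:1414> timed out
8/22 16:00:45 (pid:9866) attempt to connect to <203.177.109.170:1351>
timed out
"""

counts = timed_out_hosts(sample)
for host, n in sorted(counts.items()):
    print(host, n)
```

In the physdg01 logs above, every timed-out attempt is aimed at
203.177.109.170, which is physdg01's own address, and none at
203.177.109.173, the address of the central manager.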

I really need to get this solved as soon as possible.

Thanks in advance...


Leo Cristobal C. Ambolode II
Physics Department, MSU-IIT, PHILIPPINES