[Condor-users] Problems running on CentOS 5.5



Has anyone else had problems running Condor on CentOS 5.5? We have a
mixture of machines running CentOS 5.3 (Intel x86_64) and 5.5 (AMD x86_64).
I cannot get Condor to accept and run a job without segfaulting on my
CentOS 5.5 machines. The config files on both sets of machines are the
same. Condor starts fine on the CentOS 5.5 machines, but fails when
accepting jobs. I cannot figure out what I am missing.

Below is what I am seeing in StarterLog.slot1 (all the other slots look the
same). The system is running CentOS 5.5 on an AMD x86_64 architecture.

All my other machines in the grid are Intel x86_64, but that shouldn't
matter. They run the exact same version of Condor (7.4.4) and Java 1.6.0_21
without problems. What am I missing? The condor user is a local account
(not LDAP). I tried reducing the number of slots from 12 to 6, and values
in between, with no change.


StarterLog.slot1

03/10 15:46:43 ******************************************************
03/10 15:46:43 ** condor_starter (CONDOR_STARTER) STARTING UP
03/10 15:46:43 ** /opt/condor-7.4.4/sbin/condor_starter
03/10 15:46:43 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
03/10 15:46:43 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
03/10 15:46:43 ** $CondorVersion: 7.4.4 Oct 13 2010 BuildID: 279383 $
03/10 15:46:43 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
03/10 15:46:43 ** PID = 2498
03/10 15:46:43 ** Log last touched 3/10 11:38:23
03/10 15:46:43 ******************************************************
03/10 15:46:43 Using config source: /opt/condor-7.4.4/etc/condor_config
03/10 15:46:43 Using local config sources:
03/10 15:46:43    /opt/condor-7.4.4/local.kfc30/condor_config.local
03/10 15:46:43 DaemonCore: Command Socket at <10.0.22.95:56532>
03/10 15:46:43 Done setting resource limits
03/10 15:46:43 Communicating with shadow <10.0.22.58:60423>
03/10 15:46:43 Submitting machine is "xxx.jhu.edu"
03/10 15:46:43 setting the orig job name in starter
03/10 15:46:43 setting the orig job iwd in starter
Stack dump for process 2498 at timestamp 1299790003 (4 frames)
[0x474344]
[0x464fa8]
[0x472010]
/lib64/libnss_ldap.so.2(_nss_ldap_inc_depth+0xc)[0x2b3d901995ac]
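
For anyone comparing dumps across slots or machines, here is a minimal Python sketch (not part of Condor; the function name symbolic_frames and the regular expression are made up for illustration) that lists the frames in a stack dump that name a shared library. It assumes frames use the same "path(symbol+offset)[address]" layout as the dump above. In this dump the only named frame is in libnss_ldap.

```python
import re

# Assumed frame layout, taken from the dump above:
#   /lib64/libnss_ldap.so.2(_nss_ldap_inc_depth+0xc)[0x2b3d901995ac]
FRAME_RE = re.compile(
    r"^(?P<lib>/\S+?)"                  # absolute path of the shared library
    r"\((?P<sym>[^)+]*)"                # symbol name (may be empty)
    r"(?:\+(?P<off>0x[0-9a-fA-F]+))?\)"  # optional offset into the symbol
    r"\[(?P<addr>0x[0-9a-fA-F]+)\]$"    # absolute address
)


def symbolic_frames(log_text):
    """Return (library, symbol) pairs for frames that name a shared library.

    Bare-address frames such as "[0x474344]" are skipped.
    """
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((match.group("lib"), match.group("sym")))
    return frames


# The stack dump from StarterLog.slot1, copied verbatim.
sample = """\
Stack dump for process 2498 at timestamp 1299790003 (4 frames)
[0x474344]
[0x464fa8]
[0x472010]
/lib64/libnss_ldap.so.2(_nss_ldap_inc_depth+0xc)[0x2b3d901995ac]
"""

if __name__ == "__main__":
    for lib, sym in symbolic_frames(sample):
        print(lib, sym)
```

To run it against a real log, read the file contents and pass them in place of sample.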


StartLog

03/10 15:46:43 slot3: Changing activity: Busy -> Idle
03/10 15:47:12 Aborting CA_LOCATE_STARTER
03/10 15:47:12 ClaimId (<10.0.22.95:33055>#1299789952#2#852ae140f108488b8939c4cd389765209e760b23) and GlobalJobId ( xxx.jhu.edu#3382.4#1299688357 ) not found
03/10 15:47:42 Aborting CA_LOCATE_STARTER
03/10 15:47:42 ClaimId (<10.0.22.95:33055>#1299789952#4#5819ff7541cde00d3c92402f8386edbc69932829) and GlobalJobId ( xxx.jhu.edu#3382.6#1299688357 ) not found
03/10 15:48:58 Got SIGQUIT.  Performing fast shutdown.
03/10 15:48:58 shutdown fast
03/10 15:48:58 slot1: Changing state and activity: Claimed/Idle -> Preempting/Killing
03/10 15:48:58 slot1: State change: No preempting claim, returning to owner
03/10 15:48:58 slot1: Changing state and activity: Preempting/Killing -> Owner/Idle
03/10 15:48:58 slot1: State change: IS_OWNER is false
03/10 15:48:58 slot1: Changing state: Owner -> Unclaimed
03/10 15:48:58 slot2: Changing state and activity: Claimed/Idle -> Preempting/Killing
03/10 15:48:58 slot2: State change: No preempting claim, returning to owner
03/10 15:48:58 slot2: Changing state and activity: Preempting/Killing -> Owner/Idle
03/10 15:48:58 slot2: State change: IS_OWNER is false


MasterLog

03/10 15:45:52 ******************************************************
03/10 15:45:52 ** condor_master (CONDOR_MASTER) STARTING UP
03/10 15:45:52 ** /opt/condor-7.4.4/sbin/condor_master
03/10 15:45:52 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
03/10 15:45:52 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
03/10 15:45:52 ** $CondorVersion: 7.4.4 Oct 13 2010 BuildID: 279383 $
03/10 15:45:52 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
03/10 15:45:52 ** PID = 2430
03/10 15:45:52 ** Log last touched 3/10 11:38:23
03/10 15:45:52 ******************************************************
03/10 15:45:52 Using config source: /opt/condor-7.4.4/etc/condor_config
03/10 15:45:52 Using local config sources:
03/10 15:45:52    /opt/condor-7.4.4/local.kfc30/condor_config.local
03/10 15:45:52 DaemonCore: Command Socket at <10.0.22.95:49090>
03/10 15:45:52 Started DaemonCore process "/opt/condor-7.4.4/sbin/condor_schedd", pid and pgroup = 2431
03/10 15:45:52 Started DaemonCore process "/opt/condor-7.4.4/sbin/condor_startd", pid and pgroup = 2432
03/10 15:48:58 Got SIGQUIT.  Performing fast shutdown.
03/10 15:48:58 Sent SIGQUIT to SCHEDD (pid 2431)
03/10 15:48:58 Sent SIGQUIT to STARTD (pid 2432)
03/10 15:48:58 The SCHEDD (pid 2431) exited with status 0
03/10 15:48:58 The STARTD (pid 2432) exited with status 0
03/10 15:48:58 All daemons are gone.  Exiting.
03/10 15:48:58 **** condor_master (condor_MASTER) pid 2430 EXITING WITH STATUS 0

SchedLog

03/10 15:45:52 (pid:2431) ******************************************************
03/10 15:45:52 (pid:2431) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
03/10 15:45:52 (pid:2431) ** /opt/condor-7.4.4/sbin/condor_schedd
03/10 15:45:52 (pid:2431) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
03/10 15:45:52 (pid:2431) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
03/10 15:45:52 (pid:2431) ** $CondorVersion: 7.4.4 Oct 13 2010 BuildID: 279383 $
03/10 15:45:52 (pid:2431) ** $CondorPlatform: X86_64-LINUX_RHEL5 $
03/10 15:45:52 (pid:2431) ** PID = 2431
03/10 15:45:52 (pid:2431) ** Log last touched 3/10 11:38:23
03/10 15:45:52 (pid:2431) ******************************************************
03/10 15:45:52 (pid:2431) Using config source: /opt/condor-7.4.4/etc/condor_config
03/10 15:45:52 (pid:2431) Using local config sources:
03/10 15:45:52 (pid:2431)    /opt/condor-7.4.4/local.kfc30/condor_config.local
03/10 15:45:52 (pid:2431) DaemonCore: Command Socket at <10.0.22.95:53572>
03/10 15:45:52 (pid:2431) History file rotation is enabled.
03/10 15:45:52 (pid:2431)   Maximum history file size is: 20971520 bytes
03/10 15:45:52 (pid:2431)   Number of rotated history files is: 2
03/10 15:48:58 (pid:2431) Got SIGQUIT.  Performing fast shutdown.
03/10 15:48:58 (pid:2431) All shadows have been killed, exiting.
03/10 15:48:58 (pid:2431) **** condor_schedd (condor_SCHEDD) pid 2431 EXITING WITH STATUS 0

Bob
--

Robert V. Sigillito
Johns Hopkins Applied Physics Lab
11100 Johns Hopkins Road
Laurel, Maryland 20723
(240) 228-8468
(443) 778-8468