
[Condor-users] problems with startup and executing a job



Dear all, I have two problems:

~~~~~~~~~~~~~~~~~~~~~~~Problem 1~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After starting condor_master as root on all machines in the pool, the
MasterLog on the central manager looks fine, but the one on the other
machine shows a problem:

10/20 09:48:17 ******************************************************
10/20 09:48:17 ** condor_master (CONDOR_MASTER) STARTING UP
10/20 09:48:17 ** /home/condor/condor/sbin/condor_master
10/20 09:48:17 ** $CondorVersion: 6.8.1 Sep 17 2006  $
10/20 09:48:17 ** $CondorPlatform: I386-LINUX_RHEL3 $
10/20 09:48:17 ** PID = 2768
10/20 09:48:17 ** Log last touched 10/18 17:52:01
10/20 09:48:17 ******************************************************
10/20 09:48:17 Using config source: /home/condor/condor/etc/condor_config
10/20 09:48:17 Using local config sources:
10/20 09:48:17    /home/condor/condor_config.local
10/20 09:48:17 DaemonCore: Command Socket at <129.254.187.125:42587>
10/20 09:48:17 Started DaemonCore process "/home/condor/condor/sbin/condor_startd", pid and pgroup = 2769
10/20 09:48:18 Started DaemonCore process "/home/condor/condor/sbin/condor_schedd", pid and pgroup = 2770
10/20 09:48:23 attempt to connect to <129.254.187.125:9618> failed: Connection refused (connect errno = 111).
10/20 09:48:23 ERROR: SECMAN:2003:TCP connection to <129.254.187.125:9618> failed

10/20 09:48:23 Failed to start non-blocking update to <129.254.187.125:9618>.


The IP address above is the local machine's own IP. Is that expected?
Can anybody give hints about the failed connection?
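
A quick way to test the refused connection outside of Condor is a plain
TCP connect to the same address and port. This is only a minimal sketch
in Python; the helper name port_is_open is illustrative and not part of
Condor, and the address is the one copied from the MasterLog above:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection raises OSError on refusal, timeout, or
        # an unreachable host.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call (address and port taken from the log above):
#   port_is_open("129.254.187.125", 9618)
```

If this returns False on the machine itself, nothing is listening on
that port at that address, which would match the "Connection refused"
line in the log.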


I have just restarted Condor with condor_master, and the MasterLog now
shows:

10/20 10:54:48 ******************************************************
10/20 10:54:48 ** condor_master (CONDOR_MASTER) STARTING UP
10/20 10:54:48 ** /home/condor/condor/sbin/condor_master
10/20 10:54:48 ** $CondorVersion: 6.8.1 Sep 17 2006  $
10/20 10:54:48 ** $CondorPlatform: I386-LINUX_RHEL3 $
10/20 10:54:48 ** PID = 3527
10/20 10:54:48 ** Log last touched 10/20 10:54:18
10/20 10:54:48 ******************************************************
10/20 10:54:48 Using config source: /home/condor/condor/etc/condor_config
10/20 10:54:48 Using local config sources:
10/20 10:54:48    /home/condor/condor_config.local
10/20 10:54:48 FileLock::obtain(1) failed - errno 11 (Resource temporarily unavailable)
10/20 10:54:48 ERROR "Can't get lock on "/home/condor/log/InstanceLock"" at line 976 in file master.C

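errno 11 (EAGAIN) on a non-blocking lock request normally means that
another process already holds the lock. A rough way to probe this is
sketched below in Python. It assumes the lock is a POSIX advisory lock
of the kind fcntl.lockf can see (an assumption, not checked against the
Condor source), and the helper name lock_is_held is illustrative. Note
that the probe itself takes the lock for a moment and needs write
permission on the file:

```python
import errno
import fcntl


def lock_is_held(path: str) -> bool:
    """Return True if another process holds a POSIX lock on path."""
    # "r+" because an exclusive lockf lock needs a writable descriptor.
    with open(path, "r+") as f:
        try:
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as e:
            if e.errno in (errno.EAGAIN, errno.EACCES):
                return True  # someone else has the lock
            raise
        # We got the lock, so nobody else held it; release it again.
        fcntl.lockf(f, fcntl.LOCK_UN)
        return False


# Example call (path taken from the log above):
#   lock_is_held("/home/condor/log/InstanceLock")
```

A True result would suggest that an earlier condor_master is still
running and holding the lock.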

~~~~~~~~~~~~~~~~~~~~~~~~~Problem 2~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With Problem 1 still unsolved, I submitted jobs on the central manager.
All the jobs stay idle and none of them run. The jobs' logs contain
only:

000 (007.000.000) 10/20 10:26:01 Job submitted from host: <129.254.175.78:46913>
...

Condor is installed with all manager/submit/execute functions on the
central manager, so I cannot work out what is causing this.


Thanks,