
[Condor-users] problems with Solaris 10




Hi all,

I am building a cluster of Solaris machines. So far I have 10+ machines running Solaris 8 without problems. The problems start when I try to add Solaris 10 machines.

All machines share a condor user directory via NFS. In it I have created one directory for Solaris 8 and one for Solaris 10, each with bin, sbin, libexec and lib subdirectories, and each machine's config file points to the directory that matches its platform.

The Solaris 10 machines start the Condor daemons without a problem, but the logs show this error: ERROR: SECMAN:2003:TCP connection to <10.95.5.97:9618> failed

Here is an excerpt from the logs:

================ MasterLog ==========================

3/26 18:46:01 ProcAPI::buildFamily() Found daddypid on the system: 799

3/26 18:46:08 ProcAPI::buildFamily() Found daddypid on the system: 800

3/26 18:46:42 Getting monitoring info for pid 798

3/26 18:46:45 enter Daemons::UpdateCollector

3/26 18:46:45 Trying to update collector <10.95.5.97:9618>

3/26 18:46:45 Attempting to send update via UDP to collector vitorino.hi.inet <10.95.5.97:9618>

3/26 18:46:45 exit Daemons::UpdateCollector

3/26 18:46:52 enter Daemons::CheckForNewExecutable

3/26 18:46:52 Time stamp of running /home/usu/condor/condor_5.10/sbin/condor_master: 1167915061

3/26 18:46:52 GetTimeStamp returned: 1167915061

3/26 18:46:52 Time stamp of running /home/usu/condor/condor_5.10/sbin/condor_schedd: 1167915026

3/26 18:46:52 GetTimeStamp returned: 1167915026

3/26 18:46:52 Time stamp of running /home/usu/condor/condor_5.10/sbin/condor_startd: 1167915022

3/26 18:46:52 GetTimeStamp returned: 1167915022

3/26 18:46:52 exit Daemons::CheckForNewExecutable

3/26 18:47:01 ProcAPI::buildFamily() Found daddypid on the system: 799

3/26 18:47:06 attempt to connect to <10.95.5.97:9618> failed: Failed to set timeout..

3/26 18:47:06 ERROR: SECMAN:2003:TCP connection to <10.95.5.97:9618> failed

3/26 18:47:06 Failed to start non-blocking update to <10.95.5.97:9618>.

================= SchedLog =======================

3/26 18:37:26 (pid:799) attempt to connect to <10.95.5.97:9618> failed: Failed to set timeout..

3/26 18:37:26 (pid:799) ERROR: SECMAN:2003:TCP connection to <10.95.5.97:9618> failed

3/26 18:37:26 (pid:799) Failed to start non-blocking update to <10.95.5.97:9618>.

3/26 18:38:51 (pid:799) Getting monitoring info for pid 799

3/26 18:42:05 (pid:799) JobsRunning = 0

3/26 18:42:05 (pid:799) JobsIdle = 0

3/26 18:42:05 (pid:799) JobsHeld = 0

3/26 18:42:05 (pid:799) JobsRemoved = 0

3/26 18:42:05 (pid:799) LocalUniverseJobsRunning = 0

3/26 18:42:05 (pid:799) LocalUniverseJobsIdle = 0

3/26 18:42:05 (pid:799) SchedUniverseJobsRunning = 0

3/26 18:42:06 (pid:799) SchedUniverseJobsIdle = 0

3/26 18:42:06 (pid:799) N_Owners = 0

3/26 18:42:06 (pid:799) MaxJobsRunning = 200

3/26 18:42:06 (pid:799) Trying to update collector <10.95.5.97:9618>

3/26 18:42:06 (pid:799) Attempting to send update via UDP to collector vitorino.hi.inet <10.95.5.97:9618>

3/26 18:42:06 (pid:799) Sent HEART BEAT ad to 1 collectors. Number of submittors=0

3/26 18:42:06 (pid:799) ============ Begin clean_shadow_recs =============

3/26 18:42:06 (pid:799) ============ End clean_shadow_recs =============

3/26 18:42:06 (pid:799) -------- Begin starting jobs --------

3/26 18:42:06 (pid:799) -------- Done starting jobs --------

3/26 18:42:27 (pid:799) attempt to connect to <10.95.5.97:9618> failed: Failed to set timeout..

3/26 18:42:27 (pid:799) ERROR: SECMAN:2003:TCP connection to <10.95.5.97:9618> failed

================= StartLog =====================================

3/26 18:42:35 Failed to start non-blocking update to <10.95.5.97:9618>.

3/26 18:43:10 Getting monitoring info for pid 800

3/26 18:45:10 DaemonCore: in SendAliveToParent()

3/26 18:45:10 DaemonCore: attempting to connect to '<10.95.109.196:32853>'

3/26 18:47:10 Swap space: 818600

3/26 18:47:10 3635528 kbytes available for "/home/usu/condor/hosts/kang/execute"

3/26 18:47:10 Looking up RESERVED_DISK parameter

3/26 18:47:10 Reserving 5120 kbytes for file system

3/26 18:47:10 Disk space: 3630408

3/26 18:47:10 State change: IS_OWNER is TRUE

3/26 18:47:10 Changing state: Unclaimed -> Owner

3/26 18:47:11 Getting monitoring info for pid 800

3/26 18:47:15 Trying to update collector <10.95.5.97:9618>

3/26 18:47:15 Attempting to send update via UDP to collector vitorino.hi.inet <10.95.5.97:9618>

3/26 18:47:15 Sent update to 1 collector(s)

3/26 18:47:36 attempt to connect to <10.95.5.97:9618> failed: Failed to set timeout..

3/26 18:47:36 ERROR: SECMAN:2003:TCP connection to <10.95.5.97:9618> failed

3/26 18:47:36 Failed to start non-blocking update to <10.95.5.97:9618>.


A bit more info: condor_status runs on the Solaris 10 machines and lists all the other machines, but the Solaris 10 machines themselves do not appear in that list.
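
Since the logs show the UDP updates being sent while the TCP connection to the collector fails, a TCP check that does not involve Condor at all may help narrow this down. The following is only a minimal sketch in Python, assuming a Python 3 interpreter is available on the Solaris 10 machines; the helper name tcp_reachable is made up for illustration and is not part of Condor. The collector address and port are the ones from the logs above.

```python
import socket


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds
    within `timeout` seconds, False on refusal, timeout or any
    other socket-level error."""
    try:
        # create_connection applies the timeout to the connect attempt
        # and the `with` block closes the socket again on exit.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host and timeouts.
        return False
```

Calling tcp_reachable("10.95.5.97", 9618) on a Solaris 10 machine and on a working Solaris 8 machine would show whether plain TCP to the collector behaves differently on the two. If it returns True on Solaris 10 while the daemons still log the SECMAN error, the network path itself is probably not the cause.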

Thanks a lot for any help you can give me.