[HTCondor-users] Getting CentOS6 node into CO7 cluster



Hi,

Apologies if this isn't the appropriate channel; this is my first post.

I have 1 master and 10 nodes, all on CentOS 7 with HTCondor 8.6.10.

I have to keep SL6.9 on one particular machine and want to include it in the cluster.
condor_status shows the SL6.9 machine's threads (slots) as available, but jobs never actually claim them (the job does run outside of condor on the SL6.9 machine).

slot8@fastpc30   LINUX      X86_64 Claimed   Busy      0.730 1970  0+00:00:03
slot1@fastpc31   LINUX      X86_64 Unclaimed Idle      0.610 1994  0+00:44:37
[...]

condor_q -better-analyze 6750

6750.1069:  Run analysis summary ignoring user priority.  Of 252 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
    244 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
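
As a sanity check on the summary above, the category counts can be tallied against the machine total. This is only a minimal sketch, assuming the summary lines keep the "<count> <description>" layout shown; the function name tally is made up for illustration and is not part of HTCondor.

```python
import re

# Summary lines copied from the condor_q -better-analyze output above.
SUMMARY = """\
6750.1069:  Run analysis summary ignoring user priority.  Of 252 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
    244 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
"""


def tally(summary):
    """Return (total, per-category counts, machines in no category)."""
    total = int(re.search(r"Of (\d+) machines", summary).group(1))
    counts = {}
    # Skip the header line; every other line is "<count> <description>".
    for line in summary.splitlines()[1:]:
        m = re.match(r"\s*(\d+)\s+(.*)", line)
        if m:
            counts[m.group(2).strip()] = int(m.group(1))
    return total, counts, total - sum(counts.values())


total, counts, unaccounted = tally(SUMMARY)
print(f"{total} machines, {sum(counts.values())} categorised, "
      f"{unaccounted} in no category")
```

For the output above this leaves 8 of the 252 machines in no category. The summary alone does not say which machines those are.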

----------------------
On the SL6.9 machine
----------------------

cat /var/log/messages | grep condor

Aug  7 14:34:34 fastpc31 yum[1962]: Installed: condor-8.6.10-1.el6.x86_64
Aug  7 14:35:21 fastpc31 htcondor: Not changing GLOBAL_MAX_FDS (/proc/sys/fs/file-max): new value (32768) <= old value (1606869).
Aug  7 14:35:21 fastpc31 htcondor: Not changing TCP_LISTEN_QUEUE (/proc/sys/net/core/somaxconn): new value (1024) <= old value (1024).
Aug  7 14:35:21 fastpc31 htcondor: Not changing ROOT_MAXKEYS (/proc/sys/kernel/keys/root_maxkeys): new value (1000000) <= old value (1000000).
Aug  7 14:35:21 fastpc31 htcondor: Not changing ROOT_MAXKEYS_BYTES (/proc/sys/kernel/keys/root_maxbytes): new value (25000000) <= old value (25000000).
Aug  7 14:35:21 fastpc31 htcondor: Changing FS_CACHE_DIRTY_BYTES (/proc/sys/vm/dirty_bytes) from 100000000 to 100000000
Aug  7 14:35:21 fastpc31 htcondor: Not changing MAX_RECEIVE_BUFFER (/proc/sys/net/core/rmem_max): new value (10485760) <= old value (10485760).

Version     : 8.6.10 (installed the same version as the host, as I had this issue with 8.7.9)
id condor
uid=990(condor) gid=985(condor) groups=985(condor)
(same for all in cluster)

On SL6.9 (kernel 2.6.32-754.2.1.el6.x86_64) node:

service condor status
condor_master (pid  2004) is running....

cat /var/log/condor/MasterLog

08/07/18 14:35:21 ******************************************************
08/07/18 14:35:21 ** condor_master (CONDOR_MASTER) STARTING UP
08/07/18 14:35:21 ** /usr/sbin/condor_master
08/07/18 14:35:21 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
08/07/18 14:35:21 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
08/07/18 14:35:21 ** $CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $
08/07/18 14:35:21 ** $CondorPlatform: x86_64_RedHat6 $
08/07/18 14:35:21 ** PID = 2004
08/07/18 14:35:21 ** Log last touched 8/7 14:22:46
08/07/18 14:35:21 ******************************************************
08/07/18 14:35:21 Using config source: /etc/condor/condor_config
08/07/18 14:35:21 Using local config sources:
08/07/18 14:35:21    /etc/condor/config.d/condor_execute_fastpc31.config
08/07/18 14:35:21    /etc/condor/condor_config.local
08/07/18 14:35:21 config Macros = 74, Sorted = 74, StringBytes = 1852, TablesBytes = 2712
08/07/18 14:35:21 CLASSAD_CACHING is OFF
08/07/18 14:35:21 Daemon Log is logging: D_ALWAYS D_ERROR
08/07/18 14:35:22 SharedPortEndpoint: waiting for connections to named socket 2004_7849
08/07/18 14:35:22 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
08/07/18 14:35:22 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
08/07/18 14:35:22 DaemonCore: private command socket at <10.0.0.31:0?sock=2004_7849>
08/07/18 14:35:22 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
08/07/18 14:35:22 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1520893905)
08/07/18 14:35:22 Collector port not defined, will use default: 9618
08/07/18 14:35:22 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 2037
08/07/18 14:35:22 Waiting for /var/lock/condor/shared_port_ad to appear.
08/07/18 14:35:23 Found /var/lock/condor/shared_port_ad.
08/07/18 14:35:23 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 2038
08/07/18 14:35:33 Setting ready state 'Ready' for STARTD

Which looks OK to me. Does anyone have suggestions?

Thanks,
Ben
----------------------------------------------------------------------------
   Ben Pietras <ben.pietras@xxxxxxxxxxxxxxxx>
   School of Physics and Astronomy,   Tel.  0161-275-4231
   The University of Manchester,          Fax. 0161-275-5509
   Manchester, M13 9PL.
----------------------------------------------------------------------------