
Re: [HTCondor-users] Getting CentOS6 node into CO7 cluster



Hi Ben,

Have you tried whether it works when you explicitly request an SL6 node?
   ...
   requirements   = OpSysAndVer == "SL6"
   ...
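
To isolate this from the rest of your workload, a throwaway test job could look roughly like the sketch below. The file names and the choice of /bin/hostname are only assumptions for illustration; transfer_executable = false is set so that the node's own binary is used instead of a CentOS7 copy shipped from the submit host.

```
# Minimal test submit file (a sketch; file names are placeholders)
universe            = vanilla
# Assumed test payload: prints the name of the node the job landed on
executable          = /bin/hostname
# Use the binary already installed on the execute node
transfer_executable = false
output              = sl6_test.$(Cluster).$(Process).out
error               = sl6_test.$(Cluster).$(Process).err
log                 = sl6_test.log
# Only match machines that advertise themselves as Scientific Linux 6
requirements        = OpSysAndVer == "SL6"
queue
```

If that job stays idle as well, "condor_status -af Name OpSysAndVer" shows what each slot actually advertises, so you can check that the string in the requirements matches.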

Cheers,
  Thomas

On 2018-08-08 10:27, Ben Pietras wrote:
> Hi,
> 
> Apologies if this isn't the appropriate channel; my first post.
> 
> I have 1 master and 10 nodes all on CentOS7, HTCondor 8.6.10
> 
> I have to keep SL6.9 on one particular machine and want to include it in the cluster.
> condor_status shows the SL6.9 machine's slots as available, but they are never actually claimed (the job does run outside of condor on the SL6.9 machine).
> 
> slot8@fastpc30   LINUX      X86_64 Claimed   Busy      0.730 1970  0+00:00:03
> slot1@fastpc31   LINUX      X86_64 Unclaimed Idle      0.610 1994  0+00:44:37
> [...]
> 
> condor_q -better-analyze 6750
> 
> 6750.1069:  Run analysis summary ignoring user priority.  Of 252 machines,
>       0 are rejected by your job's requirements
>       0 reject your job because of their own requirements
>     244 match and are already running your jobs
>       0 match but are serving other users
>       0 are available to run your job
> 
> ----------------------
> On the SL6.9 machine
> ----------------------
> 
> cat /var/log/messages | grep condor
> 
> Aug  7 14:34:34 fastpc31 yum[1962]: Installed: condor-8.6.10-1.el6.x86_64
> Aug  7 14:35:21 fastpc31 htcondor: Not changing GLOBAL_MAX_FDS (/proc/sys/fs/file-max): new value (32768) <= old value (1606869).
> Aug  7 14:35:21 fastpc31 htcondor: Not changing TCP_LISTEN_QUEUE (/proc/sys/net/core/somaxconn): new value (1024) <= old value (1024).
> Aug  7 14:35:21 fastpc31 htcondor: Not changing ROOT_MAXKEYS (/proc/sys/kernel/keys/root_maxkeys): new value (1000000) <= old value (1000000).
> Aug  7 14:35:21 fastpc31 htcondor: Not changing ROOT_MAXKEYS_BYTES (/proc/sys/kernel/keys/root_maxbytes): new value (25000000) <= old value (25000000).
> Aug  7 14:35:21 fastpc31 htcondor: Changing FS_CACHE_DIRTY_BYTES (/proc/sys/vm/dirty_bytes) from 100000000 to 100000000
> Aug  7 14:35:21 fastpc31 htcondor: Not changing MAX_RECEIVE_BUFFER (/proc/sys/net/core/rmem_max): new value (10485760) <= old value (10485760).
> 
> Version     : 8.6.10 (Installed same version as host, as had this issue with 8.7.9)
> id condor
> uid=990(condor) gid=985(condor) groups=985(condor)
> (same for all in cluster)
> 
> On SL6.9 (kernel 2.6.32-754.2.1.el6.x86_64) node:
> 
> service condor status
> condor_master (pid  2004) is running....
> 
> cat /var/log/condor/MasterLog
> 
> 08/07/18 14:35:21 ******************************************************
> 08/07/18 14:35:21 ** condor_master (CONDOR_MASTER) STARTING UP
> 08/07/18 14:35:21 ** /usr/sbin/condor_master
> 08/07/18 14:35:21 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
> 08/07/18 14:35:21 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
> 08/07/18 14:35:21 ** $CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $
> 08/07/18 14:35:21 ** $CondorPlatform: x86_64_RedHat6 $
> 08/07/18 14:35:21 ** PID = 2004
> 08/07/18 14:35:21 ** Log last touched 8/7 14:22:46
> 08/07/18 14:35:21 ******************************************************
> 08/07/18 14:35:21 Using config source: /etc/condor/condor_config
> 08/07/18 14:35:21 Using local config sources:
> 08/07/18 14:35:21    /etc/condor/config.d/condor_execute_fastpc31.config
> 08/07/18 14:35:21    /etc/condor/condor_config.local
> 08/07/18 14:35:21 config Macros = 74, Sorted = 74, StringBytes = 1852, TablesBytes = 2712
> 08/07/18 14:35:21 CLASSAD_CACHING is OFF
> 08/07/18 14:35:21 Daemon Log is logging: D_ALWAYS D_ERROR
> 08/07/18 14:35:22 SharedPortEndpoint: waiting for connections to named socket 2004_7849
> 08/07/18 14:35:22 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
> 08/07/18 14:35:22 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
> 08/07/18 14:35:22 DaemonCore: private command socket at <10.0.0.31:0?sock=2004_7849>
> 08/07/18 14:35:22 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
> 08/07/18 14:35:22 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1520893905)
> 08/07/18 14:35:22 Collector port not defined, will use default: 9618
> 08/07/18 14:35:22 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 2037
> 08/07/18 14:35:22 Waiting for /var/lock/condor/shared_port_ad to appear.
> 08/07/18 14:35:23 Found /var/lock/condor/shared_port_ad.
> 08/07/18 14:35:23 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 2038
> 08/07/18 14:35:33 Setting ready state 'Ready' for STARTD
> 
> Which looks OK to me. Does anyone have suggestions?
> 
> Thanks,
> Ben
> ----------------------------------------------------------------------------
>    Ben Pietras <ben.pietras@xxxxxxxxxxxxxxxx>
>    School of Physics and Astronomy,   Tel.  0161-275-4231
>    The University of Manchester,          Fax. 0161-275-5509
>    Manchester, M13 9PL.
> ----------------------------------------------------------------------------