
Re: [HTCondor-users] Dedicated Scheduler Config to enable Parallel Jobs.



This is normal for the parallel universe.  The reason is that the execute nodes must be configured to respond to a single dedicated scheduler, so only jobs submitted to that scheduler will ever run on them.
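
To see which dedicated scheduler each execute node is advertising, a query along these lines may help. This is only a sketch: it assumes the nodes publish the DedicatedScheduler attribute through STARTD_ATTRS, as in the configuration quoted below, and that it is run from a machine allowed to read from the collector.

```
# List each slot together with the dedicated scheduler it advertises
condor_status -af Name DedicatedScheduler
```

Slots that do not advertise the attribute cannot be claimed by a dedicated scheduler, so parallel universe jobs will not run on them.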

 

You would split your execute nodes up by configuring ½ of them to use schedd A as the dedicated scheduler, and ½ to use schedd B as the dedicated scheduler.  Then you could submit jobs to either schedd A or schedd B, but those jobs would never be able to use more than ½ of the execute nodes.
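
As a rough sketch of that split, the execute-node configuration might look like the following. The hostnames schedd-a.example.com and schedd-b.example.com are placeholders, not names from your pool, and the rest of each node's policy (START, RANK, and so on) is assumed to stay as you already have it.

```
# On the half of the execute nodes that should serve schedd A
# (schedd-a.example.com is a placeholder hostname)
DedicatedScheduler = "DedicatedScheduler@schedd-a.example.com"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

# On the other half, which should serve schedd B
# (schedd-b.example.com is a placeholder hostname)
DedicatedScheduler = "DedicatedScheduler@schedd-b.example.com"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
```

Each node gets only one of the two stanzas, since a node can name only one dedicated scheduler.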

 

This is the same whether your schedd and/or execute nodes are Windows or Linux.

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Sofya Urbaniec
Sent: Tuesday, October 2, 2018 1:13 PM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] Dedicated Scheduler Config to enable Parallel Jobs.

 

I can run multi-processor jobs now, but in order to submit a multi-CPU parallel job I have to submit it from the dedicated scheduler, which in this case is the master. That means I have to log in to the remote machine and submit from there.

 

Is this behavior expected? Could it be because I am running on Windows and it has some limitations?

 

Thank you.


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Sofya Urbaniec <Sofya.Urbaniec@xxxxxxxxxx>
Sent: Wednesday, September 26, 2018 7:41:35 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Dedicated Scheduler Config to enable Parallel Jobs.

 

Hello,

 

I'm trying to configure a dedicated scheduler to enable parallel jobs on an HTCondor pool running on Windows.

 

I'm using Condor version 8.4.1

 

My condor_config on master:

 

 

######################################################################
##
##  condor_config
##
##  This is the global configuration file for condor. This is where
##  you define where the local config file is. Any settings
##  made here may potentially be overridden in the local configuration
##  file.  KEEP THAT IN MIND!  To double-check that a variable is
##  getting set from the configuration file that you expect, use
##  condor_config_val -v <variable name>
##
##  condor_config.annotated is a more detailed sample config file
##
##  Unless otherwise specified, settings that are commented out show
##  the defaults that are used if you don't define a value.  Settings
##  that are defined here MUST BE DEFINED since they have no default
##  value.
##
######################################################################

 

##  Where have you installed the bin, sbin and lib condor directories? 

 

 

RELEASE_DIR = C:\condor
 
LOCAL_DIR = $(RELEASE_DIR)
 
 
LOCAL_CONFIG_FILE = $(LOCAL_DIR)\condor_config.local
 
REQUIRE_LOCAL_CONFIG_FILE = TRUE
 
LOCAL_CONFIG_DIR = $(LOCAL_DIR)
 
#
SETTABLE_ATTRS_CONFIG = *
SETTABLE_ATTRS_OWNER = TDVERS
STARTD_ATTRS = COLLECTOR_HOST_STRING, TDVERS
 
CONDOR_HOST = $(FULL_HOSTNAME)
COLLECTOR_NAME = thermal
UID_DOMAIN = domain.com
CONDOR_ADMIN = condor_admin_svc@xxxxxxxxxx
SMTP_SERVER = smtp.domain.com
ALLOW_READ = *
ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS), *.domain.com
ALLOW_ADMINISTRATOR = $(IP_ADDRESS), *.domain.com
JAVA = C:\PROGRA~2\Java\JRE18~1.0_6\bin\java.exe
START = FALSE
WANT_VACATE = FALSE
WANT_SUSPEND = TRUE
 
#  Dedicated Scheduler Config to enable Parallel Jobs.
DedicatedScheduler = "DedicatedScheduler@<FQDN of master>"
STARTD_ATTRS = $(STARTD_ATTRS),DedicatedScheduler
 
DAEMON_LIST = MASTER SCHEDD COLLECTOR NEGOTIATOR 
 
# Space X Additional Configuration
MAX_JOBS_RUNNING=225
START_SCHEDULER_UNIVERSE = TotalSchedulerJobsRunning < 225
START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 225
 
CREDD_HOST = <FQDN of master>
CREDD_CACHE_LOCALLY = True
 
STARTER_ALLOW_RUNAS_OWNER = True
ALLOW_CONFIG = condor_admin_svc@*
HOSTALLOW_CONFIG = *.domain.com
SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD
SEC_CONFIG_NEGOTIATION = REQUIRED
SEC_CONFIG_AUTHENTICATION = REQUIRED
SEC_CONFIG_ENCRYPTION = REQUIRED
SEC_CONFIG_INTEGRITY = REQUIRED
 
I did condor_reconfig -all and condor_restart
I changed condor_config on two of the 9 nodes to see if it works. This is the condor_config from one of those nodes:
######################################################################
##
##  condor_config
##
##  This is the global configuration file for condor. This is where
##  you define where the local config file is. Any settings
##  made here may potentially be overridden in the local configuration
##  file.  KEEP THAT IN MIND!  To double-check that a variable is
##  getting set from the configuration file that you expect, use
##  condor_config_val -v <variable name>
##
##  condor_config.annotated is a more detailed sample config file
##
##  Unless otherwise specified, settings that are commented out show
##  the defaults that are used if you don't define a value.  Settings
##  that are defined here MUST BE DEFINED since they have no default
##  value.
##
######################################################################
 
##  Where have you installed the bin, sbin and lib condor directories?   
RELEASE_DIR = E:\condor
 
##  Where is the local condor directory for each host?  This is where the local config file(s), logs and
##  spool/execute directories are located. this is the default for Linux and Unix systems.
#LOCAL_DIR = $(TILDE)
##  this is the default on Windows systems
LOCAL_DIR = $(RELEASE_DIR)
 
##  Where is the machine-specific local config file for each host?
LOCAL_CONFIG_FILE = $(LOCAL_DIR)\condor_config.local
##  If your configuration is on a shared file system, then this might be a better default
#LOCAL_CONFIG_FILE = $(RELEASE_DIR)\etc\$(HOSTNAME).local
##  If the local config file is not present, is it an error? (WARNING: This is a potential security issue.)
REQUIRE_LOCAL_CONFIG_FILE = FALSE
 
##  The normal way to do configuration with RPMs is to read all of the
##  files in a given directory that don't match a regex as configuration files.
##  Config files are read in lexicographic order.
LOCAL_CONFIG_DIR = $(LOCAL_DIR)\config
#LOCAL_CONFIG_DIR_EXCLUDE_REGEXP = ^((\..*)|(.*~)|(#.*)|(.*\.rpmsave)|(.*\.rpmnew))$
 
##  Use a host-based security policy. By default CONDOR_HOST and the local machine will be allowed
use SECURITY : HOST_BASED
##  To expand your condor pool beyond a single host, set ALLOW_WRITE to match all of the hosts
#ALLOW_WRITE = *.cs.wisc.edu
##  FLOCK_FROM defines the machines that grant access to your pool via flocking. (i.e. these machines can join your pool).
#FLOCK_FROM =
##  FLOCK_TO defines the central managers that your schedd will advertise itself to (i.e. these pools will give matches to your schedd).
FLOCK_TO = <FQDN of Master>
 
##--------------------------------------------------------------------
## Values set by the condor_configure script:
##--------------------------------------------------------------------
JAVA = C:\Program Files (x86)\Java\jre7\bin\java.exe
CONDOR_HOST = <FQDN of Master>
 
UID_DOMAIN = domain.com
CONDOR_ADMIN = condor_admin_svc@xxxxxxxxxx
SMTP_SERVER = smtp.domain.com
ALLOW_READ = *
ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS), *.doamin.com
ALLOW_ADMINISTRATOR = $(IP_ADDRESS)
JAVA = C:\PROGRA~2\Java\JRE18~1.0_6\bin\java.exe
DAEMON_LIST = MASTER SCHEDD STARTD KBDD
 
# Dedicated Scheduler
DedicatedScheduler = "DedicatedScheduler@<FQDN of Master>"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
RANK_FACTOR = 10000
RANK = (Scheduler =?= $(DedicatedScheduler) * $(RANK_FACTOR))
 
# Space X Additional Configuration
CREDD_HOST = <FQDN of Master>
CREDD_CACHE_LOCALLY = True
STARTER_ALLOW_RUNAS_OWNER = True
ALLOW_CONFIG = condor_admin_svc@*
HOSTALLOW_CONFIG = $(IP_ADDRESS),*.domain.com
SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD
SEC_CONFIG_NEGOTIATION = REQUIRED
SEC_CONFIG_AUTHENTICATION = REQUIRED
SEC_CONFIG_ENCRYPTION = REQUIRED
SEC_CONFIG_INTEGRITY = REQUIRED
 
SLOTS_CONNECTED_TO_CONSOLE = 2
SLOTS_CONNECTED_TO_KEYBOARD = 2
NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
HighLoad = 1.0
BgndLoad = 0.3
CPU_Busy = ($(NonCondorLoadAvg) >= $(HighLoad))
CPU_Idle = ($(NonCondorLoadAvg) <= $(BgndLoad))
KeyboardBusy = (KeyboardIdle < 10)
MachineBusy = ($(CPU_Busy) || $(KeyboardBusy))
ActivityTimer = (CurrentTime - EnteredCurrentActivity)
START = $(CPU_Idle) && KeyboardIdle > 300
SUSPEND = $(MachineBusy)
CONTINUE = $(CPU_Idle) && KeyboardIdle > 120
PREEMPT = (Activity == "Suspended") && $(ActivityTimer) > 300
SUSPEND = Scheduler =!= $(DedicatedScheduler) && ($(SUSPEND))
PREEMPT = Scheduler =!= $(DedicatedScheduler) && ($(PREEMPT))
START = (Scheduler =?= $(DedicatedScheduler)) || ($(START))
KILL = $(ActivityTimer) > 300
 
SETTABLE_ATTRS_CONFIG = *
SETTABLE_ATTRS_OWNER = TDVERS
STARTD_ATTRS = COLLECTOR_HOST_STRING, TDVERS
TDVERS = "5.8"
I did condor_reconfig -all and condor_restart
 
But if I submit a parallel job, it stays stuck in the idle state forever.
 
This is an example of the job:
universe = parallel
should_transfer_files = Yes
when_to_transfer_output = ON_EXIT
notify_user = <email address>
machine_count = 1
request_cpus = 2
notification = Always
run_as_owner = true
getenv = true
log = sleep_log.txt
output = sleep_stdout.txt
error = sleep_stderr.txt
 
executable = sleep.bat
queue
Please advise.
Thank you.