
Re: [HTCondor-users] Dedicated Scheduler Config to enable Parallel Jobs.



I can run multi-processor jobs now, but to submit a multi-CPU parallel job I have to submit it from the dedicated scheduler, which in this case is the master. That means I have to log in to the remote machine and submit from there.


Is this behavior expected? Could it be because I'm running on Windows, which has some limitations?


Thank you.


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Sofya Urbaniec <Sofya.Urbaniec@xxxxxxxxxx>
Sent: Wednesday, September 26, 2018 7:41:35 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Dedicated Scheduler Config to enable Parallel Jobs.
 

Hello,

I'm trying to configure a dedicated scheduler to enable parallel jobs on an HTCondor pool running on Windows.

I'm using Condor version 8.4.1.

My condor_config on master:
 
 
######################################################################
##
##  condor_config
##
##  This is the global configuration file for condor. This is where
##  you define where the local config file is. Any settings
##  made here may potentially be overridden in the local configuration
##  file.  KEEP THAT IN MIND!  To double-check that a variable is
##  getting set from the configuration file that you expect, use
##  condor_config_val -v <variable name>
##
##  condor_config.annotated is a more detailed sample config file
##
##  Unless otherwise specified, settings that are commented out show
##  the defaults that are used if you don't define a value.  Settings
##  that are defined here MUST BE DEFINED since they have no default
##  value.
##
######################################################################

##  Where have you installed the bin, sbin and lib condor directories? 

 
RELEASE_DIR = C:\condor

LOCAL_DIR = $(RELEASE_DIR)


LOCAL_CONFIG_FILE = $(LOCAL_DIR)\condor_config.local

REQUIRE_LOCAL_CONFIG_FILE = TRUE

LOCAL_CONFIG_DIR = $(LOCAL_DIR)

#
SETTABLE_ATTRS_CONFIG = *
SETTABLE_ATTRS_OWNER = TDVERS
STARTD_ATTRS = COLLECTOR_HOST_STRING, TDVERS

CONDOR_HOST = $(FULL_HOSTNAME)
COLLECTOR_NAME = thermal
UID_DOMAIN = domain.com
CONDOR_ADMIN = condor_admin_svc@xxxxxxxxxx
SMTP_SERVER = smtp.domain.com
ALLOW_READ = *
ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS), *.domain.com
ALLOW_ADMINISTRATOR = $(IP_ADDRESS), *.domain.com
JAVA = C:\PROGRA~2\Java\JRE18~1.0_6\bin\java.exe
START = FALSE
WANT_VACATE = FALSE
WANT_SUSPEND = TRUE

#  Dedicated Scheduler Config to enable Parallel Jobs.
DedicatedScheduler = "DedicatedScheduler@<FQDN of master>"
STARTD_ATTRS = $(STARTD_ATTRS),DedicatedScheduler

DAEMON_LIST = MASTER SCHEDD COLLECTOR NEGOTIATOR 

# Space X Additional Configuration
MAX_JOBS_RUNNING=225
START_SCHEDULER_UNIVERSE = TotalSchedulerJobsRunning < 225
START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 225

CREDD_HOST = <FQDN of master>
CREDD_CACHE_LOCALLY = True

STARTER_ALLOW_RUNAS_OWNER = True
ALLOW_CONFIG = condor_admin_svc@*
HOSTALLOW_CONFIG = *.domain.com
SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD
SEC_CONFIG_NEGOTIATION = REQUIRED
SEC_CONFIG_AUTHENTICATION = REQUIRED
SEC_CONFIG_ENCRYPTION = REQUIRED
SEC_CONFIG_INTEGRITY = REQUIRED

I ran condor_reconfig -all and condor_restart.
I changed condor_config on two of the 9 nodes to see whether it works. This is the condor_config from one of those nodes:
######################################################################
##
##  condor_config
##
##  This is the global configuration file for condor. This is where
##  you define where the local config file is. Any settings
##  made here may potentially be overridden in the local configuration
##  file.  KEEP THAT IN MIND!  To double-check that a variable is
##  getting set from the configuration file that you expect, use
##  condor_config_val -v <variable name>
##
##  condor_config.annotated is a more detailed sample config file
##
##  Unless otherwise specified, settings that are commented out show
##  the defaults that are used if you don't define a value.  Settings
##  that are defined here MUST BE DEFINED since they have no default
##  value.
##
######################################################################

##  Where have you installed the bin, sbin and lib condor directories?
RELEASE_DIR = E:\condor

##  Where is the local condor directory for each host?  This is where the local config file(s), logs and
##  spool/execute directories are located. this is the default for Linux and Unix systems.
#LOCAL_DIR = $(TILDE)
##  this is the default on Windows systems
LOCAL_DIR = $(RELEASE_DIR)

##  Where is the machine-specific local config file for each host?
LOCAL_CONFIG_FILE = $(LOCAL_DIR)\condor_config.local
##  If your configuration is on a shared file system, then this might be a better default
#LOCAL_CONFIG_FILE = $(RELEASE_DIR)\etc\$(HOSTNAME).local
##  If the local config file is not present, is it an error? (WARNING: This is a potential security issue.)
REQUIRE_LOCAL_CONFIG_FILE = FALSE

##  The normal way to do configuration with RPMs is to read all of the
##  files in a given directory that don't match a regex as configuration files.
##  Config files are read in lexicographic order.
LOCAL_CONFIG_DIR = $(LOCAL_DIR)\config
#LOCAL_CONFIG_DIR_EXCLUDE_REGEXP = ^((\..*)|(.*~)|(#.*)|(.*\.rpmsave)|(.*\.rpmnew))$

##  Use a host-based security policy. By default CONDOR_HOST and the local machine will be allowed
use SECURITY : HOST_BASED
##  To expand your condor pool beyond a single host, set ALLOW_WRITE to match all of the hosts
#ALLOW_WRITE = *.cs.wisc.edu
##  FLOCK_FROM defines the machines that grant access to your pool via flocking. (i.e. these machines can join your pool).
#FLOCK_FROM =
##  FLOCK_TO defines the central managers that your schedd will advertise itself to (i.e. these pools will give matches to your schedd).
FLOCK_TO = <FQDN of Master>

##--------------------------------------------------------------------
##  Values set by the condor_configure script:
##--------------------------------------------------------------------

JAVA = C:\Program Files (x86)\Java\jre7\bin\java.exe
CONDOR_HOST = <FQDN of Master>
UID_DOMAIN = domain.com
CONDOR_ADMIN = condor_admin_svc@xxxxxxxxxx
SMTP_SERVER = smtp.domain.com
ALLOW_READ = *
ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS), *.doamin.com
ALLOW_ADMINISTRATOR = $(IP_ADDRESS)
JAVA = C:\PROGRA~2\Java\JRE18~1.0_6\bin\java.exe
DAEMON_LIST = MASTER SCHEDD STARTD KBDD

# Dedicated Scheduler
DedicatedScheduler = "DedicatedScheduler@<FQDN of Master>"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
RANK_FACTOR = 10000
RANK = (Scheduler =?= $(DedicatedScheduler) * $(RANK_FACTOR))

# Space X Additional Configuration
CREDD_HOST = <FQDN of Master>
CREDD_CACHE_LOCALLY = True
STARTER_ALLOW_RUNAS_OWNER = True
ALLOW_CONFIG = condor_admin_svc@*
HOSTALLOW_CONFIG = $(IP_ADDRESS),*.domain.com
SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD
SEC_CONFIG_NEGOTIATION = REQUIRED
SEC_CONFIG_AUTHENTICATION = REQUIRED
SEC_CONFIG_ENCRYPTION = REQUIRED
SEC_CONFIG_INTEGRITY = REQUIRED

SLOTS_CONNECTED_TO_CONSOLE = 2
SLOTS_CONNECTED_TO_KEYBOARD = 2

NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
HighLoad = 1.0
BgndLoad = 0.3
CPU_Busy = ($(NonCondorLoadAvg) >= $(HighLoad))
CPU_Idle = ($(NonCondorLoadAvg) <= $(BgndLoad))
KeyboardBusy = (KeyboardIdle < 10)
MachineBusy = ($(CPU_Busy) || $(KeyboardBusy))
ActivityTimer = (CurrentTime - EnteredCurrentActivity)

START = $(CPU_Idle) && KeyboardIdle > 300
SUSPEND = $(MachineBusy)
CONTINUE = $(CPU_Idle) && KeyboardIdle > 120
PREEMPT = (Activity == "Suspended") && $(ActivityTimer) > 300

SUSPEND = Scheduler =!= $(DedicatedScheduler) && ($(SUSPEND))
PREEMPT = Scheduler =!= $(DedicatedScheduler) && ($(PREEMPT))
START = (Scheduler =?= $(DedicatedScheduler)) || ($(START))
KILL = $(ActivityTimer) > 300

SETTABLE_ATTRS_CONFIG = *
SETTABLE_ATTRS_OWNER = TDVERS
STARTD_ATTRS = COLLECTOR_HOST_STRING, TDVERS
TDVERS = "5.8"

I ran condor_reconfig -all and condor_restart.

But when I submit a parallel job, it stays stuck in the Idle state indefinitely.

This is an example of the job:

universe = parallel
should_transfer_files = Yes
when_to_transfer_output = ON_EXIT
notify_user = <email address>
machine_count = 1
request_cpus = 2
notification = Always
run_as_owner = true
getenv = true
log = sleep_log.txt
output = sleep_stdout.txt
error = sleep_stderr.txt
executable = sleep.bat
queue

Please advise.
Thank you.
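
A note on the node configuration quoted above, offered as something to check rather than a confirmed diagnosis. HTCondor reads configuration top to bottom, and a later assignment to a macro replaces the earlier value unless the new value references $(NAME) itself. In the node file as quoted, DedicatedScheduler is appended to STARTD_ATTRS, but a plain STARTD_ATTRS assignment appears again near the end, which would drop DedicatedScheduler from the list. This assumes the quoted order matches the file on disk and that no local config file changes it afterwards. The fragment below is a minimal sketch that reuses only settings already present in the node file, reordered so the append comes last:

```
# Sketch only: same settings as the quoted node config, reordered.
# Assumption: nothing in condor_config.local or LOCAL_CONFIG_DIR
# reassigns STARTD_ATTRS after this point.

# Base list first (this plain assignment replaces any earlier value).
STARTD_ATTRS = COLLECTOR_HOST_STRING, TDVERS
TDVERS = "5.8"

# Dedicated scheduler settings afterwards, so the append is kept.
DedicatedScheduler = "DedicatedScheduler@<FQDN of Master>"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
```

As the header of the config file itself suggests, running condor_config_val -v STARTD_ATTRS on the node shows which value is actually in effect and which file set it.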