
Re: [HTCondor-users] Dedicated Scheduler Config to enable Parallel Jobs.



Hi Sofya,

The line breaks in the config file you pasted from the execute machine got erased at some point in the email chain, so it's a little hard to parse. It looks like you have figured out that you have to set DedicatedScheduler on your execute machines, which is correct. (Just to note, on your master machine running the collector, negotiator, and schedd, you do not need to set DedicatedScheduler.) I assume that "<FQDN of Master>" in your config is actually replaced by the correct hostname?

When you say that you can run multiprocessor jobs now, does that mean you can run vanilla universe jobs with request_cpus = 2?

After you submit a parallel universe job, and while it's still idle, look at the SchedLog in your condor log directory (`condor_config_val log`). Are there any lines mentioning something like "10/02/18 13:23:20 (pid:39306) Found 1 potential dedicated resources in 0 seconds"?
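
If scanning the log by hand is tedious, a small script along these lines could pull out the relevant lines. This is only a sketch: the pattern is copied from the sample line quoted above and its wording may differ between HTCondor versions, and the helper names (`find_dedicated_lines`, `sched_log_path`, `main`) are made up for illustration.

```python
import re
import subprocess
from pathlib import Path

# Pattern taken from the sample SchedLog line quoted above. The exact wording
# is an assumption and may vary between HTCondor versions.
DEDICATED_RE = re.compile(r"Found (\d+) potential dedicated resources")


def find_dedicated_lines(lines):
    """Return (resource_count, line) pairs for lines matching DEDICATED_RE."""
    hits = []
    for line in lines:
        match = DEDICATED_RE.search(line)
        if match:
            hits.append((int(match.group(1)), line.rstrip()))
    return hits


def sched_log_path():
    """Ask HTCondor for its log directory and append the SchedLog file name."""
    log_dir = subprocess.run(
        ["condor_config_val", "log"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return Path(log_dir) / "SchedLog"


def main():
    """Print every matching SchedLog line with the resource count it reports."""
    with open(sched_log_path(), errors="replace") as handle:
        for count, line in find_dedicated_lines(handle):
            print(count, line)
```

Calling `main()` on the submit machine while the parallel job is idle would print the matching lines. No output at all, or a count of 0, would suggest the schedd is not seeing any machines that advertise it as their DedicatedScheduler.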

Jason Patton

On Tue, Oct 2, 2018 at 1:13 PM Sofya Urbaniec <Sofya.Urbaniec@xxxxxxxxxx> wrote:

I can run multi-processor jobs now, but in order to submit a multi-CPU parallel job I have to submit it from the dedicated scheduler, which in this case is the master. That means I have to log in to the remote machine and submit from there.


Is this behavior expected? Could it be because I'm running on Windows and it has some limitations?


Thank you.


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Sofya Urbaniec <Sofya.Urbaniec@xxxxxxxxxx>
Sent: Wednesday, September 26, 2018 7:41:35 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Dedicated Scheduler Config to enable Parallel Jobs.
 

Hello,

I'm trying to configure an HTCondor pool running on Windows to enable parallel jobs.

I'm using Condor version 8.4.1.

My condor_config on master:
 
 
######################################################################
##
##  condor_config
##
##  This is the global configuration file for condor. This is where
##  you define where the local config file is. Any settings
##  made here may potentially be overridden in the local configuration
##  file.  KEEP THAT IN MIND!  To double-check that a variable is
##  getting set from the configuration file that you expect, use
##  condor_config_val -v <variable name>
##
##  condor_config.annotated is a more detailed sample config file
##
##  Unless otherwise specified, settings that are commented out show
##  the defaults that are used if you don't define a value.  Settings
##  that are defined here MUST BE DEFINED since they have no default
##  value.
##
######################################################################

##  Where have you installed the bin, sbin and lib condor directories? 

 
RELEASE_DIR = C:\condor

LOCAL_DIR = $(RELEASE_DIR)


LOCAL_CONFIG_FILE = $(LOCAL_DIR)\condor_config.local

REQUIRE_LOCAL_CONFIG_FILE = TRUE

LOCAL_CONFIG_DIR = $(LOCAL_DIR)

#
SETTABLE_ATTRS_CONFIG = *
SETTABLE_ATTRS_OWNER = TDVERS
STARTD_ATTRS = COLLECTOR_HOST_STRING, TDVERS

CONDOR_HOST = $(FULL_HOSTNAME)
COLLECTOR_NAME = thermal
UID_DOMAIN = domain.com
CONDOR_ADMIN = condor_admin_svc@xxxxxxxxxx
SMTP_SERVER = smtp.domain.com
ALLOW_READ = *
ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS), *.domain.com
ALLOW_ADMINISTRATOR = $(IP_ADDRESS), *.domain.com
JAVA = C:\PROGRA~2\Java\JRE18~1.0_6\bin\java.exe
START = FALSE
WANT_VACATE = FALSE
WANT_SUSPEND = TRUE

#  Dedicated Scheduler Config to enable Parallel Jobs.
DedicatedScheduler = "DedicatedScheduler@<FQDN of master>"
STARTD_ATTRS = $(STARTD_ATTRS),DedicatedScheduler

DAEMON_LIST = MASTER SCHEDD COLLECTOR NEGOTIATOR 

# Space X Additional Configuration
MAX_JOBS_RUNNING=225
START_SCHEDULER_UNIVERSE = TotalSchedulerJobsRunning < 225
START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 225

CREDD_HOST = <FQDN of master>
CREDD_CACHE_LOCALLY = True

STARTER_ALLOW_RUNAS_OWNER = True
ALLOW_CONFIG = condor_admin_svc@*
HOSTALLOW_CONFIG = *.domain.com
SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD
SEC_CONFIG_NEGOTIATION = REQUIRED
SEC_CONFIG_AUTHENTICATION = REQUIRED
SEC_CONFIG_ENCRYPTION = REQUIRED
SEC_CONFIG_INTEGRITY = REQUIRED

I did condor_reconfig -all and condor_restart
I changed condor_config on two of the 9 nodes to see if it works. This is the condor_config from one of those nodes:
######################################################################
##
##  condor_config
##
##  This is the global configuration file for condor. This is where
##  you define where the local config file is. Any settings
##  made here may potentially be overridden in the local configuration
##  file.  KEEP THAT IN MIND!  To double-check that a variable is
##  getting set from the configuration file that you expect, use
##  condor_config_val -v <variable name>
##
##  condor_config.annotated is a more detailed sample config file
##
##  Unless otherwise specified, settings that are commented out show
##  the defaults that are used if you don't define a value.  Settings
##  that are defined here MUST BE DEFINED since they have no default
##  value.
##
######################################################################

##  Where have you installed the bin, sbin and lib condor directories?
RELEASE_DIR = E:\condor

##  Where is the local condor directory for each host?  This is where the local config file(s), logs and
##  spool/execute directories are located. this is the default for Linux and Unix systems.
#LOCAL_DIR = $(TILDE)
##  this is the default on Windows sytems
LOCAL_DIR = $(RELEASE_DIR)

##  Where is the machine-specific local config file for each host?
LOCAL_CONFIG_FILE = $(LOCAL_DIR)\condor_config.local
##  If your configuration is on a shared file system, then this might be a better default
#LOCAL_CONFIG_FILE = $(RELEASE_DIR)\etc\$(HOSTNAME).local
##  If the local config file is not present, is it an error? (WARNING: This is a potential security issue.)
REQUIRE_LOCAL_CONFIG_FILE = FALSE

##  The normal way to do configuration with RPMs is to read all of the
##  files in a given directory that don't match a regex as configuration files.
##  Config files are read in lexicographic order.
LOCAL_CONFIG_DIR = $(LOCAL_DIR)\config
#LOCAL_CONFIG_DIR_EXCLUDE_REGEXP = ^((\..*)|(.*~)|(#.*)|(.*\.rpmsave)|(.*\.rpmnew))$

##  Use a host-based security policy. By default CONDOR_HOST and the local machine will be allowed
use SECURITY : HOST_BASED
##  To expand your condor pool beyond a single host, set ALLOW_WRITE to match all of the hosts
#ALLOW_WRITE = *.cs.wisc.edu
##  FLOCK_FROM defines the machines that grant access to your pool via flocking. (i.e. these machines can join your pool).
#FLOCK_FROM =
##  FLOCK_TO defines the central managers that your schedd will advertise itself to (i.e. these pools will give matches to your schedd).
FLOCK_TO = <FQDN of Master>

##--------------------------------------------------------------------
##  Values set by the condor_configure script:
##--------------------------------------------------------------------

JAVA = C:\Program Files (x86)\Java\jre7\bin\java.exe
CONDOR_HOST = <FQDN of Master>
UID_DOMAIN = domain.com
CONDOR_ADMIN = condor_admin_svc@xxxxxxxxxx
SMTP_SERVER = smtp.domain.com
ALLOW_READ = *
ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS), *.doamin.com
ALLOW_ADMINISTRATOR = $(IP_ADDRESS)
JAVA = C:\PROGRA~2\Java\JRE18~1.0_6\bin\java.exe
DAEMON_LIST = MASTER SCHEDD STARTD KBDD

#  Dedicated Scheduler
DedicatedScheduler = "DedicatedScheduler@<FQDN of Master>"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
RANK_FACTOR = 10000
RANK = (Scheduler =?= $(DedicatedScheduler) * $(RANK_FACTOR))

# Space X Additional Configuration
CREDD_HOST = <FQDN of Master>
CREDD_CACHE_LOCALLY = True
STARTER_ALLOW_RUNAS_OWNER = True
ALLOW_CONFIG = condor_admin_svc@*
HOSTALLOW_CONFIG = $(IP_ADDRESS),*.domain.com
SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD
SEC_CONFIG_NEGOTIATION = REQUIRED
SEC_CONFIG_AUTHENTICATION = REQUIRED
SEC_CONFIG_ENCRYPTION = REQUIRED
SEC_CONFIG_INTEGRITY = REQUIRED

SLOTS_CONNECTED_TO_CONSOLE = 2
SLOTS_CONNECTED_TO_KEYBOARD = 2

NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
HighLoad = 1.0
BgndLoad = 0.3
CPU_Busy = ($(NonCondorLoadAvg) >= $(HighLoad))
CPU_Idle = ($(NonCondorLoadAvg) <= $(BgndLoad))
KeyboardBusy = (KeyboardIdle < 10)
MachineBusy = ($(CPU_Busy) || $(KeyboardBusy))
ActivityTimer = (CurrentTime - EnteredCurrentActivity)

START = $(CPU_Idle) && KeyboardIdle > 300
SUSPEND = $(MachineBusy)
CONTINUE = $(CPU_Idle) && KeyboardIdle > 120
PREEMPT = (Activity == "Suspended") && $(ActivityTimer) > 300
SUSPEND = Scheduler =!= $(DedicatedScheduler) && ($(SUSPEND))
PREEMPT = Scheduler =!= $(DedicatedScheduler) && ($(PREEMPT))
START = (Scheduler =?= $(DedicatedScheduler)) || ($(START))
KILL = $(ActivityTimer) > 300

SETTABLE_ATTRS_CONFIG = *
SETTABLE_ATTRS_OWNER = TDVERS
STARTD_ATTRS = COLLECTOR_HOST_STRING, TDVERS
TDVERS = "5.8"

I did condor_reconfig -all and condor_restart

But when I submit a parallel job, it stays stuck in the idle state forever.

This is an example of the job:

universe = parallel
should_transfer_files = Yes
when_to_transfer_output = ON_EXIT
notify_user = <email address>
machine_count = 1
request_cpus = 2
notification = Always
run_as_owner = true
getenv = true
log = sleep_log.txt
output = sleep_stdout.txt
error = sleep_stderr.txt
executable = sleep.bat
queue

Please advise.
Thank you. 

