
Re: [HTCondor-users] Dedicated Scheduler Config to enable Parallel Jobs.



Hi John,


Thank you for your reply. 


Even if I have 2 dedicated schedulers and split my nodes between them, I will still have to log in to them and submit from there, correct? I'm trying to avoid this and be able to submit multi-process runs from my local machine, the same way I submit non-parallel-universe jobs.
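
(I know condor_submit can target a remote schedd and spool the input files there, something along these lines, with sleep.sub standing in for my submit file:

condor_submit -remote <FQDN of master> sleep.sub

but I'd like ordinary submits from my machine to work the way they already do for non-parallel jobs.)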


 Thank you,

Sofya


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: Wednesday, October 3, 2018 12:17:02 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Dedicated Scheduler Config to enable Parallel Jobs.
 

This is normal for the parallel universe.  The reason is that the execute nodes must be configured to respond to a single dedicated scheduler, so only jobs submitted to that scheduler will ever run there.
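
In config terms, each execute node names exactly one dedicated scheduler, just as in your config (a sketch; schedd.domain.com is a placeholder for your schedd's FQDN):

DedicatedScheduler = "DedicatedScheduler@schedd.domain.com"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler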

 

You would split your execute nodes up by configuring half of them to use schedd A as the dedicated scheduler, and half to use schedd B as the dedicated scheduler.  Then you could submit jobs to either schedd A or schedd B, but those jobs would never be able to use more than half of the execute nodes.
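
Sketched out, with schedd-a and schedd-b as placeholder host names:

# condor_config on the first half of the execute nodes
DedicatedScheduler = "DedicatedScheduler@schedd-a.domain.com"

# condor_config on the second half
DedicatedScheduler = "DedicatedScheduler@schedd-b.domain.com"

You would then pick which half of the pool a job can use at submit time, e.g. condor_submit -name schedd-a.domain.com.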

 

This is the same whether your schedd and/or execute nodes are Windows or Linux.

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Sofya Urbaniec
Sent: Tuesday, October 2, 2018 1:13 PM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] Dedicated Scheduler Config to enable Parallel Jobs.

 

I can run multi-processor jobs now, but in order to submit a multi-CPU parallel job I have to submit it from the dedicated scheduler, in this case the master. That means I have to log in to the remote machine and submit from there.

 

Is this behavior expected? Could it be because I'm running it on Windows and Windows has some limitations?

 

Thank you.


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Sofya Urbaniec <Sofya.Urbaniec@xxxxxxxxxx>
Sent: Wednesday, September 26, 2018 7:41:35 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Dedicated Scheduler Config to enable Parallel Jobs.

 

Hello,

 

I'm trying to configure a dedicated scheduler to enable parallel jobs on an HTCondor pool running on Windows.

 

I'm using Condor version 8.4.1.

 

My condor_config on master:

 

 

######################################################################
##
##  condor_config
##
##  This is the global configuration file for condor. This is where
##  you define where the local config file is. Any settings
##  made here may potentially be overridden in the local configuration
##  file.  KEEP THAT IN MIND!  To double-check that a variable is
##  getting set from the configuration file that you expect, use
##  condor_config_val -v <variable name>
##
##  condor_config.annotated is a more detailed sample config file
##
##  Unless otherwise specified, settings that are commented out show
##  the defaults that are used if you don't define a value.  Settings
##  that are defined here MUST BE DEFINED since they have no default
##  value.
##
######################################################################

##  Where have you installed the bin, sbin and lib condor directories?
RELEASE_DIR = C:\condor
LOCAL_DIR = $(RELEASE_DIR)

LOCAL_CONFIG_FILE = $(LOCAL_DIR)\condor_config.local
REQUIRE_LOCAL_CONFIG_FILE = TRUE
LOCAL_CONFIG_DIR = $(LOCAL_DIR)

SETTABLE_ATTRS_CONFIG = *
SETTABLE_ATTRS_OWNER = TDVERS
STARTD_ATTRS = COLLECTOR_HOST_STRING, TDVERS

CONDOR_HOST = $(FULL_HOSTNAME)
COLLECTOR_NAME = thermal
UID_DOMAIN = domain.com
CONDOR_ADMIN = condor_admin_svc@xxxxxxxxxx
SMTP_SERVER = smtp.domain.com
ALLOW_READ = *
ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS), *.domain.com
ALLOW_ADMINISTRATOR = $(IP_ADDRESS), *.domain.com
JAVA = C:\PROGRA~2\Java\JRE18~1.0_6\bin\java.exe
START = FALSE
WANT_VACATE = FALSE
WANT_SUSPEND = TRUE

#  Dedicated Scheduler Config to enable Parallel Jobs.
DedicatedScheduler = "DedicatedScheduler@<FQDN of master>"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

DAEMON_LIST = MASTER SCHEDD COLLECTOR NEGOTIATOR

# Space X Additional Configuration
MAX_JOBS_RUNNING = 225
START_SCHEDULER_UNIVERSE = TotalSchedulerJobsRunning < 225
START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 225

CREDD_HOST = <FQDN of master>
CREDD_CACHE_LOCALLY = True

STARTER_ALLOW_RUNAS_OWNER = True
ALLOW_CONFIG = condor_admin_svc@*
HOSTALLOW_CONFIG = *.domain.com
SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD
SEC_CONFIG_NEGOTIATION = REQUIRED
SEC_CONFIG_AUTHENTICATION = REQUIRED
SEC_CONFIG_ENCRYPTION = REQUIRED
SEC_CONFIG_INTEGRITY = REQUIRED
 
I ran condor_reconfig -all and condor_restart.
I changed condor_config on two nodes out of 9 to see if it works. Here is the condor_config from one of those nodes:
######################################################################
##
##  condor_config
##
##  This is the global configuration file for condor. This is where
##  you define where the local config file is. Any settings
##  made here may potentially be overridden in the local configuration
##  file.  KEEP THAT IN MIND!  To double-check that a variable is
##  getting set from the configuration file that you expect, use
##  condor_config_val -v <variable name>
##
##  condor_config.annotated is a more detailed sample config file
##
##  Unless otherwise specified, settings that are commented out show
##  the defaults that are used if you don't define a value.  Settings
##  that are defined here MUST BE DEFINED since they have no default
##  value.
##
######################################################################
 
##  Where have you installed the bin, sbin and lib condor directories?   
RELEASE_DIR = E:\condor
 
##  Where is the local condor directory for each host?  This is where the local config file(s), logs and
##  spool/execute directories are located. This is the default for Linux and Unix systems.
#LOCAL_DIR = $(TILDE)
##  This is the default on Windows systems
LOCAL_DIR = $(RELEASE_DIR)
 
##  Where is the machine-specific local config file for each host?
LOCAL_CONFIG_FILE = $(LOCAL_DIR)\condor_config.local
##  If your configuration is on a shared file system, then this might be a better default
#LOCAL_CONFIG_FILE = $(RELEASE_DIR)\etc\$(HOSTNAME).local
##  If the local config file is not present, is it an error? (WARNING: This is a potential security issue.)
REQUIRE_LOCAL_CONFIG_FILE = FALSE
 
##  The normal way to do configuration with RPMs is to read all of the
##  files in a given directory that don't match a regex as configuration files.
##  Config files are read in lexicographic order.
LOCAL_CONFIG_DIR = $(LOCAL_DIR)\config
#LOCAL_CONFIG_DIR_EXCLUDE_REGEXP = ^((\..*)|(.*~)|(#.*)|(.*\.rpmsave)|(.*\.rpmnew))$
 
##  Use a host-based security policy. By default CONDOR_HOST and the local machine will be allowed
use SECURITY : HOST_BASED
##  To expand your condor pool beyond a single host, set ALLOW_WRITE to match all of the hosts
#ALLOW_WRITE = *.cs.wisc.edu
##  FLOCK_FROM defines the machines that grant access to your pool via flocking. (i.e. these machines can join your pool).
#FLOCK_FROM =
##  FLOCK_TO defines the central managers that your schedd will advertise itself to (i.e. these pools will give matches to your schedd).
FLOCK_TO = <FQDN of Master>
 
##--------------------------------------------------------------------
## Values set by the condor_configure script:
##--------------------------------------------------------------------
JAVA = C:\Program Files (x86)\Java\jre7\bin\java.exe
CONDOR_HOST = <FQDN of Master>
 
UID_DOMAIN = domain.com
CONDOR_ADMIN = condor_admin_svc@xxxxxxxxxx
SMTP_SERVER = smtp.domain.com
ALLOW_READ = *
ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS), *.domain.com
ALLOW_ADMINISTRATOR = $(IP_ADDRESS)
JAVA = C:\PROGRA~2\Java\JRE18~1.0_6\bin\java.exe
DAEMON_LIST = MASTER SCHEDD STARTD KBDD

# Dedicated Scheduler
DedicatedScheduler = "DedicatedScheduler@<FQDN of Master>"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
RANK_FACTOR = 10000
RANK = (Scheduler =?= $(DedicatedScheduler) * $(RANK_FACTOR))

# Space X Additional Configuration
CREDD_HOST = <FQDN of Master>
CREDD_CACHE_LOCALLY = True
STARTER_ALLOW_RUNAS_OWNER = True
ALLOW_CONFIG = condor_admin_svc@*
HOSTALLOW_CONFIG = $(IP_ADDRESS), *.domain.com
SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD
SEC_CONFIG_NEGOTIATION = REQUIRED
SEC_CONFIG_AUTHENTICATION = REQUIRED
SEC_CONFIG_ENCRYPTION = REQUIRED
SEC_CONFIG_INTEGRITY = REQUIRED

SLOTS_CONNECTED_TO_CONSOLE = 2
SLOTS_CONNECTED_TO_KEYBOARD = 2
NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
HighLoad = 1.0
BgndLoad = 0.3
CPU_Busy = ($(NonCondorLoadAvg) >= $(HighLoad))
CPU_Idle = ($(NonCondorLoadAvg) <= $(BgndLoad))
KeyboardBusy = (KeyboardIdle < 10)
MachineBusy = ($(CPU_Busy) || $(KeyboardBusy))
ActivityTimer = (CurrentTime - EnteredCurrentActivity)
START = $(CPU_Idle) && KeyboardIdle > 300
SUSPEND = $(MachineBusy)
CONTINUE = $(CPU_Idle) && KeyboardIdle > 120
PREEMPT = (Activity == "Suspended") && $(ActivityTimer) > 300
SUSPEND = Scheduler =!= $(DedicatedScheduler) && ($(SUSPEND))
PREEMPT = Scheduler =!= $(DedicatedScheduler) && ($(PREEMPT))
START = (Scheduler =?= $(DedicatedScheduler)) || ($(START))
KILL = $(ActivityTimer) > 300

SETTABLE_ATTRS_CONFIG = *
SETTABLE_ATTRS_OWNER = TDVERS
STARTD_ATTRS = COLLECTOR_HOST_STRING, TDVERS
TDVERS = "5.8"
I ran condor_reconfig -all and condor_restart.
 
But when I submit a parallel job, it stays stuck in the idle state forever.
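
Is something like the following the right way to check what's wrong? (<job id> stands in for the idle job's cluster id; the second command is meant to confirm the startd is actually advertising the DedicatedScheduler attribute.)

condor_q -better-analyze <job id>
condor_status -long <execute node> | findstr DedicatedScheduler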
 
This is an example of the job:
universe = parallel
should_transfer_files = Yes
when_to_transfer_output = ON_EXIT
notify_user = <email address>
machine_count = 1
request_cpus = 2
notification = Always
run_as_owner = true
getenv = true
log = sleep_log.txt
output = sleep_stdout.txt
error = sleep_stderr.txt
 
executable = sleep.bat
queue
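
sleep.bat itself is just a stub along these lines (the exact contents shouldn't matter):

:: wait roughly 60 seconds
ping -n 61 127.0.0.1 > nul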
Please advise.
Thank you.