
Re: [HTCondor-users] Numerous, short jobs using HTCondor



Hello,

I had a very similar issue, and it seems to have been solved by adding the
following to my condor_config file:

JOB_RENICE_INCREMENT = 0
SYSAPI_GET_LOADAVG = False

WANT_VACATE_VANILLA = False
WANT_SUSPEND_VANILLA = False
START_VANILLA = True
SUSPEND_VANILLA = False
CONTINUE_VANILLA = True
PREEMPT_VANILLA = False
KILL_VANILLA = False

(see the "New to ht condor and have basic questions" thread on this mailing list)

Regards,

Mathieu


On 13/01/2016 17:23, Matthew Hinton wrote:
> Hi, 
> 
> We currently need to use HTCondor to run a large number (order 10k) of
> short jobs (taking approximately 10 seconds). I believe that HTCondor is
> not really designed for this, but these jobs are an adaptation of older
> jobs which take order minutes, against a new, more split, dataset, so we
> still need the resource management provided by HTCondor. 
> 
> I've had some fairly large issues getting tests of this to run in a
> reasonable time, so I was wondering whether there are any
> settings/configuration which I should be looking at to improve this.
> 
> Current condor version: 8.5.1, all systems on Ubuntu 14.04. 
> All jobs using vanilla universe. We have a single manager, which is used
> as the SCHEDD, COLLECTOR, NEGOTIATOR and then 5 STARTD nodes. 
> 
> Steps to reproduce:
> 
> Set up a dag containing 10,000 jobs, labelled "JOB x test.sub"
> where test.sub is:
> 
> executable = /bin/sleep
> arguments = 1
> universe = vanilla
> transfer_executable = false
> requirements = TARGET.Machine == "<machine with 48 slots>"
> queue
> 
> Submit that dag. 
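
For anyone reproducing this, the DAG input file can be generated rather than written by hand. This is only a minimal sketch: the job names (job0, job1, ...) and the output file name test.dag are assumptions for illustration, not part of the original test.

```python
from pathlib import Path


def dag_lines(n_jobs, submit_file="test.sub"):
    """Return one DAGMan 'JOB <name> <submit file>' line per job.

    The job names (job0, job1, ...) are hypothetical; any unique names work.
    """
    return [f"JOB job{i} {submit_file}" for i in range(n_jobs)]


def write_dag(path, n_jobs=10_000, submit_file="test.sub"):
    """Write the DAG input file, one JOB line per job, to the given path."""
    Path(path).write_text("\n".join(dag_lines(n_jobs, submit_file)) + "\n")
```

Calling write_dag("test.dag") writes 10,000 independent JOB lines with no PARENT/CHILD dependencies, and the file can then be submitted with condor_submit_dag test.dag.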
> 
> The real processing time of these jobs should be 10,000s / 48 slots
> which is under 3.5 minutes. 
> However, this dag takes approximately 30 mins to complete, meaning that
> the overhead for this (albeit extreme) example is around 900% of real
> processing time. 
> 
> We currently have DAGMAN_MAX_SUBMITS_PER_INTERVAL set to 200, but this
> doesn't seem to be the issue, since the jobs are in the schedd queue; they
> are just not taking the expected 1s to run. Instead we are seeing run
> times of up to 9 seconds. 
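
The numbers quoted above can be checked with a little arithmetic. This is a rough sketch that assumes all 48 slots stay busy the whole time and ignores the final partial wave of jobs; the function names are my own.

```python
def ideal_wall_time(n_jobs, job_seconds, n_slots):
    """Wall time in seconds if every slot is always busy and jobs have no overhead."""
    return n_jobs * job_seconds / n_slots


def overhead_percent(observed_seconds, ideal_seconds):
    """Extra time beyond the ideal, as a percentage of the ideal."""
    return 100.0 * (observed_seconds - ideal_seconds) / ideal_seconds


ideal = ideal_wall_time(10_000, 1, 48)          # 1 s jobs on 48 slots
observed = 30 * 60                              # the reported 30 minutes
at_nine_seconds = ideal_wall_time(10_000, 9, 48)  # if each job occupied a slot for 9 s

print(f"ideal: {ideal:.0f} s ({ideal / 60:.2f} min)")
print(f"observed / ideal: {observed / ideal:.2f}x")
print(f"overhead: {overhead_percent(observed, ideal):.0f}%")
print(f"at 9 s per job: {at_nine_seconds:.0f} s ({at_nine_seconds / 60:.2f} min)")
```

Under these assumptions the ideal time is about 208 s, and the observed 30 minutes is roughly 8.6 times that. If every job held its slot for the full 9 seconds, the total would come to about 31 minutes, which is close to what was observed, so the per-job run time alone may account for most of the delay.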
> 
> We see the same issue by changing the above submit file to "queue 10000"
> and submitting that. 
> 
> Please, could someone explain what is going on here which is taking so
> long? I would certainly expect some overhead, but this seems very high
> to me. If anyone has any suggestions on what to try to reduce this, then
> it would be greatly appreciated! 
> 
> Thanks,  
> 
> -- 
> *Matt Hinton*
> 
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
> 


-- 
tel : +33 (0)6 87 30 83 59
######################################################################
##
##  condor_config
##
##  This is the global configuration file for condor. This is where
##  you define where the local config file is. Any settings
##  made here may potentially be overridden in the local configuration
##  file.  KEEP THAT IN MIND!  To double-check that a variable is
##  getting set from the configuration file that you expect, use
##  condor_config_val -v <variable name>
##
##  condor_config.annotated is a more detailed sample config file
##
##  Unless otherwise specified, settings that are commented out show
##  the defaults that are used if you don't define a value.  Settings
##  that are defined here MUST BE DEFINED since they have no default
##  value.
##
######################################################################

##  Where have you installed the bin, sbin and lib condor directories?   
RELEASE_DIR = C:\condor

##  Where is the local condor directory for each host?  This is where the local config file(s), logs and
##  spool/execute directories are located. This is the default for Linux and Unix systems.
#LOCAL_DIR = $(TILDE)
##  This is the default on Windows systems
#LOCAL_DIR = $(RELEASE_DIR)

##  Where is the machine-specific local config file for each host?
LOCAL_CONFIG_FILE = $(LOCAL_DIR)\condor_config.local
##  If your configuration is on a shared file system, then this might be a better default
#LOCAL_CONFIG_FILE = $(RELEASE_DIR)\etc\$(HOSTNAME).local
##  If the local config file is not present, is it an error? (WARNING: This is a potential security issue.)
REQUIRE_LOCAL_CONFIG_FILE = FALSE

##  The normal way to do configuration with RPMs is to read all of the
##  files in a given directory that don't match a regex as configuration files.
##  Config files are read in lexicographic order.
LOCAL_CONFIG_DIR = $(LOCAL_DIR)\config
#LOCAL_CONFIG_DIR_EXCLUDE_REGEXP = ^((\..*)|(.*~)|(#.*)|(.*\.rpmsave)|(.*\.rpmnew))$

##  Use a host-based security policy. By default CONDOR_HOST and the local machine will be allowed
use SECURITY : HOST_BASED
##  To expand your condor pool beyond a single host, set ALLOW_WRITE to match all of the hosts
#ALLOW_WRITE = *.cs.wisc.edu
##  FLOCK_FROM defines the machines that grant access to your pool via flocking. (i.e. these machines can join your pool).
#FLOCK_FROM =
##  FLOCK_TO defines the central managers that your schedd will advertise itself to (i.e. these pools will give matches to your schedd).
#FLOCK_TO = condor.cs.wisc.edu, cm.example.edu

##--------------------------------------------------------------------
## Values set by the condor_configure script:
##--------------------------------------------------------------------

CONDOR_HOST = $(FULL_HOSTNAME)
NETWORK_INTERFACE = 192.168.1.181
COLLECTOR_NAME = ATLAS
UID_DOMAIN = 
CONDOR_ADMIN = 
SMTP_SERVER = 
ALLOW_READ = *
ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS)
ALLOW_ADMINISTRATOR = $(IP_ADDRESS)
JAVA = C:\PROGRA~2\Java\JRE18~1.0_6\bin\java.exe
use POLICY : ALWAYS_RUN_JOBS

WANT_VACATE = False
WANT_SUSPEND = False
START = True
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False

JOB_RENICE_INCREMENT = 0
SYSAPI_GET_LOADAVG = False

WANT_VACATE_VANILLA = False
WANT_SUSPEND_VANILLA = False
START_VANILLA = True
SUSPEND_VANILLA = False
CONTINUE_VANILLA = True
PREEMPT_VANILLA = False
KILL_VANILLA = False

NEGOTIATOR_CONSIDER_PREEMPTION = False

DAEMON_LIST = MASTER SCHEDD COLLECTOR NEGOTIATOR STARTD