Re: [HTCondor-users] Error submitting a large amount of jobs at the same time, multiple times.
- Date: Thu, 13 Dec 2012 16:43:44 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Error submitting a large amount of jobs at the same time, multiple times.
On 12/13/2012 4:03 PM, Michael Aschenbeck wrote:
> I have a program that automatically makes a submit file, submits it, and
> processes the results. Executing it once works fine and the output is
> as expected. Recently, I'm trying to add support for running 3+
> instances of this program at once, and running into errors when it
> submits the submit files.
>
> Details are as follows. There are 99 "queue" statements in each submit
> file. Even when only submitting one of these files, it takes a very long
> time (on the order of 5 minutes) to go from a "submitting jobs"
> notification to the final output of "submitting jobs............ 99 jobs
> submitted to cluster x".
Before even getting into the 3+ simultaneous instances, I'd suggest getting
to the bottom of why submitting 99 jobs from one instance of condor_submit
takes 5 minutes. On my Windows 7 laptop it takes a couple of seconds.
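For reference, the sort of minimal submit file I'd expect to finish in a
couple of seconds looks something like this (names and paths are
placeholders, and a single "queue 99" is equivalent to 99 separate queue
statements with the same settings):

  universe = vanilla
  executable = myprog.exe
  arguments = $(Process)
  output = out.$(Process).txt
  error = err.$(Process).txt
  log = jobs.log
  queue 99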
When you say it takes 5+ minutes, does it take this long submitting into
an empty queue? If it is taking this long submitting to a schedd that
already has jobs queued, how many of those jobs are running, and how long
do they typically run? Do your jobs run for only a couple of seconds?
One guess for the slowness: if you have many jobs completing every
second, either the condor_schedd process or the disk file system holding
the job queue could be swamped. If this is the case, the best thing to do
would be to partition your work so that you have slightly fewer jobs that
each run slightly longer, or you could try tweaks like setting the
following in your condor_config:
CONDOR_FSYNC = False
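(The trade-off to understand before flipping that knob: with fsync
disabled, the schedd no longer forces job-queue writes out to disk, so a
crash or power loss on the submit machine can lose or corrupt recently
queued job state.)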
For really busy production submit points (we are talking thousands of
simultaneously running jobs from one submit machine), some folks go with
an SSD to hold the SPOOL directory, or at least the job queue itself. See
the condor_config knob "JOB_QUEUE_LOG" in the HTCondor manual.
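For example, to move just the job queue log onto an SSD (the path below
is hypothetical; point it at whatever SSD-backed filesystem you have):

  JOB_QUEUE_LOG = /ssd/condor/job_queue.log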
Another guess: perhaps you are telling HTCondor to spool your executable
or input files, e.g. via the "-remote" or "-spool" option to
condor_submit? Spooling input files to a submit machine running on
Windows can currently be slow, because input spooling on Windows blocks
the schedd from doing other work (like accepting new job submits); on
Linux things are much faster because the input spooling occurs in a
child process. Could you share your submit file, any command-line flags
being passed to condor_submit, and the output of condor_version?
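For what it's worth, the difference on the command line is just this
(the submit file name is a placeholder):

  condor_submit jobs.sub          # no input spooling
  condor_submit -spool jobs.sub   # spools executable/inputs to the schedd first

If your submit machine can already see the input files at their original
paths, you likely don't need -spool at all.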
Also take a peek at the following condor_submit settings to speed up job
submission:
skip_filechecks = True
copy_to_spool = False (this is the default on current versions of
HTCondor, but if you are running an older release it could be an issue)
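In the submit file these look like (comments are mine):

  # skip the submit-time existence/permission checks on input and output files
  skip_filechecks = True
  # don't copy the executable into the schedd's SPOOL directory for each cluster
  copy_to_spool = False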
Hope the above helps,
Todd