
Re: [HTCondor-users] Error submitting a large amount of jobs at the same time, multiple times.



On 12/13/2012 4:03 PM, Michael Aschenbeck wrote:
> I have a program that automatically makes a submit file, submits it, and
> processes the results.  Executing it once works fine and the output is
> as expected.  Recently, I'm trying to add support for running 3+
> instances of this program at once, and running into errors when it
> submits the submit files.
>
> Details are as follows.  There are 99 "queues" in each submit file.
> Even when only submitting one of these files, it takes a very long
> time (on the order of 5 minutes) to go from a "submitting jobs"
> notification to the final output of "submitting jobs............ 99 jobs
> submitted to cluster x".

Before even going into the 3+ instances at once, I'd suggest getting to the bottom of why submitting 99 jobs from one instance of condor_submit takes 5 minutes. On my Windows 7 laptop it takes a couple of seconds.

When you say it takes 5+ minutes, does it take this long when submitting into an empty queue? If it takes this long when submitting to a schedd that already has jobs queued, how many of those jobs are running, and how long do they typically run? Do your jobs run for only a couple of seconds?

One guess for the slowness: if many jobs are completing every second, either the condor_schedd process or the file system holding the job queue could be swamped. If that is the case, the best thing to do would be to partition your work so that you have somewhat fewer jobs that each run somewhat longer. Alternatively, you could try tweaks such as setting the following in your condor_config:
  CONDOR_FSYNC = False
(see http://research.cs.wisc.edu/htcondor/manual/v7.9/3_3Configuration.html#16687 ). For really busy production submit points (we are talking thousands of simultaneously running jobs from one submit machine), some folks use an SSD drive to hold the SPOOL directory, or at least the contents of the job queue. See the condor_config knob "JOB_QUEUE_LOG" at

http://research.cs.wisc.edu/htcondor/manual/v7.9/3_3Configuration.html#16742
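
As a rough illustration of the "fewer jobs that run longer" idea, here is a minimal Python sketch that groups short tasks into batches, so that each HTCondor job would process one batch instead of one task. The function name partition_work and the tasks_per_job parameter are hypothetical and not part of HTCondor; how each batch is passed to a job is left to your own program.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def partition_work(items: Sequence[T], tasks_per_job: int) -> List[List[T]]:
    """Group short tasks into batches so each job handles several of them.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    if tasks_per_job < 1:
        raise ValueError("tasks_per_job must be at least 1")
    # Slice the task list into consecutive chunks; the last chunk may be shorter.
    return [
        list(items[start:start + tasks_per_job])
        for start in range(0, len(items), tasks_per_job)
    ]


if __name__ == "__main__":
    # 99 short tasks, 3 per job, gives 33 longer-running jobs.
    batches = partition_work(list(range(99)), tasks_per_job=3)
    print(len(batches))
```

With batches like these, the schedd sees fewer job completions per second for the same amount of work, which is the effect the suggestion above is after.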

Another guess: perhaps you are telling HTCondor to spool your executable or input files, for example via the "-remote" or "-spool" option to condor_submit? Spooling input files to a submit machine running on Windows can currently be slow, because input spooling on Windows blocks the scheduler from doing other work (like accepting new job submissions). Note that on Linux things are much faster because the input spooling occurs in a child process. Could you share your submit file, any command-line flags being passed to condor_submit, and the output of condor_version?

Also take a peek at the following condor_submit settings to speed up job submission (see http://research.cs.wisc.edu/htcondor/manual/v7.9/condor_submit.html):

   skip_filechecks = True
   copy_to_spool = False

(copy_to_spool = False is the default on current versions of HTCondor, but if you are running an older release it could be an issue.)
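
Since the submit file is generated by a program, here is a minimal Python sketch of how that generator might emit the two settings above. It is only an illustration under stated assumptions: the function name build_submit_description, the executable name worker.exe, and the output, error, and log file names are all hypothetical, and it assumes the jobs differ only by their process index, so that a single "queue N" statement with $(Process) can stand in for N separate queue statements.

```python
def build_submit_description(executable: str, n_jobs: int) -> str:
    """Return the text of a submit description file for n_jobs similar jobs.

    Hypothetical helper: file names and the vanilla universe are assumptions.
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        # $(Process) expands to 0, 1, ..., n_jobs - 1, one value per queued job.
        "arguments = $(Process)",
        "output = job_$(Process).out",
        "error = job_$(Process).err",
        "log = jobs.log",
        # The two settings suggested above.
        "skip_filechecks = True",
        "copy_to_spool = False",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the description; a real program would write it to a file
    # and then pass that file to condor_submit.
    print(build_submit_description("worker.exe", 99), end="")
```

If your jobs need different arguments or input files per job, the single queue statement will not fit as written, but the two settings can be added to the generated file in the same way.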

hope the above helps,
Todd