
Re: [HTCondor-users] Max Jobs



I'd like to add a few points to this discussion of how many jobs can be queued:

1. We have users who regularly have 150k+ jobs in the queue on one schedd.

2. Note that the number of jobs that can be queued grows horizontally with HTCondor, i.e. you can always add more schedds to your pool to manage more queued jobs whenever you want to submit more than a single schedd or a single submit server can handle.
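For example (purely a sketch; the local name "schedd2" and the knob values below are made up for illustration, so treat this as a starting point rather than a recipe), a second schedd can run alongside the default one on the same submit machine with config along the lines of
   # sketch only: "schedd2" is an illustrative local name
   SCHEDD2      = $(SCHEDD)
   SCHEDD2_ARGS = -local-name schedd2
   SCHEDD.SCHEDD2.SCHEDD_NAME = schedd2@
   SCHEDD.SCHEDD2.SCHEDD_LOG  = $(LOG)/SchedLog.schedd2
   SCHEDD.SCHEDD2.SPOOL       = $(SPOOL)/schedd2
   DAEMON_LIST = $(DAEMON_LIST), SCHEDD2
and jobs can then be directed at the extra schedd with condor_submit's -name option.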

3. It is much faster, and uses much less RAM, to submit many jobs as a single cluster of jobs. In other words, you are much better off running condor_submit once with something like
   executable = foo
   output = output.$(Cluster).$(Process)
   queue 50000
in your submit file than running condor_submit 50,000 times with something like
   executable = foo
   output = output.$(Cluster).$(Process)
   queue

4. If you want to submit more jobs than a single schedd (or submit server) can handle, you can already use DAGMan to describe a workflow containing all of your jobs (hundreds of thousands of them, or more), and then tell DAGMan to limit how many job clusters it keeps submitted to the schedd at once. E.g. if you want to submit a million jobs, make a submit file like the one above that submits 5000 jobs at a time as a DAG node, create a DAG with 200 instances of that node, and have DAGMan limit the number of simultaneously submitted job clusters to just a handful (depending on the number of machines in your pool); see the sketch below. See http://goo.gl/tvz5rn. We have users who regularly submit DAGs consisting of over 700k jobs.
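As a rough sketch (the file and node names here are just illustrative), the DAG file is nothing more than the same 5000-job node repeated:
   # million.dag -- each node submits one 5000-job cluster
   JOB chunk000 chunk.sub
   JOB chunk001 chunk.sub
   JOB chunk002 chunk.sub
   # ...and so on through chunk199 for a million jobs...
and condor_submit_dag's -maxjobs option caps how many of those clusters are in the queue at once, e.g.
   condor_submit_dag -maxjobs 5 million.dag
keeps at most 5 clusters (25,000 jobs) queued at any one time.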

5. We are exploring ways in the v8.3 development series to enable users to enqueue millions of jobs without requiring the use of DAGMan.

Hope the above pointers help,
Todd


On 6/3/2014 8:54 AM, Ben Cotton wrote:
Oops, forgot to include the link to the wiki page mentioned:

https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageLargeCondorPools

On Tue, Jun 3, 2014 at 9:40 AM, Ben Cotton
<ben.cotton@xxxxxxxxxxxxxxxxxx> wrote:
Suchandra,

The default value of MAX_JOBS_SUBMITTED is the largest integer
supported on your platform. However, there are some constraints that
may prevent you from reaching that limit. I have seen ~50k jobs in a
queue before, but condor_q calls can get pretty sluggish at that
point.
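If you want to see what your own schedd is configured to allow, you can
query the relevant knobs directly, e.g.:
   condor_config_val MAX_JOBS_SUBMITTED
   condor_config_val MAX_JOBS_RUNNING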

The HTCondor wiki[1] says "Schedd requires a minimum of ~10k RAM per
job in the job queue. For jobs with huge environment values or other
big ClassAd attributes, the requirements are larger." So, for example,
a 50,000-job queue works out to roughly 500 MB of schedd memory. You'll
also need more disk space with a larger job queue, but it's such a
small percentage of even the smallest disks these days that it's not
worth worrying about.

For our customers who use CycleServer to send jobs to schedulers, we
suggest setting the maximum queue size to be about 3 times the value
of MAX_JOBS_RUNNING. If you have something similar that buffers jobs,
then that seems reasonable. If you're only submitting directly to the
scheduler, then you will need to try different values to see what
works best for your use case.
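For instance (numbers purely illustrative), if your scheduler is set to
run at most 10,000 jobs at once, that rule of thumb would look like
   MAX_JOBS_RUNNING   = 10000
   MAX_JOBS_SUBMITTED = 30000
in the schedd's configuration.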


Thanks,
BC

--
Ben Cotton
main: 888.292.5320

Cycle Computing
Leader in Utility HPC Software

http://www.cyclecomputing.com
twitter: @cyclecomputing





--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685