
Re: [HTCondor-users] Max Jobs



I'd like to add a few points to this discussion of how many jobs can be queued:

1. We have users who regularly have 150k+ jobs in the queue on one schedd.

2. Note that the number of jobs that can be queued grows horizontally with HTCondor, i.e. you can always add more schedds to your pool to manage more queued jobs whenever you want to submit more than a single schedd or a single submit server can handle.
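For example (purely a sketch; the local name "schedd2" and the knob values below are made up for illustration, so treat this as a starting point rather than a recipe), a second schedd can run alongside the default one on the same submit machine with config along the lines of
   # sketch only: "schedd2" is an illustrative local name
   SCHEDD2      = $(SCHEDD)
   SCHEDD2_ARGS = -local-name schedd2
   SCHEDD.SCHEDD2.SCHEDD_NAME = schedd2@
   SCHEDD.SCHEDD2.SCHEDD_LOG  = $(LOG)/SchedLog.schedd2
   SCHEDD.SCHEDD2.SPOOL       = $(SPOOL)/schedd2
   DAEMON_LIST = $(DAEMON_LIST), SCHEDD2
and jobs can then be directed at the extra schedd with condor_submit's -name option.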

3. It is much faster, and uses much less RAM, to submit many jobs as a single cluster of jobs. In other words, you are much better off running condor_submit once with something like
   executable = foo
   output = output.$(Cluster).$(Process)
   queue 50000
in your submit file than running condor_submit 50,000 times with something like
   executable = foo
   output = output.$(Cluster).$(Process)
   queue

4. If you want to submit more jobs than a single schedd (or submit server) can handle, you can already use DAGMan to describe a workflow containing all of your jobs (hundreds of thousands of them, or more), and then tell DAGMan to limit how many job clusters it keeps submitted to the schedd at once. E.g. if you want to submit a million jobs, make a submit file like the one above that submits 5000 jobs at a time as a DAG node, create a DAG with 200 instances of that node, and have DAGMan limit the number of simultaneously submitted job clusters to just a handful (depending on the number of machines in your pool); see the sketch below. See http://goo.gl/tvz5rn. We have users who regularly submit DAGs consisting of over 700k jobs.
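As a rough sketch (the file and node names here are just illustrative), the DAG file is nothing more than the same 5000-job node repeated:
   # million.dag -- each node submits one 5000-job cluster
   JOB chunk000 chunk.sub
   JOB chunk001 chunk.sub
   JOB chunk002 chunk.sub
   # ...and so on through chunk199 for a million jobs...
and condor_submit_dag's -maxjobs option caps how many of those clusters are in the queue at once, e.g.
   condor_submit_dag -maxjobs 5 million.dag
keeps at most 5 clusters (25,000 jobs) queued at any one time.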

5. We are exploring ways in the v8.3 development series to enable users to enqueue millions of jobs without requiring the use of DAGMan.

Hope the above pointers help,
Todd


On 6/3/2014 8:54 AM, Ben Cotton wrote:
Oops, forgot to include the link to the wiki page mentioned:

https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageLargeCondorPools

On Tue, Jun 3, 2014 at 9:40 AM, Ben Cotton
<ben.cotton@xxxxxxxxxxxxxxxxxx> wrote:
Suchandra,

The default value of MAX_JOBS_SUBMITTED is the largest integer
supported on your platform. However, there are some constraints that
may prevent you from reaching that limit. I have seen ~50k jobs in a
queue before, but condor_q calls can get pretty sluggish at that
point.
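If you want to see what your own schedd is configured to allow, you can
query the relevant knobs directly, e.g.:
   condor_config_val MAX_JOBS_SUBMITTED
   condor_config_val MAX_JOBS_RUNNING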

The HTCondor wiki[1] says "Schedd requires a minimum of ~10k RAM per
job in the job queue. For jobs with huge environment values or other
big ClassAd attributes, the requirements are larger." So, for example,
a 50,000-job queue works out to roughly 500 MB of schedd memory. You'll
also need more disk space with a larger job queue, but it's such a
small percentage of even the smallest disks these days that it's not
worth worrying about.

For our customers who use CycleServer to send jobs to schedulers, we
suggest setting the maximum queue size to be about 3 times the value
of MAX_JOBS_RUNNING. If you have something similar that buffers jobs,
then that seems reasonable. If you're only submitting directly to the
scheduler, then you will need to try different values to see what
works best for your use case.
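For instance (numbers purely illustrative), if your scheduler is set to
run at most 10,000 jobs at once, that rule of thumb would look like
   MAX_JOBS_RUNNING   = 10000
   MAX_JOBS_SUBMITTED = 30000
in the schedd's configuration.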


Thanks,
BC

--
Ben Cotton
main: 888.292.5320

Cycle Computing
Leader in Utility HPC Software

http://www.cyclecomputing.com
twitter: @cyclecomputing





--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685