
Re: [Condor-users] faster condor_submits with dagman



Hi Steve,

I've thought about multiple schedds. I've been hesitant to go that route, but I like the idea of one for dagman and one for the jobs.
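
Something like the following is roughly what I had in mind for the second schedd. This is an untested sketch based on the multiple-schedd recipes I've seen; the local name, log, and spool paths are placeholders:

# second schedd, dedicated to the node jobs (dagman would stay on the default one)
SCHEDD_JOBS      = $(SCHEDD)
SCHEDD_JOBS_ARGS = -local-name schedd_jobs
SCHEDD.SCHEDD_JOBS.SCHEDD_NAME = schedd_jobs
SCHEDD.SCHEDD_JOBS.SCHEDD_LOG  = $(LOG)/SchedLog.schedd_jobs
SCHEDD.SCHEDD_JOBS.SPOOL       = $(SPOOL).schedd_jobs
DAEMON_LIST = $(DAEMON_LIST), SCHEDD_JOBS

Jobs could then be pointed at it with condor_submit -name schedd_jobs@<host>, though I haven't worked out yet how to make dagman use it for the node jobs.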

When you ask about spool files, do you mean the input and output files, or something else?
For input files, there are three per job: ~4K, ~100K, and ~400K.
Output is one file, under 5K.

The variables you asked about are these:
JOB_START_COUNT = 50
JOB_START_DELAY = 2

But don't those just control the rate at which jobs start once they are already in the queue? I feel like my problem is the inability to get jobs into the queue quickly enough.
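
(If I'm reading those two correctly, they only throttle how fast the schedd spawns shadows: at most JOB_START_COUNT starts every JOB_START_DELAY seconds, so with 50 and 2 that would still allow roughly 25 job starts per second.)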

Peter


On Jun 30, 2010, at 13:47, Steven Timm wrote:

Peter--have you considered using more than one schedd on your
submitter? That is what some of the big virtual organizations do.
For example, CDF has one schedd to manage the dagmans and another
one to manage the jobs that they spawn.  At one time in the past
they used as many as four schedds for the jobs.  Basically
the dagman processing and the submission of the jobs that make up
the dag stages are competing for the condor_schedd's time.

Also, how many spool files do you have for each submitted job, and
how big are they? That could be a factor.

Also, what are the values of JOB_START_COUNT and JOB_START_DELAY?

Steve



On Wed, 30 Jun 2010, Peter Doherty wrote:

Hello,

I'm running a large number of short-running jobs (2 minutes, maybe?) on a large Condor pool. I know, I know, this isn't ideal and not Condor's design, and I should figure out a way to make the jobs longer running. But I want to work on this a little more.
It's a large Condor DAG managing the jobs.
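
For reference, the setup looks roughly like this (file and macro names here are placeholders, not the real ones):

# job.sub -- each DAG node runs one short task
universe                = vanilla
executable              = run_task.sh
arguments               = $(task_id)
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = small.dat, medium.dat, large.dat
output                  = task_$(task_id).out
error                   = task_$(task_id).err
log                     = tasks.log
queue

# tasks.dag -- thousands of independent nodes, no dependencies between them
JOB  task0001 job.sub
VARS task0001 task_id="0001"
JOB  task0002 job.sub
VARS task0002 task_id="0002"
# ... and so on, one JOB/VARS pair per task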

The jobs finish as fast as dagman can submit new ones into the queue, so eventually I go from 1000 idle jobs and 2000 running to 10 idle jobs and 2000 running, and I can't keep the queue full of pending jobs. I've moved the schedd's spool onto a RAM disk to try to improve throughput, and this helped somewhat but not enough. Any other suggestions for tuning the system for a higher rate of job throughput, before I give up and take a different approach?

Here are some of the variables I've been playing with, with limited success. The machine (schedd and collector/negotiator on the same host) is a 2.4 GHz 4-core AMD system with 8 GB of RAM.


SCHEDD_INTERVAL    = 30
DAGMAN_MAX_JOBS_IDLE = 1000
DAGMAN_SUBMIT_DELAY = 0
DAGMAN_MAX_SUBMITS_PER_INTERVAL = 1000
DAGMAN_USER_LOG_SCAN_INTERVAL = 1
SCHEDD_INTERVAL_TIMESLICE = 0.10
SUBMIT_SKIP_FILECHECKS = True
HISTORY =
NEGOTIATOR_INTERVAL = 30
NEGOTIATOR_MAX_TIME_PER_SUBMITTER=20
NEGOTIATOR_MAX_TIME_PER_PIESPIN=20


Thanks,
Peter

--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/