
[condor-users] tips on running a large queue



[This was a ticket to condor-admin, but I'm sending it to condor-users because
I thought it would be useful to other people.]

-=-=-=-=-=-=-=-=-=-=-=-=-=-

> 
> Greetings,
> 

<...>

> 
> I am running a very large set of auto-calibration models which
> will spit out 30,000 runs per set.  The number of sets will depend
> on how well the model is self-calibrating.  The joy of this is that
> each run only takes approximately 10 seconds of execute time,
> a perfect app for high-throughput computing!
> 

I agree - it's perfect for high-throughput computing, but each run
may not be a perfect Condor job.

> I have a Perl script that runs a series of 'condor_run' commands
> that submits 20 runs at a time.  What I am finding is that if the
> queue gets backlogged to anywhere from 300-500 items, the submit
> machine's condor daemons choke and everything gets confused.
> Error logs indicate 'shadow exception' errors.  The submit machine
> is a PIII w/ 256Mb of RAM and 2Gb of swap space and 4Gb of
> free-disk space (only need <1Gb for all 30,000 runs).  I know this
> is not much of a system, but could this be causing my problems?

It's maybe a little light on RAM - with a lot of shadows running I could
see Condor wanting more, but it should be OK.

> 
> Do you have any suggestions as to what I can do to increase the
> queue limits?  I have played with some of the settings in the
> 'condor_config' file, but it hasn't had any apparent effect.
> The following is the only piece I see that relates to the number
> of jobs that can run.
> 
> ##--------------------------------------------------------------------
> ##  Miscellaneous:
> ##--------------------------------------------------------------------
> ##  Try to save this much swap space by not starting new shadows.
> ##  Specified in megabytes.
> RESERVED_SWAP           = 25
> 
> ##  What's the maximum number of jobs you want a single submit machine
> ##  to spawn shadows for?
> MAX_JOBS_RUNNING        = 1500
> 
> 
> Any help is appreciated,

Condor does not deal well with lots and lots of short-running jobs - it's
probably not the 30,000 jobs that are causing the problems, but the jobs
completing every 10 seconds. Operations that need us to modify the job
queue file on disk (things like submits and job completions) are expensive,
so we want to minimize them as much as possible, and when they do need to
happen, to batch them up as much as possible.

The first thing we'd suggest is to get rid of the 'condor_run' approach and
replace it with condor_submit, submitting jobs 20 at a time by putting
something like 'queue 20' at the end of the submit file. condor_run runs
condor_submit internally, and each time you invoke 'condor_submit' you'll
block the condor_schedd for a few seconds, so the fewer times you run
condor_submit, the better.
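
As a sketch, a submit description file along these lines (the paths and
file names are just placeholders for your actual model) queues 20 runs
with a single condor_submit invocation:

    # calib.sub - hypothetical submit description file
    universe    = vanilla
    executable  = /full/path/to/model
    arguments   = run.$(Process).in
    output      = run.$(Process).out
    error       = run.$(Process).err
    log         = calib.log
    queue 20

Running 'condor_submit calib.sub' then puts 20 jobs in the queue with one
hit on the schedd, instead of 20 separate submits.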

The next thing we'd suggest is to try to eliminate as much of the overhead
as possible from condor_run - you'll have to edit the script to do this, but
there are two things you could cut out:

1. Where it calls 'condor_submit', change it to 'condor_submit -d', which will
disable the file permission checks that Condor does at submit time. You'll
have to make sure you get the permissions right yourself, but it will speed up
the submit process quite a bit.

2. Try to pre-stage as much of your job as possible. Pre-stage your binary
on your execute machines, put 'transfer_executable = false' in your submit
file, and make sure that 'executable = /full/path/to/executable' is a valid
path on the execute machine (there's a sketch of the relevant submit file
lines below). Transferring executables is a major part of the overhead of
starting a job up, so if you have lots and lots of short-running jobs you
wind up copying the same file over and over to execute machines.
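
As a sketch, the submit file lines for item 2 would look something like this
(the path is just a placeholder):

    # binary already copied to this path on every execute machine
    transfer_executable = false
    executable          = /full/path/to/model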

The next thing we'd suggest is to batch up individual runs as much as
possible - instead of each machine doing one piece of work for 10 seconds,
can each machine get 60 pieces of work at the beginning and work for
10 minutes before asking for more work?
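
One rough way to do that is to submit a small wrapper script as the Condor
job and have it loop over a batch of inputs; the script and file names here
are hypothetical:

    #!/bin/sh
    # run_batch.sh - run each input file named on the command line
    # through the (pre-staged) model, one after another
    for input in "$@"
    do
        /full/path/to/model "$input"
    done

In the submit file, 'executable = run_batch.sh' with 60 input files listed
in 'arguments' turns 60 ten-second runs into one ten-minute job.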

Try turning off the history file on your submit machine - there's an entry
in the config file called "HISTORY", which is the file that Condor uses
to store information about jobs that have completed and left the queue.
Condor goes to great lengths to make sure this file is updated safely, and
does lots of file locking and transactional writes to it so it can recover
from a crash if need be. With runs of 30,000 jobs, it seems likely that you
don't care about all of the details of run #19,274. If the HISTORY file is
not defined, we skip this expensive step.
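
In the submit machine's condor_config, that means commenting out (or
removing) the HISTORY entry - assuming the usual default, something like:

    ##  Where the history of completed jobs is kept.  Comment this out
    ##  to skip the (expensive) history file updates entirely.
    #HISTORY = $(SPOOL)/history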

Our final suggestion would be to consider using a higher-level system such
as MW (http://www.cs.wisc.edu/condor/mw/) to farm out work - MW uses 
Condor to marshal resources, but is a much more specialized scheduler and
work distribution system that eliminates much of the overhead of using Condor.

Good luck!

-Erik
