
Re: [Condor-users] Betr.: Running short-lived jobs on Condor



On 9/28/06, Zeeuw, L.V. de <L.V.de.Zeeuw@xxxxxx> wrote:
> LS,
>
> We are facing more or less the same challenge. We have a large pool (>1500 XP execution nodes) and one central machine from which we submit jobs. When we submit, say, 1000 small jobs of about 30 seconds each, it takes roughly 45 minutes for the results to come back to the submitting host from the hundreds of available execution hosts.
>
> So, also for us, any pointers to optimize for small jobs are appreciated.

Forget the small jobs for now (though they don't help) - the setup you
have is flat-out untenable for throughput with Condor.*

You have put a vast farm behind a single bottleneck and central point
of failure. This is a bad idea.
If you wish to have a central submit point, that's fine - just farm off
the actual submits to one of several schedds.

I appreciate this rather blasé statement is harder to implement than
it sounds, but trust me on this - you will never get decent
performance out of a 1500 node farm with one schedd.
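One way to do that farming-off is a thin wrapper on the central machine that round-robins each submit to a different schedd via condor_submit's -remote option. A minimal sketch - the schedd names here are made up, and a real setup may also need -spool or appropriate security config:

```shell
#!/bin/sh
# Hypothetical schedd names - replace with the schedds in your pool.
SCHEDDS="schedd-a.example.com schedd-b.example.com schedd-c.example.com"

# pick_schedd N: print the (N mod count)-th schedd name, round-robin.
pick_schedd() {
    n=$1
    set -- $SCHEDDS
    shift $(( n % $# ))
    printf '%s\n' "$1"
}

# Dry run: show where each of six submit files would be sent.
# Drop the echo to actually run condor_submit.
i=0
while [ $i -lt 6 ]; do
    echo condor_submit -remote "$(pick_schedd $i)" "job$i.sub"
    i=$(( i + 1 ))
done
```

The point is simply that the users keep one submit host to log into, while the scheduling load is spread over several schedd daemons on separate machines.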

If you cannot make your jobs more 'chunky', there is the alternative
of using something like Technion's Condor enhancements to reduce the
cost of the matchmaking and claim process. This effectively does the
chunking for you, but it will never be as good at it as you can be
(especially if you avoid re-transferring input data).
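Doing the chunking yourself can be as simple as a wrapper that runs a batch of your 30-second tasks inside a single Condor job, so the matchmaking and claim overhead is paid once per chunk instead of once per task. A rough sketch, assuming you can express each task as one command line in a file (the file layout is an assumption of this example, not anything Condor prescribes):

```shell
#!/bin/sh
# run_chunk FILE: execute each line of FILE as a shell command,
# sequentially, inside one Condor job. 1000 tasks split into chunks
# of 50 become 20 jobs - 20 matchmaking cycles instead of 1000.
run_chunk() {
    while IFS= read -r task; do
        sh -c "$task"
    done < "$1"
}
```

You would then queue one job per chunk file (e.g. with a submit file whose arguments use $(Process) to pick the chunk), and collect each chunk's combined output as a single result transfer.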

* If you meant you have a central submit machine with multiple schedd
daemons running on it, then it is less of a problem - but still bad.

Matt