
Re: [Condor-users] Problems with jobs



On 12/8/05, Chris Miles <chrismiles@xxxxxxxxxxxxxxxx> wrote:
> While I appreciate that running one job on one node may take longer
> because of overheads, I would think that when submitting 50 jobs
> simultaneously the overheads would all be incurred at the same time,
> still giving me an overall speed increase, because I would not have had
> time to run those 50 jobs individually.

Not if the overhead of negotiation / data transfer (neither of which
would occur if you ran locally) is sizable compared to the length of
the job.

> Might it be possible to have my condor job do more than 1 of my jobs?

Absolutely - this is a much better idea.

> At the moment I use the same executable and the same dataset on every
> job that runs, but the application does slightly different work with
> slightly different results, and then prints the output to the console.
>
> Might it make a difference for example..

a massive difference, this is more efficient on just about every front.

> I don't want to change my main console application code, but I could
> write another small front end program as the main executable which will
> then fire my job.
>
> for example
>
> 1. submission file specifies executable = frontend, and transfers
>    console_som and dataset as required files.
>
> 2. submission file specifies the process id as an argument to the front end
>
> 3. on the condor node frontend runs 100 times piping the results from
> console_som into an output file
>
> 4. returns 100 output files back to me (from each machine)
>
> better idea?

Much, much better - you can then tune how long a job (in the Condor
sense) will take.
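
To make that concrete, here is a sketch of what the submit file might
look like (the file names, the argument scheme, and the count of 50 are
illustrative, not taken from your setup):

    # sketch: one cluster of 50 frontend jobs
    universe                = vanilla
    executable              = frontend
    arguments               = $(Process)
    transfer_input_files    = console_som, dataset
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    output                  = frontend.out.$(Process)
    error                   = frontend.err.$(Process)
    log                     = frontend.log
    queue 50

and the frontend itself could be as small as this Python sketch (the
console_som command line and the output naming are assumptions):

    #!/usr/bin/env python
    # Hypothetical frontend: runs console_som 100 times on the node,
    # using the Condor process id (argv[1]) to pick which slice of
    # the overall work this job is responsible for.
    import subprocess
    import sys

    proc_id = int(sys.argv[1])  # the $(Process) value from the submit file

    for i in range(100):
        work_item = proc_id * 100 + i  # assumed scheme for carving up work
        with open("output.%d" % work_item, "w") as out:
            # placeholder command line - substitute whatever tells
            # console_som which variant of the work to do
            subprocess.call(["./console_som", "dataset", str(work_item)],
                            stdout=out)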

Note for Windows users: process creation costs on Windows can be quite
sizable, so if each run is going to be on the order of seconds you may
want to consider keeping the iterative loop *in process*. This probably
gets better cache-related behaviour too.
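
A minimal sketch of the *in process* variant, assuming the console
app's core logic can be exposed as an importable function (run_som is a
hypothetical name):

    # Call the work function directly in a loop rather than paying
    # process creation cost once per iteration.
    from console_som import run_som  # assumes the logic is importable

    def frontend(proc_id, iterations=100):
        for i in range(iterations):
            work_item = proc_id * iterations + i
            result = run_som("dataset", work_item)  # no fork/exec per item
            with open("output.%d" % work_item, "w") as out:
                out.write(result)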

It is important when designing tasks for a distributed system to
factor in data transfer costs and framework overheads.

For Condor in particular this means:

1) If your main output of useful data goes to standard out or standard
err then you have no control over how Condor handles that data. For
instance, you may wish to bzip2 / 7z / zip the whole thing at the end
if you know the cost of compression is offset by the reduction in data
transfer costs (or indeed you want the data stored in a compressed form
anyway, so why not do that as part of the job).
On a related note, if the output is only needed for error
handling/debugging then keep it terse in the common case.
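
A sketch of the compress-at-the-end idea, assuming the job has written
its useful output to results.txt (the file name is made up):

    # Compress the collected output before Condor transfers it back;
    # worthwhile only if the transfer savings beat the CPU cost.
    import bz2

    with open("results.txt", "rb") as src:
        with open("results.txt.bz2", "wb") as dst:
            dst.write(bz2.compress(src.read()))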

2) If your jobs run in < 10 mins, getting a *claim* on a machine is
relatively expensive (in terms of costs not linked in any way to what
your job must do to actually run). Sending another job to an existing
claim is very fast in comparison.
Thus tools such as DAGMan, used without significant amounts of
parallelism in the DAG, can hurt throughput (compared to just running
the same linear set of jobs from a plain old batch script), as
illustrated below.
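
For illustration, this is the sort of DAG being warned about - three
short serial steps with no parallelism, where each node goes through
the schedd as a separate job and so can pay the full matchmaking cost
(the file names are hypothetical):

    # linear.dag
    JOB step1 step1.sub
    JOB step2 step2.sub
    JOB step3 step3.sub
    PARENT step1 CHILD step2
    PARENT step2 CHILD step3

If the steps are quick, a single job that runs all three back to back
from one script avoids the extra scheduling rounds.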

3) The schedd can only do one thing at once* and it has to do a lot.
Since it must negotiate every so often and cannot activate claims
while doing so, spending 1 minute out of every 5 on negotiation
can have a serious impact on your throughput.
I have a script which shows my farm's state and colours the
Claimed/Busy vs Claimed/Idle machines differently; it is a very good
and quick way of spotting inefficiencies creeping in.
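
A rough sketch of that sort of monitoring script (minus the colours),
assuming condor_status is on the PATH; State and Activity are standard
machine attributes:

    #!/usr/bin/env python
    # Count slots by State/Activity so Claimed/Idle machines (claims
    # holding a slot but doing no useful work) stand out.
    import subprocess
    from collections import Counter

    out = subprocess.run(
        ["condor_status",
         "-format", "%s/", "State",       # e.g. Claimed, Unclaimed
         "-format", "%s\n", "Activity"],  # e.g. Busy, Idle
        capture_output=True, text=True, check=True).stdout

    counts = Counter(line for line in out.splitlines() if line)
    for state, n in sorted(counts.items()):
        print("%6d  %s" % (n, state))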

* I really hope this changes in future; there seems to be ongoing work
on this, at least for the low-hanging fruit.

There are probably some more 'best practices' for using Condor with
smaller jobs, but these are probably the most important off the top
of my head.

Matt