
Re: [Condor-users] Preserving Data Locality in DAGs?



On 8/18/06, Armen Babikyan <armenb@xxxxxxxxxx> wrote:
Hello,

Can we start some discussion on this issue?

I have several multi-core machines that end up wasting a lot of
time/CPU utilization by ferrying large files between them, and it would
be incredibly advantageous to have some kind of feature that allows me
to group DAG nodes together so that they have a high preference to run
within the same pool of VMs on the same machine.

Would anyone else using Condor DAGs find this feature useful?  Is there
already a configuration parameter that spans the spectrum of data
locality, and I am overlooking it?  I'd be curious to see others' use
cases of Condor DAGs, too.

This is certainly a topic of interest on the list.

There are some serious constraints within negotiation which make
this hard in the case where an entire machine is free and the
negotiation cycle fills all of its slots at once, since the startd ad
will not have changed to indicate the presence of the friendly job
until after the cycle has finished.

Basically, any attempt to use this at serious throughput would need
the negotiator itself to handle the 'like to be near' semantics
internally.

If you could limit negotiation to matching only one job per cycle
(or one job per cluster per cycle), then you could start updating
ClassAds to indicate the presence of the data and rank accordingly.
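To make that concrete, here is a sketch of the rank half, assuming
(hypothetically) that each startd is made to advertise a string
attribute named CachedDatasets listing the datasets already on its
local disk, and that some external mechanism keeps that attribute
current. The attribute name, dataset name, and executable are made up,
and the exact config syntax may need adjusting for your Condor
version:

```
## execute-node condor_config fragment (attribute name is an assumption)
CachedDatasets = "none"
STARTD_EXPRS = $(STARTD_EXPRS), CachedDatasets

## job submit file: prefer machines that already hold dataset "setA"
universe   = vanilla
executable = process_step.sh
rank       = stringListMember("setA", TARGET.CachedDatasets)
queue
```

Since rank treats TRUE as greater than FALSE, a machine already
holding the dataset wins over one that does not, without excluding the
others.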

This then has issues with regard to removing the data and indicating
that this has been done. (If you have strong conventions on the naming
and location of the data, as I would assume, then a Hawkeye job may be
able to manage the update for you, if not the removal.)
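The update half could be a small Hawkeye-style module run periodically
on each execute node that prints the attribute for the startd to merge
into its ad. The cache directory convention and attribute name below
are assumptions, not anything Condor defines:

```python
#!/usr/bin/env python
# Hypothetical Hawkeye-style module: print a ClassAd attribute naming
# the datasets found under a well-known local cache directory.
import os

def cached_datasets_ad(names):
    """Format a list of dataset names as a ClassAd string attribute."""
    return 'CachedDatasets = "%s"' % ",".join(sorted(names))

if __name__ == "__main__":
    CACHE_DIR = "/scratch/datasets"  # assumed local staging convention
    names = os.listdir(CACHE_DIR) if os.path.isdir(CACHE_DIR) else []
    print(cached_datasets_ad(names))
```

Removal is the harder side: the module can only report what it sees,
so something else has to decide when a dataset is no longer needed and
delete it.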

You may be able to hack this in by only releasing one job at a time.
Obviously this seriously constrains your start latency, and possibly
your throughput if the jobs are not significant in terms of time to
completion.
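A crude throttle along those lines might submit the jobs with
hold = true and release them one at a time, waiting until no job is
idle before releasing the next. condor_q and condor_release are real
tools, but the cluster id, job count, and polling interval below are
placeholders:

```python
#!/usr/bin/env python
# Sketch: release held jobs one at a time so at most one job is being
# matched per negotiation cycle.
import subprocess
import time

def should_release(idle_count, max_idle=1):
    """Release another held job only when fewer than max_idle jobs are
    idle, keeping at most one job in the matchmaking queue at a time."""
    return idle_count < max_idle

def count_idle(cluster):
    """Count idle jobs (JobStatus == 1) in a cluster via condor_q."""
    out = subprocess.check_output(
        ["condor_q", str(cluster), "-constraint", "JobStatus == 1",
         "-format", "%d\n", "ClusterId"])
    return len(out.splitlines())

def release_one(cluster, proc):
    subprocess.check_call(["condor_release", "%d.%d" % (cluster, proc)])

if __name__ == "__main__":
    CLUSTER = 1234   # hypothetical cluster id of the held jobs
    NPROCS = 10      # hypothetical number of jobs in that cluster
    released = 0
    while released < NPROCS:
        if should_release(count_idle(CLUSTER)):
            release_one(CLUSTER, released)
            released += 1
        time.sleep(30)  # poll interval; tune to your negotiation cycle
```

The per-release wait is exactly the start-latency cost mentioned
above, which is why this only pays off for long-running jobs.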

This would not be perfect, since too many jobs, or another job taking
a friend's slot, would mean a job has no choice but to go elsewhere.

I suggest this merely as a starting point in case you were thinking of
hacking it in yourself. I doubt it would perform well as described
without some serious tinkering and babysitting, but it indicates how
you might proceed.

From the Condor perspective, there are several semi-external tools
designed with data availability in mind, like Stork
(http://www.cs.wisc.edu/condor/stork/), which may be more likely to
benefit from such integration and thus more likely to provide
assistance.

Matt