
Re: [HTCondor-users] best way to use cached data



On Monday, 10 December, 2012 at 12:59 PM, Dimitri Maziuk wrote:
On 12/09/2012 11:36 PM, John Wong wrote:

That is a different story I think. I'd love to see a node-level data
placement mechanism in condor, or at least the ability to evaluate ` [
-f /var/tmp/mydatabase ] ` at job submission time, but I don't believe
you can.
Perhaps I'm misunderstanding what you're after here but why don't you have this now?

Job A runs on Machine A and brings along a subset of your massive dataset into some place like /tmp/cache. Before the job exits it leaves a small bit of Condor configuration in the ~condor/config directory, let's call it cache_contents.config, and the file simply says:

MyCacheContents = "subsetXYZ123"
STARTD_ATTRS = $(STARTD_ATTRS), MyCacheContents

And it advertises the cache contents to the world by running:

condor_reconfig -full

before it finally exits.
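A rough shell sketch of that end-of-job step, assuming the subset name "subsetXYZ123" from above; the config directory path is a placeholder, so point CONFIG_DIR at wherever your startd reads local config (e.g. ~condor/config):

```shell
#!/bin/sh
# Where the startd picks up local config; on a real node this would be
# something like ~condor/config. Here we default to a temp dir so the
# sketch is safe to run anywhere.
CONFIG_DIR="${CONFIG_DIR:-$(mktemp -d)}"
mkdir -p "$CONFIG_DIR"

# Advertise which slice of the dataset this node now holds in /tmp/cache.
# The quoted heredoc keeps $(STARTD_ATTRS) literal for Condor to expand.
cat > "$CONFIG_DIR/cache_contents.config" <<'EOF'
MyCacheContents = "subsetXYZ123"
STARTD_ATTRS = $(STARTD_ATTRS), MyCacheContents
EOF

# Push the new attribute into the machine ClassAd, if Condor is present.
if command -v condor_reconfig >/dev/null 2>&1; then
    condor_reconfig -full
fi

echo "wrote $CONFIG_DIR/cache_contents.config"
```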

Now the ClassAd for the machine contains the attribute:

MyCacheContents = "subsetXYZ123"

And jobs can steer based on this string by putting:

rank = (MyCacheContents =!= UNDEFINED && MyCacheContents == "subsetXYZ123") * 1000

in their submit files. If the machine already has that subset of the data cached, the job will rank it higher than any other machine and prefer to run there first.

Adjust to suit your tastes for preemption and what not.
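Put together, a minimal submit description using such a rank expression might look like the sketch below; the executable and arguments are placeholders, not anything from a real setup:

```
universe   = vanilla
executable = analyze.sh
arguments  = subsetXYZ123

# Prefer machines that already cached subsetXYZ123. The =!= UNDEFINED
# guard keeps the expression well-defined on machines that never set
# the attribute at all.
rank = (MyCacheContents =!= UNDEFINED && MyCacheContents == "subsetXYZ123") * 1000

queue
```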

Simple but effective. If you want fuzzier logic than plain string matching for steering jobs to machines, you can encode additional information in the identifying string for the cache contents.

Regards,
- Ian

-- 
Ian Chesal

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools
888.292.5320

http://www.cyclecomputing.com
http://www.cyclecloud.com
http://twitter.com/cyclecomputing