
Re: [Condor-users] Building GPU Cluster Using Condor



Super short answer: Carsten Aulbert has lots of good information
in his post.  It's a great starting point: http://www.cs.wisc.edu/condor/manual/v7.6/4_4Hooks.html#sec:daemon-classad-hooks


At the moment, Condor can help by advertising information about
the GPU or GPUs on your nodes.  Jobs can then select slots based
on that information.

Condor does not yet have automatic support to advertise
information about your GPUs.  (We're working on it right now!)
So you'll have to set it up yourself.  You have a few options:


1. If you have a small number of nodes, or perhaps a large number
of identical nodes, you can add static attributes manually using
STARTD_ATTRS (http://www.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#16198 ).
In the simplest case, it might just be:

HAS_GPU=TRUE
STARTD_ATTRS=HAS_GPU

Or you could give lots of information about the GPUs, and even
have slot-specific information.  For an extended example, see
Carsten's post.


2. You can write a program to automatically write your
configuration file.  This is still using STARTD_ATTRS, but scales
better.  This is actually how Carsten's configuration works; he
has some example code at the above link.


3. You can have Condor automatically run a program you provide to
learn about the GPUs.  These are the "Daemon ClassAd Hooks",
previously known as HawkEye and Condor Cron.
http://www.cs.wisc.edu/condor/manual/v7.6/4_4Hooks.html#sec:daemon-classad-hooks
This is the route taken by the condorgpu project you found.
Converting Carsten's scripts to work this way would be pretty
easy.
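
Options 2 and 3 both come down to a small program that discovers
the GPUs and prints attributes.  Here is a minimal sketch, not
anyone's actual setup.  It assumes NVIDIA hardware with
nvidia-smi on the PATH and Python 3.7 or later.  HAS_GPU is the
attribute from option 1; GPU_COUNT and the --config switch are
names made up for illustration.  With --config it prints
configuration file lines for option 2; otherwise it prints plain
"Name = Value" lines, which is what I'd expect a hook script for
option 3 to emit (check the hooks manual page for the details).

```python
import subprocess
import sys


def list_gpus():
    """Return the GPU lines printed by `nvidia-smi -L`, or [] if it fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        # No NVIDIA tools or no driver: treat the node as having no GPUs.
        return []
    return [ln.strip() for ln in result.stdout.splitlines() if ln.startswith("GPU ")]


def gpu_attributes(gpus):
    """Map attribute names to values already written in config syntax."""
    # GPU_COUNT is a hypothetical attribute name, not a Condor built-in.
    return {"HAS_GPU": "TRUE" if gpus else "FALSE", "GPU_COUNT": str(len(gpus))}


def as_config(attrs):
    """Option 2: render lines for a Condor configuration file."""
    lines = ["%s = %s" % (name, value) for name, value in attrs.items()]
    # Append to any existing STARTD_ATTRS rather than replacing it.
    lines.append("STARTD_ATTRS = $(STARTD_ATTRS), " + ", ".join(attrs))
    return "\n".join(lines) + "\n"


def as_hook_output(attrs):
    """Option 3: render one 'Name = Value' line per attribute on stdout."""
    return "".join("%s = %s\n" % (name, value) for name, value in attrs.items())


if __name__ == "__main__":
    attrs = gpu_attributes(list_gpus())
    render = as_config if "--config" in sys.argv[1:] else as_hook_output
    sys.stdout.write(render(attrs))
```

Keeping the detection separate from the two output formats means
you can start with a generated configuration file and move to a
hook later without rewriting the discovery code.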


One challenge is ensuring that you don't get conflicts when
multiple jobs try to use the same GPU.  If you have multiple
slots on a node (and you probably do), you'll need to ensure that
each slot ad only advertises the GPU assigned to that slot.
Carsten's example above gives each slot 1 GPU until it runs out
of GPUs.  Note that Condor can't currently enforce this.  So a job
will need to learn which GPU, if any, it is allowed to use from
the environment or the command line.  Again, see Carsten's
configuration; his submit file passes the GPU's device ID in as a
command line argument.
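
On the job side, that might look something like the sketch below.
This is not Carsten's code, just an illustration.  It assumes the
device ID arrives as the first command line argument, with a
fallback to an environment variable.  The variable name
GPU_DEVICE_ID is made up for this example; CUDA_VISIBLE_DEVICES
is the usual CUDA variable and only matters for NVIDIA jobs.

```python
import os
import sys


def pick_gpu(argv, environ):
    """Return the GPU device ID given to this job, or None if there is none.

    The first command line argument wins; GPU_DEVICE_ID (a hypothetical
    variable name) is consulted only when no argument was passed.
    """
    raw = argv[1] if len(argv) > 1 else environ.get("GPU_DEVICE_ID", "")
    raw = raw.strip()
    if not raw.isdigit():
        # Missing or non-numeric value: the job was not given a GPU.
        return None
    return int(raw)


if __name__ == "__main__":
    device = pick_gpu(sys.argv, os.environ)
    if device is None:
        print("no GPU assigned; running on CPU only")
    else:
        # Must be set before any CUDA library initializes in this process.
        os.environ["CUDA_VISIBLE_DEVICES"] = str(device)
        print("using GPU device %d" % device)
```

Since Condor isn't enforcing anything here, this only works if
every job on the node plays along and sticks to the device it
was handed.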

-- 
Alan De Smet                              Condor Project Research
adesmet@xxxxxxxxxxx                http://www.cs.wisc.edu/condor/