
Re: [Condor-users] GPU and condor?



Hello Xiang Ni,

Actually we did not do anything to make condor aware of the existence of GPUs.
What we have done is simple and somewhat stupid: it is hard coded.

Let me post the condor settings of one of our computing nodes, so that our
implementation is clear:

===================================================================
DAEMON_LIST = MASTER, STARTD
NETWORK_INTERFACE = 192.168.2.123
NODE_ID  = 122
NUM_CPUS = 2
MTYPE    = "GPU_2G"
NETTYPE  = "GB"
N_HWCPUS = 2
STARTD_ATTRS = "$(COLLECTOR_HOST)", NODE_ID, MTYPE, NETTYPE, N_HWCPUS
===================================================================

This is the local condor config. file of the 192.168.2.123 node. In this node we installed 2 GPUs,
model GTX-285, each with 2GB of GPU memory. So we define a new attribute "MTYPE"
with the value "GPU_2G", and force condor to believe that this node has 2 GPUs
(actually, condor thinks that it has 2 CPUs) by setting NUM_CPUS=2.

Therefore, if you want to mix machines, some with GPUs and some without, then in our
simple implementation we just set a different value of MTYPE on each machine, and
ask users to specify "Requirements" in their condor command file, in order to submit
their jobs to the correct group of nodes.
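
For example, a condor command file along the following lines should send a job
to the "GPU_2G" nodes only. This is just a sketch: the executable and file names
are hypothetical, and the parallel universe is an assumption that goes with the
use of "machine_count" to request GPUs (it needs a dedicated scheduler to be
configured in the pool).

===================================================================
# Sketch of a condor command file (names below are hypothetical)
universe      = parallel
executable    = my_gpu_job
# In our setup each "machine" slot stands for one GPU
machine_count = 2
# Match only nodes whose local config advertises MTYPE = "GPU_2G"
Requirements  = (MTYPE == "GPU_2G")
output        = my_gpu_job.$(NODE).out
error         = my_gpu_job.$(NODE).err
log           = my_gpu_job.log
queue
===================================================================

A node without GPUs would advertise some other MTYPE value in its local config
file, so it never matches this Requirements expression.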

In this way, probably any standard condor distribution can be used in a GPU cluster.

Cheers,

T.H.Hsieh


2010/1/7 Xiang Ni <nixiang.nn@xxxxxxxxx>
Hi Tung-Han Hsieh,

Thanks, your sharing is very helpful!

I'm also interested in this topic and I have a question.

How do you make condor aware of the existence of GPUs? By modifying the Hawkeye?

Thanks!

Regards,

2010/1/7 Tung-Han Hsieh <tunghan.hsieh@xxxxxxxxx>:
> Hello,
>
> We have some experience building GPU clusters using condor.
>
> Currently we have two GPU clusters, used by different research
> groups. Each cluster is composed of the following elements:
>
> 1. Head node: Runs the condor server; users log in here to build
>               their codes, submit jobs, etc.
>
> 2. File servers: The Lustre cluster filesystem is deployed.
>
> 3. Computing nodes: Each node has at least one and at most 4 GPUs.
>                     Each cluster has more than 64 GPUs installed.
>
> 4. Communication: One cluster has an InfiniBand network, and the
>                   other uses a Gigabit network.
>
> Condor can allocate multiple GPUs to users. In our implementation
> the number of CPU cores in each computing node is not important.
> So when users specify "machine_count" in the condor command file,
> they are actually specifying the number of GPUs required. The
> number of GPUs in each node is hard coded as "NUM_CPUS" in the
> local condor config. file of each node.
>
> Honestly, we are not condor experts. Hence we also developed some
> codes to help condor do more complicated tasks, such as user quotas
> on the number of GPUs, GPU assignment, dead job cleaning, etc.
> I guess all of these could be done by condor itself. We just don't
> know how, so we took the somewhat stupid way of writing codes to
> do them.
>
> Perhaps we can share experience on this subject :)
>
>
> Cheers,
>
> T.H.Hsieh
>
>
> 2010/1/7 Marian Zvada <zvada@xxxxxxxx>
>>
>> Dear Condor Folks,
>>
>> is there someone in the Condor users' community who has built a GPU cluster
>> based on condor?
>> I mean someone who has worker nodes with GPU graphics cards, with job
>> management done by condor on top.
>>
>> We are very interested in this topic and would like to build such an
>> infrastructure (condor + gpu worker nodes) for research people in our
>> organization.
>> In the first phase of this project we'd like to develop a standalone cluster:
>>
>> - master condor head node
>> - 5 gpu worker nodes (each worker node 2x nVIDIA GTX295)
>> - storage element for data
>>
>> I know there is a lot to see on google about such experiments, but I
>> wanted to ask condor users directly for their
>> opinions/suggestions/recommendations, since we are serious about building a
>> condor gpu cluster and using it in production for our research activities.
>>
>> If there is someone who has done a similar setup and is willing to share the
>> knowledge, I would appreciate talking about it! Any url hints are welcome too...
>>
>> Thanks and regards,
>> Marian
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/condor-users/



--
Xiang Ni
Sino-German Joint Software Institute
Computer Science & Engineering Department of Beihang University
100191