Re: [Condor-users] GPU and condor?

We have some experience building GPU clusters with Condor.
Currently we have two GPU clusters, used by different research
groups. Each cluster is composed of the following elements:
1. Head node: Runs the Condor server; users log in here to build
              their codes, submit jobs, etc.
2. File servers: A Lustre cluster filesystem is deployed.
3. Computing nodes: Each node has at least one and at most 4 GPUs.
                    Each cluster has more than 64 GPUs installed.
4. Communication: One cluster has an InfiniBand network, and the
                  other uses a Gigabit network.
Condor can allocate multiple GPUs to a user's job. In our
implementation the number of CPU cores in each computing node
is not important. So in the Condor command file, the
"machine_count" that users specify is actually the number of GPUs
required, and the number of GPUs in each node is hard-coded as
"NUM_CPUS" in the local Condor config file on each node.
Honestly, we are not Condor experts. Hence we also developed
some code to help Condor with more complicated tasks, such as
per-user quotas on the number of GPUs, GPU assignment, dead job
cleanup, etc. I guess all of these could be done by Condor itself;
we just don't know how, so we took the somewhat crude approach of
writing our own code for them.
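
To make the convention above concrete, here is a rough sketch of what a
small submit helper might look like. It is not our actual code, only an
illustration: the function names, the quota bookkeeping, and the file name
"job.log" are all hypothetical, and it assumes jobs run in the parallel
universe, since "machine_count" is a submit command of that universe.
Because each node advertises its GPU count as "NUM_CPUS", every slot
Condor hands out then stands for one GPU.

```python
def within_quota(gpus_in_use, user, requested, quota):
    """Return True if the user's running GPUs plus this request fit the quota.

    gpus_in_use is a hypothetical dict mapping user name -> GPUs in use.
    """
    return gpus_in_use.get(user, 0) + requested <= quota


def make_submit_description(executable, num_gpus):
    """Return submit-file text in which machine_count carries the GPU count."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    lines = [
        "universe = parallel",           # assumption: parallel universe jobs
        f"executable = {executable}",
        f"machine_count = {num_gpus}",   # slots requested = GPUs requested
        "log = job.log",                 # hypothetical log file name
        "queue",
    ]
    return "\n".join(lines) + "\n"


def prepare_job(gpus_in_use, user, executable, num_gpus, quota):
    """Return submit-file text, or None if the request exceeds the quota."""
    if not within_quota(gpus_in_use, user, num_gpus, quota):
        return None
    return make_submit_description(executable, num_gpus)


if __name__ == "__main__":
    # Example: alice already holds 2 GPUs and asks for 2 more, quota is 4.
    text = prepare_job({"alice": 2}, "alice", "my_gpu_app", 2, quota=4)
    print(text if text is not None else "request exceeds GPU quota")
```

The text returned by prepare_job would be written to a file and handed to
condor_submit; a request over the quota is simply refused before it ever
reaches Condor.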
Perhaps we can share experiences on this subject :)

2010/1/7 Marian Zvada <zvada@xxxxxxxx>
Dear Condor Folks,

is there anyone in the Condor user community who has built a GPU cluster based on Condor?
I mean someone who has worker nodes with GPU graphics cards, with job management done by Condor on top.

We are very interested in this topic and would like to build such an infrastructure (Condor + GPU worker nodes) for researchers in our organization.
In the first phase of this project we'd like to develop a standalone cluster:

- master condor head node
- 5 gpu worker nodes (each worker node 2x nVIDIA GTX295)
- storage element for data

I know there is a lot to find on Google about such experiments, but I wanted to ask Condor users directly for their opinions/suggestions/recommendations, since we are serious about building a Condor GPU cluster and using it in production for our research activities.

If there is someone who has done a similar setup and is willing to share the knowledge, I would appreciate talking about it! Any URL hints are welcome too...

Thanks and regards,
