
Re: [Condor-users] GPU and condor?


Right. Although a single GPU has many cores, in Nvidia's current products all of the
cores in a GPU must cooperate on a single job. This is why it reaches teraflop-scale
performance.

Nvidia's next-generation product, called "Fermi", will be able to run many different
jobs concurrently on a single GPU, although the mechanism is still unknown. I think
our scheme will have to change to accommodate this new kind of GPU.

But back to the present. Since each GPU accepts only one job, the number of
CPU cores in each node is not very relevant. Hence, in our Condor configuration,
we make Condor ignore the CPUs completely and set the number of GPUs in the
local Condor config file, so each slot that appears in condor_status corresponds to
a GPU. This keeps the configuration simple. Our applications run mostly on the
GPUs, so the CPU load is relatively low.
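A minimal sketch of this kind of local config might look like the following (the
GPU count is illustrative; the trick is simply to tell Condor how many slots to
advertise, one per GPU, instead of letting it count the real CPU cores):

```
# Local Condor config sketch: advertise one slot per GPU.
# This node has 2 GPUs, so pretend it has 2 "CPUs"; Condor then
# shows 2 slots in condor_status, each standing in for one GPU.
NUM_CPUS = 2
```

With this, a job matched to a slot effectively owns one GPU, regardless of how many real CPU cores the node has.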

We also have codes that use both CPUs and GPUs. In this case we divide the
CPU cores evenly among the GPUs. For example, on a node with 8 CPU cores
and 2 GPUs installed, each slot gets one GPU and 4 CPU cores. This is our
working assumption; in most cases the number of CPU cores should be greater
than (or at least equal to) the number of GPUs. We can then design scripts for
users that restrict the number of CPU cores they may use in each slot, for both
OpenMP and MPI.
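For the 8-core, 2-GPU example, the even split could be expressed with static slot
types roughly like this (a sketch only; the counts are from the example above):

```
# Config sketch for an 8-core node with 2 GPUs: two identical slots,
# each representing one GPU plus an even share (4) of the CPU cores.
SLOT_TYPE_1 = cpus=4
NUM_SLOTS_TYPE_1 = 2
```

A user-side script could then cap OMP_NUM_THREADS or the MPI rank count at 4 to match the per-slot share.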

Our scheme may not be applicable to other applications, though. A more general
scheme should let a job specify requirements on the number of GPUs and CPU
cores, but that would be more complicated.
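One way such requirements could be expressed today is with a custom machine
attribute that jobs match against (a rough sketch; GPU_COUNT is an illustrative
name, not an attribute Condor defines):

```
# Machine config sketch: advertise how many GPUs this node has.
GPU_COUNT = 2
STARTD_ATTRS = $(STARTD_ATTRS) GPU_COUNT

# Submit file sketch: require a node with at least one GPU.
requirements = (TARGET.GPU_COUNT >= 1)
```

This still does not track how many GPUs are already in use on the node, which is where the concurrency-limit ideas discussed below would help.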



2010/1/8 Matthew Farrellee <matt@xxxxxxxxxx>
On 01/07/2010 12:09 PM, Ian D. Alderman wrote:
> On Jan 7, 2010, at 9:38 AM, Miron Livny wrote:
>> To all GPUers out there,
>> We would be very interested in hearing from you what Condor can do to
>> help you in managing GPU clusters. So far we did not find much we can
>> offer in this space. Any guidance you can provide will be most welcomed.
>> Miron
> Hi,
> We've done work helping customers to set up policies enabling GPU
> scheduling. Our approach has been to set attributes in GPU-specific jobs
> and slot-types, and require that the attribute be set to match with
> GPU-specific slots.  Condor handles the scheduling gracefully given this
> setup.
> A majority of the work relates to policies.  It would be great to get
> information about the presence of the GPU, its model, and utilization,
> but we're not aware of any standard ways to do this between GPU
> vendors/models.  GPU model specific scripts can be created to advertise
> this information in the slot ads using Hawkeye/STARTD_CRON for a
> dedicated cluster.  Condor could help by offering concurrency limits for
> an individual host (e.g. this machine has a GPU_Limit=2 because it has
> only 2 GPUs), or making dynamic slots more configurable.
> Because of the difficulties w/automatic detection and telemetry, using
> pre-created policies seems to work well.
> Cheers,
> -Ian
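The attribute-matching setup Ian describes might look roughly like this (attribute
names such as HAS_GPU and RequiresGPU are illustrative, not taken from the thread):

```
# Machine config sketch: mark GPU slots and only accept GPU jobs there.
HAS_GPU = True
STARTD_ATTRS = $(STARTD_ATTRS) HAS_GPU
START = ($(START)) && (TARGET.RequiresGPU =?= True)

# Job submit file sketch: declare the job needs a GPU slot.
+RequiresGPU = True
requirements = (TARGET.HAS_GPU =?= True)
```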

I've put a lot of thought into how host-specific concurrency limits, along with dynamic slots, could work to manage things like GPU resources. I was hoping to mock up an implementation over the holidays but ended up just relaxing instead. If you're interested in such functionality, let me know and I'll share my thoughts with you.

