
Re: [Condor-users] Condor and GPUs



Hi,

Sorry for the long post, but I have looked at many ways of using SMP
computers with version 7.0.1 of Condor, as that is what we have
(computers with 2 quad-core CPUs). Here are the current limitations
that I found:

1) We need each computer to have a pool of resources. Each job started
on the machine removes the resources it uses from the pool, and puts
them back when it ends. Currently, we must divide the pool of
resources into slots in advance. This is not optimal, as you need to
know in advance what resources the jobs will need in order to make a
good split. This applies to both CPU and memory. Also, some people
(like me) would like to create additional resources like GPU,
licence, ...

One way to implement this for memory is with macros like the
following (something similar is needed for the CPU):

TotalMemoryAvailable=$(TotalMemory)-$(TotalMemoryUsed)
TotalMemoryUsed=slot1_ImageSize+slot2_ImageSize+...

Then in the job's Requirements you add: (TotalMemoryAvailable >= MemoryNeeded)
&& (target.Memory>0)

(target.Memory>0) is currently needed because, if target.Memory is
not present, Condor will add some restrictions of its own.
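As a hedged illustration of the job side, a submit description could
carry the needed amount in a custom attribute. MemoryNeeded is an
invented name here, not a standard Condor attribute, and the units are
KB to match ImageSize:

# Hypothetical submit description; +MemoryNeeded is a made-up
# job attribute (2097152 KB = 2 GB).
universe      = vanilla
executable    = my_job
+MemoryNeeded = 2097152
requirements  = (TotalMemoryAvailable >= MY.MemoryNeeded) && (TARGET.Memory > 0)
queue

The "+" prefix is how condor_submit injects a custom attribute into
the job ClassAd so it can be referenced from expressions.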

I had trouble doing this. The problem is that slotX_ImageSize is not
an up-to-date copy of the value in each slot. In one case, when I ran
"condor_status -l hostname|grep ImageSize", I got:

[for slot1]
ImageSize = 1588508
MonitorSelfImageSize = 30104.000000
slot1_ImageSize = 1588508
slot2_ImageSize = 1588508
[for slot2]
ImageSize = 29500
MonitorSelfImageSize = 30104.000000
slot1_ImageSize = 0
slot2_ImageSize = 29500

If condor_status returns the same values as those seen during
negotiation, this will not work correctly, because the slots do not
all see the same version. If I wait 5 minutes, the values become
correct.

In an ideal world, each SMP machine would have a pool of available
resources. When a job has all the resources it needs, a slot is
created and the job executes there. One way of doing this without too
much modification to Condor is to generate the slots in advance
(e.g., one per CPU) and allow execution in a slot only when the pool
of resources has what the job needs.
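To make that workaround concrete, here is a hedged sketch of what it
might look like for a 2-core machine (assuming ImageSize is in KB and
TotalMemory in MB, and subject to the slotX_ staleness problem I
described above):

NUM_CPUS = 2
# Sum of memory currently used by running jobs, in KB (may be stale)
TotalMemoryUsedKB = (slot1_ImageSize + slot2_ImageSize)
# Only start a job if the shared pool still has enough free memory
START = ($(START)) && \
        (((MY.TotalMemory * 1024) - $(TotalMemoryUsedKB)) >= TARGET.ImageSize)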

2) I had another problem on SMP machines. If I want to add a custom
resource (e.g., I have an SMP machine with 1 GPU card, so only 1 job
should use the GPU at a time), this resource can't be seen through
the slotX_VAR mechanism. For example, if I configure (substitute
GPUJob for IOJob if you prefer)

STARTD_JOBS_EXPRS = IOJob
STARTD_SLOT_EXPRS = IOJob, ImageSize

with:
condor_status -l hostname|grep slot

slotX_ImageSize shows up, but slotX_IOJob does not.

Currently what I do is preallocate some slots to use the IO resource,
but this is not optimal.
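For reference, what I would like to express (if job attributes
propagated into the slotX_ variables) is a hedged sketch like the
following. GPUJob is an invented attribute that the job would set
with "+GPUJob = True" in its submit file; note also that the Condor
manual spells the first macro STARTD_JOB_EXPRS, without the extra S,
which is worth double-checking:

STARTD_JOB_EXPRS = GPUJob
STARTD_SLOT_EXPRS = GPUJob
# Refuse a new GPU job while any slot is already running one;
# =!= handles the UNDEFINED case when a slot has no GPU job.
START = ($(START)) && ( (TARGET.GPUJob =!= True) || \
        ((slot1_GPUJob =!= True) && (slot2_GPUJob =!= True)) )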

These are the two current limitations that I see. The first one is
the priority for me. Alternatively, you could add variables like
pool_TotalMemoryUsed that use the up-to-date values, or you could
hardcode TotalMemoryUsed. If you hardcode it, though, that still
won't solve the issue for custom resources.

I hope this helps.

Frédéric Bastien

On Wed, Jul 2, 2008 at 5:55 AM, Miron Livny <miron@xxxxxxxxxxx> wrote:
> All,
>
> How should we go about collecting the requirements for the most basic
> support we should offer in this space? What I am looking for is input
> that will help us pick the right cost/benefit ratio for an effort to
> support GPUs and/or multi-core nodes.
>
> We are definitely interested in enhancing Condor in this direction.
> Help in deciding what we should do first is most welcome.
>
> Miron
>
>  At 04:13 AM 7/2/2008, Steffen Grunewald wrote:
>>On Wed, Jul 02, 2008 at 09:59:47AM +0100, Mark Calleja wrote:
>> > Hi All,
>> >
>> > Just out of interest, is anyone using Condor in conjunction with
>> > graphical processing units, especially those supporting CUDA? Are there
>> > any plans by the Condor project to support such platforms? In a related
>> > vein, what about Cell processors? Any plans to exploit these via Condor?
>>
>>While these two sound like ambitious projects (and certainly feasible,
>>given enough manpower), some of us would like to see some multi-core
>>support first (which, for me, would include flexible management not only
>>of the multiple cores but also the memory shared by them, threading, etc.).
>>
>>Cells *are* already supported (there's Condor for YellowDog Linux), but
>>AFAICT only for the PPE. SPEs would accept only special code not everyone
>>is able to write now; but in a certain sense that's similar to the
>>thread/multicore issue (what one needs is a manager for those resources).
>>If you just run a single app on the Cell PPE that knows how to check for
>>free SPEs etc you're set...
>>
>>Steffen
>>_______________________________________________
>>Condor-users mailing list
>>To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>>subject: Unsubscribe
>>You can also unsubscribe by visiting
>>https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>>The archives can be found at:
>>https://lists.cs.wisc.edu/archive/condor-users/
>