
Re: [Condor-users] Condor and GPUs



All,

Another question that will help me form a response to your very thoughtful and useful input: what are your thoughts about time limits on execution? In other words, if a job says "I need 4 cores and 2.5 G of memory and 7.3 M of network bandwidth", will it also say "for 2 hours"? If it does, what are you going to do if the job uses more or less than 2 hours?
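
To make the question concrete, I'm imagining a request along these
lines in submit-file terms (all the attribute names here are
illustrative rather than existing Condor ones; memory in MB, runtime
in seconds):

   request_cpus      = 4
   request_memory    = 2500
   +RequestBandwidth = 7.3
   +EstimatedRuntime = 7200

The second half of the question is then what the system should do once
the job runs past, or finishes well short of, those 7200 seconds.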

Thanks,

Miron

 At 04:43 AM 7/4/2008, Mark Calleja wrote:
Our experiences and requirements broadly echo what's already been
discussed in this thread. The need for a form of dynamic multi-core
support was raised at the HTC workshop in Edinburgh last autumn. Ideas
were converging on what was mentioned in Ian Chesal's post in this
thread, i.e. the startd would subtract whatever is currently being used
by a multi-core job and advertise the remainder. Such support would
also allow jobs to be discriminated by the number of cores they use,
e.g. I may want to preempt a running job if another one comes along
that makes use of more cores.
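
As a rough sketch of what I mean (attribute names illustrative rather
than existing ones): a machine with 8 cores, 3 of which are taken by a
running multi-core job, would advertise something like

   TotalCpus = 8
   Cpus      = 5

and a startd rank expression that prefers jobs asking for more cores
could then be as simple as

   RANK = TARGET.RequestedCpus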

One way that our users make use of multi-core machines is to run
parallel/MPI jobs that don't span more than one SMP host.
Performance-wise these are great, but because they run under the
parallel universe they require a dedicated scheduler, which is a
pain in our flocked environment (we currently have 13 pools and
climbing); ideally we'd like this restriction removed for such
single-host jobs.
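
For illustration, the kind of job I mean is no more exotic than this
(mpi_wrapper.sh stands in for whatever script eventually calls mpirun,
and we assume the target machine is configured as a single slot
spanning all its cores):

   universe      = parallel
   executable    = mpi_wrapper.sh
   arguments     = my_mpi_app
   machine_count = 1
   queue

Nothing ever leaves the one host, yet the parallel universe still drags
in the dedicated scheduler requirement.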

On the GPU front, quite a few of our users have caught the bug and would
like Condor to recognise a video card that can be scavenged and to
advertise it suitably ('cuda', 'ati', 'OpenCL', etc.), maybe even the
card type itself. However, we realise that this is quite a new and
fast-moving field, and so tricky for the Condor team to work against.
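
Even a hand-rolled version of this is cheap to do today, which is partly
why it feels within reach; something along these lines in the local
config, with all the attribute names apart from STARTD_ATTRS being our
own invention:

   HAS_CUDA     = True
   GPU_API      = "cuda"
   GPU_TYPE     = "GeForce 8800 GTX"
   STARTD_ATTRS = $(STARTD_ATTRS), HAS_CUDA, GPU_API, GPU_TYPE

Jobs can then require HAS_CUDA in their requirements expressions; what
we'd really like is for Condor to discover and advertise this itself.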

Just my/our two cents worth...

Mark

Steffen Grunewald wrote:
> On Thu, Jul 03, 2008 at 11:41:50AM -0400, Ian Chesal wrote:
>
>>> Exactly. Pre-defined static slots have to be replaced by
>>> something that's aware of the whole picture (sees the
>>> *machine* that hosts the slots which can be dynamically
>>> reconfigured, or even created on demand).
>>>
>> I'm at the point where this is almost becoming a need, not a want. We're
>> parallelizing code left, right and center to take advantage of multi-core
>> CPUs, and our admins are flipping configurations on an almost daily basis
>> to load-balance parallel vs. serial jobs in our pools.
>>
>
> IMHO multi-core support, and the possible addition of other resources
> to manage, is the most pressing need these days (and Intel's recent
> announcement cited at
> http://www.c0t0d0s0.org/archives/4571-Intel-finally-admits-it....html
> seems to give us only a couple of years).
> GPU support will automatically show up when the set of resources is
> extended, as will SPE support for the Cell.
>
>
>>> Of course, with dynamic slot creation, another problem comes along:
>>> If a machine is already partially taken, how to define a
>>> ranking among machines to allow for maximum flexibility in
>>> the future?
>>>
>> Prior to using Condor our home-grown solution allowed for dynamic
>> machine/slot allocations, *but* we handled the scenario you described by
>> simplifying things down to only a handful of constraints per job: OS,
>> number of CPUs required, and memory. We always negotiated for the
>> biggest-CPU, biggest-memory jobs in the queue first, taking the approach
>> that you fill the jar with rocks, then pebbles, then sand.
>>
>
> Future Condor development/extension should be aware of multi-threaded
> applications (requiring multi-core slots!), i.e. a resource "thread count".
> Currently I've got a user who is running 2-thread apps and forcing his
> way onto the corresponding nodes by faking a huge memory requirement.
> That's ugly, but at the moment it's the only way to get a single-slot
> machine with 2 CPU cores.
> Needless to say, the CPUs are wasted when he's not around.
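>
> (In submit-file terms the trick is roughly
>
>    requirements = (Memory >= 3500) && (TotalCpus >= 2)
>
> where the memory bound is picked not because the app needs it, but
> because only the big single-slot dual-core nodes can satisfy it.)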
>
> What Carsten suggested in a previous mail: have "network bandwidth" as
> a resource, too. In particular with multi-core (multi >> 2) machines,
> this cannot be neglected anymore.
>
> On the other hand, MPI applications would profit from locality, even
> in a NUMA setup (as multiple Opterons would provide: their HT links
> are faster than the standard network which is attached using another
> HT anyway).
> Other apps would prefer to be matched to different machines as long as
> this is possible (to spread the generated heat and reduce silicon wear).
>
>
>> I certainly don't envy the Condor Team -- I know Derek has talked about
>> adaptive machine setups but how it'd work in the face of all those
>> constraints I can't imagine. Maybe it'd make a good thesis? Who ever
>> does get this into Condor is my hero though. :)
>>
>
> :-) Indeed, sounds like a major transition. The "vm -> slot" one was
> nothing compared with that, and it took quite a while...
>
>
>>> IMHO all boils down to dynamic slot definition. Something
>>> that would no longer happen on the execute node but on the master ...
>>>
>> Interesting. So the startds would tell the collector what they have in
>> total, and the negotiator would read this, assign a job, subtract what
>> the job estimates it will use (or what it says it wants), and update the
>> ad in the collector for the machine. Sort of a "best guess" ad. And then
>> the startd can correct anything the negotiator got wrong at a later
>> point in time. Interesting...
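>>
>> For instance (numbers and attribute names purely illustrative): a
>> machine advertises
>>
>>    Cpus = 8, Memory = 16000
>>
>> the negotiator matches a job asking for 4 cores and 6000 MB of memory,
>> and re-advertises the remainder as
>>
>>    Cpus = 4, Memory = 10000
>>
>> until the startd's next update confirms or corrects it.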
>>
>
> Actually, since there would no longer be a fixed number of slots, there
> would probably be only one startd per machine.
> But basically you're right. Currently, "only" the shadow gets notified of
> changes in the memory footprint; in the future, the whole of Condor should
> know about it.
> This undoubtedly requires users to be honest about their actual requirements,
> and (to keep them honest) mechanisms to track them.
>
>
>> Count me among the Condor users who really, really need dynamic machine
>> slots. Multi-core machines and parallel software are the future in the
>> EDA industry.
>>
>
> All available hands raised here.
>
> Cheers,
>  Steffen
>
>

--
Cambridge eScience Centre, University of Cambridge
Centre for Mathematical Sciences, Wilberforce Road, Cambridge CB3 0WA
Tel. (+44/0) 1223 765317, Fax  (+44/0) 1223 765900
http://www.escience.cam.ac.uk/~mcal00
