
Re: [Condor-users] Dynamic Slots & Parallel Universe



This note is to offer more support for these enhancements. Back in Oct 2009 I asked a question about nodal affinity when provisioning slots. This has become important for some of the research on our campus involving molecular dynamics. Other batch scheduling subsystems can schedule MPI-based processes onto cores (slots) on the same node, avoiding the high inter-node latency penalties, especially on clusters not fortunate enough to have InfiniBand interconnects.

Here is a pointer to my original note: https://www-auth.cs.wisc.edu/lists/condor-users/2009-October/msg00134.shtml

--Brandon

Erik Erlandson wrote:
Hi David,

I do not know how it will be prioritized relative to all the other
development in the queue.  It's a relatively significant change to the
dedicated scheduler, so I know the UW team expects to do a thorough
review and testing before approving it for inclusion.

Some other users are also interested in this enhancement, so I will make
sure it doesn't fall off the radar.

-Erik


On Tue, 2010-08-31 at 10:04 -0500, David J. Herzfeld wrote:
  
Hi Erik:

Thanks for the response. From the remarks in the ticket, this looks to
be exactly what we want for #3! Is there any estimate of when this will
be incorporated into the stable release?

This is exciting.

David

On 08/31/2010 09:42 AM, Erik Erlandson wrote:
    
Regarding dynamic slots and parallel universe:  The dedicated scheduler
(used by PU jobs) does not currently handle dynamic slots correctly.   A
patch to correct this has been submitted and is pending review:

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=986,0
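
(For background, PU jobs are matched by the dedicated scheduler, which
execute nodes opt into with startd configuration along these lines; the
scheduler hostname below is just a placeholder:)

  # condor_config on execute nodes that should run parallel-universe jobs
  DedicatedScheduler = "DedicatedScheduler@submit.example.edu"
  STARTD_ATTRS       = $(STARTD_ATTRS), DedicatedScheduler
  RANK               = Scheduler =?= $(DedicatedScheduler)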


-Erik



On Tue, 2010-08-31 at 08:56 -0500, David J. Herzfeld wrote:
      
Hi All:

We are currently working with a 1024-core cluster (8 cores per machine)
using a pretty standard Condor config. Each core shows up as a single
slot, etc.

Users are starting to run multi-process jobs on the cluster, leading to
over-scheduling. One way to combat this problem is the "whole machine"
configuration presented on the Wiki at
<https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=WholeMachineSlots>.
However, most of our users don't require the full machine (combinations
of 2, 3, 4, 5... cores). We could modify this config to supply slots for
half a machine, etc.
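
For example (sketch only, not tested), the static half-machine split on
our 8-core nodes would be roughly:

  # Option A: two fixed 4-core slots per machine
  SLOT_TYPE_1      = cpus=4, ram=50%, swap=50%, disk=50%
  NUM_SLOTS_TYPE_1 = 2

whereas the dynamic-slot approach asked about in question 1 below would
replace that with a single partitionable slot:

  # Option B (instead of A): one partitionable slot spanning the machine
  SLOT_TYPE_1               = cpus=100%, ram=100%, swap=100%, disk=100%
  SLOT_TYPE_1_PARTITIONABLE = TRUE
  NUM_SLOTS_TYPE_1          = 1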

So a couple of questions:
1) Does this seem like a job for dynamic slots, or should we modify the
"whole machine" config?

2) If dynamic slots are the way to go, have they proven stable in
production environments?

3) Can we combine dynamic slot allocations with the Parallel Universe
to provide PBS-like allocations? Something like
machine_count = 4
request_cpus = 8

to match 4 machines with 8 CPUs apiece? Similar to
#PBS -l nodes=4:ppn=8
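
For concreteness, the full submit description we have in mind would look
something like this sketch (the executable and arguments are placeholders
for an MPI wrapper script and our application):

  universe      = parallel
  executable    = mpi_wrapper.sh     # placeholder script that calls mpirun
  arguments     = our_md_application
  machine_count = 4
  request_cpus  = 8
  log           = md_run.log
  queue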

As always - thanks a lot!
David


_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/
  

-- 
Brandon Leeds

Lehigh University
Sr. Computing Consultant, LTS
High Performance Computing

Phone: (610) 758-4805