
Re: [HTCondor-users] K8s usage in the HTCondor community



There may be a 4th use case, where I run VMs under the control of
HTCondor, which provides accounting of resources and statistics. The
VMs may or may not be using 100% of the CPU, but they all have
extra-long instantiation times.
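
For concreteness, a submission along these lines is roughly what I
mean (an untested sketch using the htcondor Python bindings and the
"vm" universe; the image name, sizes, and the vm_disk field layout are
placeholders, so check your HTCondor version's docs for exact syntax):

    import htcondor

    # Sketch: run a KVM image as an HTCondor "vm" universe job, so the
    # pool does the resource accounting and statistics for the VM.
    submit = htcondor.Submit({
        "universe": "vm",
        "vm_type": "kvm",
        "vm_memory": "4096",                   # MB for the guest
        "vm_disk": "image.qcow2:vda:w:qcow2",  # image:device:perm:format
        "should_transfer_files": "YES",
        "transfer_input_files": "image.qcow2",
        "log": "vm.$(Cluster).log",
    })

    schedd = htcondor.Schedd()
    print(schedd.submit(submit).cluster())     # job id; accounting follows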

Valerio

On Mon, 2023-11-06 at 08:59 +0100, Jeff Templon wrote:
> Hi Matt, Adam et al,
> 
> It's been a while since I had contact with the cloud community to see
> how things have moved on, so take my input with a pinch of
> skepticism.
> 
> The three big differences between clouds and clusters that influence
> what kind of questions you'd be asking your scheduler:
> 
> 1. Clouds = long instantiation times; clusters = "jobs" that have
> some finite run time after which they are "done"
> 2. Clouds = over-provisioned; clusters = completely full
> 3. Clouds = jobs that are often doing nothing (web servers);
> clusters = jobs using 100% CPU
> 
> Of course these are generalisations, so please refrain from educating
> me about exceptions and edge cases; if one of the statements above is
> wrong at the 70% level, I certainly would like to hear about it.
> 
> My colleague Oxana Smirnova once said during a talk, "Scheduling is
> only interesting when the system is full" - a brilliant one-liner.
> Whenever point 2 above holds, you immediately have a huge reason why
> cloud "scheduling" is way different from cluster scheduling.
> 
> HTH
> 
> JT
> 
> 
> > On 5 Nov 2023, at 09:13, Bockelman, Brian <BBockelman@xxxxxxxxxxxxx> wrote:
> > 
> > Hello Matt, Adam,
> > 
> > We definitely use Kubernetes these days!  For the PATh project
> > (https://path-cc.io/), nearly all of our central services live
> > inside Kubernetes.
> > 
> > A few use cases I've seen that mix Kubernetes and HTCondor:
> > 1.  Running the HTCondor central manager inside Kubernetes.  It's a
> > simple, relatively static service - perhaps no interesting items
> > there.
> >   - You asked about stateless: we often forget, but there is state
> > in the central manager!  It's just fairly minimal.
> > 2.  Running pods as backfill.  Put an HTCondor EP (execution point,
> > aka worker node) inside a container and run it as a pod as part of
> > a larger deployment.  When there are higher-priority pods to
> > execute, the HTCondor EP is preempted by Kubernetes.  Again, a
> > pretty simple scheduling case (see the first sketch after this
> > list).
> > 3.  Auto-scaling HTCondor EPs when there is work to be done (see
> > https://github.com/opensciencegrid/htcondor-autoscale-manager).
> > This is done on the "PATh Facility" so the hosts can be used when
> > otherwise idle.  A Prometheus metric determines how many additional
> > pods are needed, allowing the HPA to do its job (see the second
> > sketch after this list).
> >   - This relies on the HTCondor "rooster" mechanism, where the
> > negotiator can annotate a ClassAd representing an offline slot as
> > having matching jobs.  This is taken into account in a Prometheus
> > metric, triggering the HPA scale-up.
> >   - Feedback: the scale-down mechanism of the HPA leaves quite a
> > bit to be desired.  The EP knows when it is idle, making it a great
> > target for scale-down, or for scaling down preemptively.  We solve
> > this in the htcondor-autoscale-manager by annotating the pod with a
> > preempt priority; however, it feels quite brittle to me.
> > 4.  The NRP team has a really cool project where they submit
> > HTCondor EPs as Kubernetes Jobs.  When they're idle, the Jobs
> > finish, solving the scale-down issue nicely (though there's more
> > work in doing the scale-up!).  The third sketch after this list
> > shows the idea.
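> >
> > For use case 2, the mechanics are plain Kubernetes: give the EP pod
> > a low PriorityClass and let the scheduler preempt it.  A minimal
> > sketch with the kubernetes Python client (the image, namespace, and
> > class name are placeholders, not what we actually run):
> >
> >     from kubernetes import client, config
> >
> >     config.load_kube_config()
> >
> >     # Backfill EP pod: a low-priority PriorityClass means Kubernetes
> >     # preempts it whenever higher-priority pods need the node.  A
> >     # generous grace period lets the startd drain running jobs first.
> >     pod = client.V1Pod(
> >         metadata=client.V1ObjectMeta(name="htcondor-ep-backfill"),
> >         spec=client.V1PodSpec(
> >             priority_class_name="backfill-low",   # assumed class
> >             termination_grace_period_seconds=300,
> >             restart_policy="Never",
> >             containers=[client.V1Container(
> >                 name="ep",
> >                 image="example.org/htcondor-ep:latest",  # placeholder
> >             )],
> >         ),
> >     )
> >     client.CoreV1Api().create_namespaced_pod(namespace="htcondor",
> >                                              body=pod)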
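> >
> > For use case 3, the gist of the metric side looks something like
> > the sketch below (untested; the exact ClassAd attributes set via
> > the rooster mechanism, and the constraint, are from memory and
> > should be double-checked against your pool):
> >
> >     import time
> >
> >     import htcondor
> >     from prometheus_client import Gauge, start_http_server
> >
> >     # Gauge the HPA scales on (exposed via an external-metrics
> >     # adapter): how many offline slot ads has the negotiator
> >     # marked as having matching idle jobs?
> >     WANTED = Gauge("htcondor_ep_pods_wanted",
> >                    "Offline EP slots with matching idle jobs")
> >
> >     def poll(coll):
> >         ads = coll.query(
> >             htcondor.AdTypes.Startd,
> >             # Offline ads that matched a job get MachineLastMatchTime.
> >             constraint="Offline =?= true && "
> >                        "MachineLastMatchTime =!= undefined",
> >             projection=["Machine"])
> >         WANTED.set(len(ads))
> >
> >     if __name__ == "__main__":
> >         start_http_server(9100)       # Prometheus scrape endpoint
> >         coll = htcondor.Collector()
> >         while True:
> >             poll(coll)
> >             time.sleep(30)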
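> >
> > And for use case 4, the trick is that a Job completes when its pod
> > exits, so an EP that shuts itself down when idle (e.g. via
> > HTCondor's STARTD_NOCLAIM_SHUTDOWN knob, baked into the image's
> > config) takes the capacity away with it.  Again a hedged sketch,
> > with placeholder names:
> >
> >     from kubernetes import client, config
> >
> >     config.load_kube_config()
> >
> >     # EP-as-a-Job: the container's HTCondor config is assumed to
> >     # set STARTD_NOCLAIM_SHUTDOWN, so the startd exits after a
> >     # stretch of idleness, the pod terminates, and the Job completes.
> >     job = client.V1Job(
> >         metadata=client.V1ObjectMeta(generate_name="htcondor-ep-"),
> >         spec=client.V1JobSpec(
> >             backoff_limit=0,
> >             template=client.V1PodTemplateSpec(
> >                 spec=client.V1PodSpec(
> >                     restart_policy="Never",
> >                     containers=[client.V1Container(
> >                         name="ep",
> >                         image="example.org/htcondor-ep:latest",
> >                     )],
> >                 ),
> >             ),
> >         ),
> >     )
> >     client.BatchV1Api().create_namespaced_job(namespace="htcondor",
> >                                               body=job)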
> > 
> > For scheduling in general, I think an interesting difference is
> > the focus on multi-tenant scheduling in the face of scarcity; for
> > example, if the cluster is fixed-size and always oversubscribed,
> > how do you make resource allocation decisions?
> > 
> > Hope this helps,
> > 
> > Brian
> > 
> > PS -- I don't think of there as being friction between the "cloud"
> > and "batch" views of scheduling, but rather a wonderful diversity
> > of approaches and design priorities!
> > 
> > > On Nov 3, 2023, at 4:51 PM, Matthew T West via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
> > > 
> > > Good Friday afternoon,
> > > 
> > > Because I like introducing CNCF folks to this community: Adam
> > > McArthur is an employee on G-Research's OSS team
> > > <https://opensource.gresearch.com/>. He is trying to understand
> > > how projects using HTCondor, amongst other traditional batch
> > > schedulers, leverage Kubernetes to deploy containers/pods for
> > > either compute hosts or services.
> > > 
> > > Of particular interest is whether k8s is still being used in its
> > > traditional stateless manner, and if not, why?
> > > 
> > > From my interactions with folks in the CNCF Batch (compute)
> > > Working Group, there seems to be some friction between how cloudy
> > > folks envision "scheduling" and how we view it. Each side seems
> > > skeptical of the other's design philosophy, and there is a bit of
> > > cross-talk going on.
> > > 
> > > IIRC, the PATh Facility uses Kubernetes to manage/deploy its
> > > local compute resources, correct? If anyone else uses k8s for
> > > container deployment of HTCondor daemons or for other production
> > > services, we'd love to hear more about it.
> > > 
> > > Cheers,
> > > Matt
> > > 
> > > P.S. - Any faults in the descriptions of either k8s or HTCondor
> > > deployments are purely my own.
> > > 
> > > -- 
> > > Matthew T. West
> > > DevOps & HPC SysAdmin
> > > University of Exeter, Research IT
> > > www.exeter.ac.uk/research/researchcomputing/support/researchit
> > > 57 Laver Building, North Park Road, Exeter, EX4 4QE, United
> > > Kingdom
> > > 
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
> with a subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/