
Re: [HTCondor-users] K8s usage in the HTCondor community



Hello Matt, Adam,

We definitely use Kubernetes these days!  For the PATh project (https://path-cc.io/), nearly all of our central services live inside Kubernetes.

A few use cases I've seen that mix Kubernetes and HTCondor:
1.  Running the HTCondor central manager inside Kubernetes.  It's a simple, relatively static service - perhaps nothing too interesting there.
   - You asked about stateless: we often forget, but there is state in the central manager (for example, the accounting data behind user priorities)!  It's just fairly minimal.
2.  Running pods as backfill.  Put an HTCondor EP (execution point, aka worker node) inside a container and run it as a pod as part of a larger deployment.  When there are higher-priority pods to execute, the HTCondor EP is preempted by Kubernetes.  Again, a pretty simple scheduling case (a PriorityClass sketch follows this list).
3.  Auto-scaling HTCondor EPs when there is work to be done (see https://github.com/opensciencegrid/htcondor-autoscale-manager).  This is done on the "PATh Facility" so the hosts can be used when otherwise idle.  A Prometheus metric determines how many additional pods are needed, allowing the Horizontal Pod Autoscaler (HPA) to do its job.
   - This relies on the HTCondor "rooster" mechanism, where the negotiator can annotate a ClassAd representing an offline slot as having matching jobs.  A Prometheus metric takes this into account, triggering the HPA scale-up (a metric-exporter sketch follows the list).
   - Feedback: the scale-down mechanism of the HPA leaves quite a bit to be desired.  The EP knows when it is idle, which makes it the natural party to request, or at least prioritize, its own scale-down.  We work around this in the htcondor-autoscale-manager by annotating the pod with a preemption priority; however, it feels quite brittle to me (see the annotation sketch after the list).
4.  The NRP team has a really cool project where they submit HTCondor EPs as Kubernetes Jobs.  When they're idle, the Jobs finish, solving the scale-down issue nicely (though there's more work in doing the scale-up!).  A sketch of the Job submission follows the list as well.
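
For the backfill case, the main Kubernetes ingredient is a low-value PriorityClass on the EP pods, so that anything else preempts them.  Here's a rough sketch using the kubernetes Python client -- the class name, value, and image are made up, not what any particular deployment actually uses:

    #!/usr/bin/env python3
    # Sketch: give backfill EP pods a low scheduling priority so that any
    # higher-priority pod preempts them.  Names and values are illustrative.
    from kubernetes import client, config

    config.load_kube_config()

    # A PriorityClass well below the default (0) marks the pods as backfill.
    client.SchedulingV1Api().create_priority_class(
        client.V1PriorityClass(
            api_version="scheduling.k8s.io/v1",
            kind="PriorityClass",
            metadata=client.V1ObjectMeta(name="htcondor-backfill"),
            value=-1000,
            global_default=False,
            description="Preemptible HTCondor EP backfill pods",
        )
    )

    # The EP pod template then just references the class:
    pod_spec = client.V1PodSpec(
        priority_class_name="htcondor-backfill",
        containers=[
            client.V1Container(name="ep", image="example.org/htcondor-ep:latest")
        ],
    )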
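
The metric side of the auto-scaling is conceptually simple: count how many offline slots the negotiator has matched jobs to, and publish that number where Prometheus can scrape it.  A minimal sketch, assuming the htcondor Python bindings and prometheus_client -- the metric name and the exact rooster attribute (MachineLastMatchTime on offline ads) are illustrative, so check the autoscale-manager repo for the real thing:

    #!/usr/bin/env python3
    # Sketch: export a "pods wanted" gauge for the HPA, derived from the pool.
    import time

    import htcondor
    from prometheus_client import Gauge, start_http_server

    PODS_WANTED = Gauge(
        "htcondor_ep_pods_wanted",
        "Number of additional EP pods the pool could currently use",
    )

    def pods_wanted() -> int:
        coll = htcondor.Collector()  # defaults to the configured pool
        # Offline slot ads that the negotiator flagged as having matching
        # jobs; condor_rooster keys off the same attribute.
        ads = coll.query(
            htcondor.AdTypes.Startd,
            constraint="Offline =?= True && MachineLastMatchTime =!= undefined",
            projection=["Name"],
        )
        return len(ads)

    if __name__ == "__main__":
        start_http_server(9100)  # scraped by Prometheus, which feeds the HPA
        while True:
            PODS_WANTED.set(pods_wanted())
            time.sleep(30)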
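
The scale-down workaround boils down to "the idle EP marks its own pod as the cheapest one to remove."  A sketch with the kubernetes Python client, using the standard pod-deletion-cost annotation that the ReplicaSet controller consults on scale-down -- whether that is exactly the annotation the autoscale-manager sets is an assumption on my part, as is POD_NAMESPACE arriving via the downward API:

    #!/usr/bin/env python3
    # Sketch: an idle EP marking its own pod as the preferred scale-down victim.
    import os

    from kubernetes import client, config

    def mark_self_idle() -> None:
        config.load_incluster_config()  # we're running inside the pod
        patch = {
            "metadata": {
                # Lower cost = removed first when the HPA scales the set down.
                "annotations": {
                    "controller.kubernetes.io/pod-deletion-cost": "-100"
                }
            }
        }
        client.CoreV1Api().patch_namespaced_pod(
            name=os.environ["HOSTNAME"],  # the pod name, by default
            namespace=os.environ.get("POD_NAMESPACE", "htcondor"),  # assumed via downward API
            body=patch,
        )

    if __name__ == "__main__":
        mark_self_idle()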
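
And the Jobs-based approach looks roughly like the following.  The image, central manager hostname, and namespace are placeholders; the one real knob is STARTD_NOCLAIM_SHUTDOWN, which tells the startd to exit after sitting unclaimed for the given number of seconds, which in turn completes the Job:

    #!/usr/bin/env python3
    # Sketch: run an HTCondor EP as a Kubernetes Job that finishes once the
    # EP has been idle long enough.  Image and namespace are placeholders.
    from kubernetes import client, config

    config.load_kube_config()

    ep = client.V1Container(
        name="ep",
        image="example.org/htcondor-ep:latest",  # placeholder image
        env=[
            # _CONDOR_* environment variables override HTCondor config;
            # here: shut the startd down after 10 minutes without a claim.
            client.V1EnvVar(name="_CONDOR_STARTD_NOCLAIM_SHUTDOWN", value="600"),
            client.V1EnvVar(name="_CONDOR_CONDOR_HOST", value="cm.example.org"),
        ],
    )

    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(generate_name="htcondor-ep-"),
        spec=client.V1JobSpec(
            backoff_limit=0,  # an exiting EP means "done", not "retry"
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(restart_policy="Never", containers=[ep])
            ),
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="htcondor", body=job)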

For scheduling in general, I think an interesting difference is the focus on multi-tenant scheduling in the face of scarcity; for example, if the cluster is fixed-size and always oversubscribed, how do you make resource allocation decisions?

Hope this helps,

Brian

PS -- I don't think of there as being friction between the "cloud" and "batch" views of scheduling, but rather a wonderful diversity of approaches and design priorities!

> On Nov 3, 2023, at 4:51 PM, Matthew T West via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
> 
> Good Friday afternoon,
> 
> Because I like introducing CNCF folks to this community, Adam McArthur is an employee in G-Research's OSS team <https://opensource.gresearch.com/>. He is trying to understand how projects using HTCondor, amongst other traditional batch schedulers, leverage Kubernetes to deploy containers/pods for either compute hosts or services.
> 
> Of particular interest is whether k8s is still being used in its traditional stateless manner and if not, why?
> 
> From my interactions with folks in the CNCF Batch (compute) Working Group, there seems to be some friction between how cloudy folks envision "scheduling" and how we view it. Each side seems skeptical of the other's design philosophy, and there is a bit of cross-talk going on.
> 
> IIRC, the PATh Facility uses Kubernetes to manage/deploy its local compute resources, correct? If anyone else uses k8s for container deployment of HTCondor daemons or for other production services, we'd love to hear more about it.
> 
> Cheers,
> Matt
> 
> P.S. - Any faults in the descriptions of either k8s or htcondor deployments are purely my own.
> 
> -- 
> Matthew T. West
> DevOps & HPC SysAdmin
> University of Exeter, Research IT
> www.exeter.ac.uk/research/researchcomputing/support/researchit
> 57 Laver Building, North Park Road, Exeter, EX4 4QE, United Kingdom
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/