
Re: [HTCondor-users] Backfill on an OpenStack system



Hi Matt,

Sorry, I only just came across this thread.
Let me add one more option to the approaches others have already described:

At the University of Victoria we developed Cloudscheduler for exactly the reason you mention. It watches an HTCondor instance and, if there are jobs in the queue, determines their resource requests and starts VMs with enough resources on a cloud; when there are no more jobs, the VMs are terminated. Jobs see a normal batch system, and the worker nodes are started on demand by Cloudscheduler. Opportunistic usage is also possible: when VMs are started outside of Cloudscheduler and the total core usage in the cloud project rises above a configured limit, Cloudscheduler automatically retires its VMs (it lets running jobs finish but does not allow HTCondor to start new ones) and terminates each VM once no jobs are running on it. Cloudscheduler is fully accessible via a web interface as well as a CLI.
For reference:
https://link.springer.com/epdf/10.1007/s41781-020-0036-1
(Cloudscheduler is fully supported and we are still developing new features, so some information in the publication may have changed since then.) It is open source, and we would be happy to assist anyone in setting up their own instance:
https://github.com/hep-gc/cloudscheduler
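
To make the behaviour described above a bit more concrete, here is a rough Python sketch of one scheduling pass. It is not Cloudscheduler's actual code or API: the names Job, VM, plan, flavors and core_limit are all made up for illustration, and it assumes that only VMs started by the scheduler are retired and that VM sizes are given as a simple name -> (cores, RAM) table.

```python
from dataclasses import dataclass


@dataclass
class Job:
    cores: int      # cores requested by an idle job
    ram_mb: int     # memory requested by an idle job


@dataclass
class VM:
    name: str
    cores: int
    managed: bool            # True if this sketch's scheduler booted the VM
    running_jobs: int = 0
    retiring: bool = False   # draining: may finish jobs, accepts no new ones


def plan(idle_jobs, vms, flavors, core_limit):
    """One pass: return (flavors_to_boot, vms_to_retire, vms_to_terminate)."""
    managed = [vm for vm in vms if vm.managed]

    # Terminate managed VMs with nothing left to do: drained retiring VMs,
    # and idle VMs once the queue is empty.
    terminate = [vm for vm in managed
                 if vm.running_jobs == 0 and (vm.retiring or not idle_jobs)]
    remaining = [vm for vm in vms if vm not in terminate]
    used = sum(vm.cores for vm in remaining)  # includes VMs started by others

    retire = []
    if used > core_limit:
        # Over the project limit: retire managed VMs until the excess is
        # covered, counting VMs that are already draining.
        already = sum(vm.cores for vm in remaining if vm.managed and vm.retiring)
        excess = used - core_limit - already
        for vm in remaining:
            if excess <= 0:
                break
            if vm.managed and not vm.retiring:
                retire.append(vm)
                excess -= vm.cores
        return [], retire, terminate

    boot = []
    for job in idle_jobs:
        # Smallest flavor (by cores, then RAM) that satisfies the request.
        fitting = [(c, r, n) for n, (c, r) in flavors.items()
                   if c >= job.cores and r >= job.ram_mb]
        if not fitting:
            continue
        cores, _, name = min(fitting)
        if used + cores <= core_limit:
            boot.append(name)
            used += cores
    return boot, retire, terminate


if __name__ == "__main__":
    flavors = {"small": (2, 4000), "large": (8, 32000)}  # hypothetical sizes
    queue = [Job(cores=1, ram_mb=2000), Job(cores=4, ram_mb=16000)]
    print(plan(queue, [], flavors, core_limit=16))
```

In a real setup the job list would come from the HTCondor queue and the VM list from the cloud API; both are left out here on purpose.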

We have been running this successfully for many years on OpenStack systems around the world and also on commercial clouds like Amazon. The jobs we run are mostly for physics experiments like ATLAS, Belle II, and DUNE, and we also run the Cloudscheduler instance as a service for others, who then only need to provide their own HTCondor instance, if wanted.

Cheers,
  Marcus

On Sat, 4 Sep 2021, West, Matthew wrote:

Hi All,

Here at Exeter, IT is setting up an OpenStack system to support researchers who want DRAM-heavy, bespoke, workstation-like environments. Because I don't expect the system to be full up with active users 24/7, I am wondering what the optimal way is to set up an HTCondor pool on it to run jobs as backfill. Would this be similar to how you would do it for any other spare resources: have a VM start up on a node and announce itself to the collector daemon as an available worker if the idle conditions of the machine are met?

It reminds me of the method to expand one's resources into corporate cloud servers but I am not sure what tools are useful in this case.

Cheers,
Matt