
Re: [HTCondor-users] metascheduler anybody ?

Hi Christoph,

On 10 Jun 2022, at 10:14, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:


Every now and then we think about installing a more universal meta scheduler in order to give users a more generic, formal entry point to the different computing pools we provide (HTC/HPC, condor/slurm, local pool/remote pool etc.).

Is anybody out there planning to use such a setup, or does anybody already have a prototype or a running implementation of something similar?

There was a long discussion between various people at Nikhef and SURF, and members of the HTCondor team (among them Miron, Todd T, and Greg Thain) in 2020/2021 where this came up. The idea was to make a prototype, and the experience was the same as for many things: not enough people to work on all the interesting possibilities out there.

Here is an excerpt from the discussion (this part written by me):

And on the topic of rethinking policies, this whole discussion brings back something that we've parked for some time, which is what do we do in the cloud era? It seems like the default "cloud scheduling" is just to give people VMs and hope they give it up someday. I don't know what the current thinking is inside the experiments, but a few years ago, this is what they were hoping for: that they would just "get" VMs that they could keep indefinitely. This circles back around to the discussion here:

  • we don't mind giving a group like ATLAS VMs that they could keep forever, inside their allocation. We promised those cycles to them, how they use them is their own business
  • when we have leftover cycles we don't mind if groups like ATLAS use them; please do! But now the story is different: we want the lease to be time limited, in case the customer to whom we've promised those cycles shows up. Time limitation (you can have this slot for 30 hours) is the way we know here; another solution would be pre-emption or a sort of checkpointing (snapshot the VM and stop it, possibly to restart later). We'd need a way to make it clear to the experiment that these kinds of VMs are not the same as the "allocated VMs".
Does Condor have any mechanism to be a scheduler of VMs? Once we get all this policy business sorted out, it would be a shame to have to re-invent it in another tool for the cloud side of the world. Something to ponder and maybe explore.

Seems to me to be a useful thing, the thing you are proposing. The big question is whether you want to go for reduced scope and build a system tailored to DESY, or whether you want to build something capable of implementing various policies. The site-to-site differences in scheduling are enormous, at least for those sites that support many (I'd say "many" means greater than four) active groups, each with their own allocation.
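
To make the two kinds of VM lease from the excerpt concrete, here is a minimal sketch in Python. It is not HTCondor code and not anything that came out of the prototype discussion; the names Group, Lease and grant_vm are hypothetical, and the 30-hour limit is just the example figure from the excerpt. It only shows the decision: inside the allocation the lease has no time limit, beyond it the lease is time limited.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

# Assumption: the "30 hours" example from the excerpt, not a real site setting.
OPPORTUNISTIC_LEASE = timedelta(hours=30)


@dataclass
class Group:
    name: str
    allocation: int  # slots promised to the group
    in_use: int = 0  # slots the group currently holds


@dataclass
class Lease:
    group: str
    kind: str                          # "allocated" or "opportunistic"
    max_lifetime: Optional[timedelta]  # None means no time limit


def grant_vm(group: Group, free_slots: int) -> Optional[Lease]:
    """Decide what kind of lease a single VM request gets (hypothetical policy)."""
    if free_slots <= 0:
        return None  # nothing to hand out
    if group.in_use < group.allocation:
        # Inside the allocation: promised cycles, keep the VM as long as you like.
        group.in_use += 1
        return Lease(group.name, "allocated", None)
    # Leftover cycles: usable, but the lease expires so the slot can be
    # handed back if the group it was promised to shows up.
    group.in_use += 1
    return Lease(group.name, "opportunistic", OPPORTUNISTIC_LEASE)
```

A site with many active groups would need far more than this (fair share, pre-emption, checkpointing), which is exactly where the site-to-site differences come in.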