Re: [HTCondor-users] HTCondor within Slurm?

On 7/13/19 10:03 AM, Steffen Grunewald wrote:
Hello all,

I've been asked to install HTCondor on a HPC cluster running Slurm.
While this sounds crazy to me, I might just be ignorant, so I'd like
to ask here before denying the request - has it been done somewhere
else, for whichever reason, and if you did it, would you like to
share your insights?


We don't think this is crazy at all. The fundamental idea of High Throughput Computing is to be able to use as many machines as possible, whether they are dedicated to the purpose, sometimes-idle machines you can "borrow" from someone else, cloud machines you can rent for money, or others. Several sites, including here at the UW, backfill Slurm clusters with jobs from HTCondor systems.

There are two ways to do this. The first involves running an HTCondor worker node setup on the Slurm cluster's worker nodes, but only activating it when Slurm tells us the node is idle. The Slurm prologue and epilogue hooks are helpful here. Example scripts for PBS, which work pretty much the same way with Slurm, are available on our wiki site here: https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToScavengeCycles

The advantage of this approach is that it is easy to set up and easy to debug from the HTCondor side. The disadvantage is that Slurm doesn't know about these jobs, so it cannot account for them or make scheduling decisions about them. As in any federated system, the jobs need to be prepared to run in a "foreign" environment, with perhaps a different Linux distro, different locally installed software, etc. Generally, we configure the start expressions on these machines so that users have to opt in to using them, to minimize surprises.
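
To make the prologue/epilogue idea concrete, here is a rough sketch in Python of what such a hook could do. It is not the script from the wiki page, and the names HOOK_COMMANDS, command_for and run_hook are made up for illustration. It assumes the HTCondor master is already running on the node, that condor_on and condor_off are on the PATH, and that the "-fast" and "-daemon startd" options suit your site; please check the flags against your HTCondor version before relying on them.

```python
import subprocess

# Hypothetical mapping from Slurm hook type to an HTCondor command.
# Assumption: the condor_master is already running on the node and only
# the startd is switched on and off.
HOOK_COMMANDS = {
    # A Slurm job is about to start: stop the startd quickly so the
    # node is free for the Slurm job.
    "prolog": ["condor_off", "-fast", "-daemon", "startd"],
    # The Slurm job has finished: let HTCondor use the idle node again.
    "epilog": ["condor_on", "-daemon", "startd"],
}


def command_for(hook):
    """Return a copy of the command list for a hook name, or raise ValueError."""
    try:
        return list(HOOK_COMMANDS[hook])
    except KeyError:
        raise ValueError(f"unknown hook: {hook!r}") from None


def run_hook(hook, runner=subprocess.run):
    """Run the command for the given hook and return its exit status."""
    result = runner(command_for(hook), check=False)
    return result.returncode
```

A prologue wrapper would call run_hook("prolog") and an epilogue wrapper run_hook("epilog"). The runner argument is only there so the logic can be exercised without HTCondor installed.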

A second way is more complicated to set up, but gives Slurm more visibility into the jobs. This method relies on the job router to convert vanilla HTCondor jobs in the HTCondor schedd into grid jobs that go to Slurm. The Slurm scheduler then sees these as ordinary jobs, can schedule them as it sees fit, and accounts for them in the usual way.
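
As an illustration only, the conversion described above amounts to something like the following Python sketch. This is not how the job router is configured (it works on job ClassAds according to routes in the HTCondor configuration); the functions route_to_slurm and render_submit are hypothetical, and the "grid_resource = batch slurm" line reflects the grid-universe submit syntax as we understand it, so treat it as an assumption to verify against the manual.

```python
def route_to_slurm(job):
    """Return a copy of a vanilla job description rewritten as a grid job.

    `job` is a plain dict of submit-description keys and values; the
    input dict is left unmodified.
    """
    if job.get("universe", "vanilla") != "vanilla":
        raise ValueError("only vanilla jobs are routed in this sketch")
    routed = dict(job)
    routed["universe"] = "grid"
    # Assumed grid-universe syntax for handing the job to a local Slurm.
    routed["grid_resource"] = "batch slurm"
    return routed


def render_submit(job):
    """Render a job dict as submit-description text ending in 'queue'."""
    lines = [f"{key} = {value}" for key, value in job.items()]
    lines.append("queue")
    return "\n".join(lines)
```

For example, render_submit(route_to_slurm({"universe": "vanilla", "executable": "/bin/hostname"})) produces a description whose universe is grid. Once a job has been routed this way, Slurm sees and accounts for it like any other job it runs.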

We'd be happy to give you a hand to help set up either of these methods.