
Re: [HTCondor-users] HTCondor Cluster within Slurm Job



Hi Leslie,
CMS is using GlideinWMS.
This runs a persistent private HTCondor cluster (just a scheduler and a central node with the collector and negotiator, no worker nodes) and submits Glideins (pilot jobs that become HTCondor startds, i.e. worker nodes) to a variety of systems, including SLURM, as needed by the jobs in the queue:
- batch systems: SLURM, PBS, HTCondor
- commercial clouds: AWS, GCE
- grid nodes: HTCondor-CE, ARC
GlideinWMS can easily be installed via yum/RPM.
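(For a VO frontend that is roughly "yum install glideinwms-vofrontend" from the OSG yum repositories, with the factory shipped as a separate set of packages; exact package names vary between releases, so check the GlideinWMS installation documentation.)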

HTCondor can also submit jobs to SLURM directly, using the grid universe's batch submission.
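For example, a minimal submit description along these lines should route a job to the local Slurm scheduler (file names are placeholders, and the batch GAHP, a.k.a. "blahp", has to be installed for the batch grid type):

universe      = grid
grid_resource = batch slurm
executable    = my_job.sh
output        = my_job.out
error         = my_job.err
log           = my_job.log
queue

A remote Slurm cluster can be reached by appending user@login-host to grid_resource.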

At Fermilab we did a test more similar to what you mention, starting the whole HTCondor cluster as a batch job on an HPC machine using Cobalt (a different job manager).
We did that because the HPC machine in question has no network connectivity from the worker nodes, so we could not use the Glideins that we normally prefer.
The solution can easily be adapted to SLURM.
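As a rough sketch of the idea (not the actual Fermilab scripts), the "start" step of such a Slurm batch job has to discover the allocated nodes, write a small HTCondor configuration that points everything at one node acting as central manager, and run condor_master everywhere. Something along these lines, assuming HTCondor is installed on every node, the working directory is on a shared filesystem, and the script runs on the first node of the allocation (all paths and knob values below are illustrative only):

#!/usr/bin/env python3
# Hypothetical "HTCondorStart" sketch, not a supported tool: bring up a
# throwaway HTCondor pool on the nodes of the current Slurm allocation.
import os
import subprocess

# Expand SLURM_JOB_NODELIST (e.g. "node[01-04]") into individual hostnames.
hosts = subprocess.run(
    ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
    capture_output=True, text=True, check=True,
).stdout.split()
central_manager, workers = hosts[0], hosts[1:]

# Per-node working directories; assumes the Slurm node names match
# HTCondor's $(HOSTNAME).
workdir = os.path.join(os.getcwd(), "htcondor-" + os.environ["SLURM_JOB_ID"])
for h in hosts:
    for sub in ("log", "spool", "execute"):
        os.makedirs(os.path.join(workdir, h, sub), exist_ok=True)

# Deliberately minimal configuration; a real setup also needs security
# settings (e.g. a pool password or IDTOKENS) and resource limits.
common = f"""
CONDOR_HOST = {central_manager}
LOCAL_DIR   = {workdir}/$(HOSTNAME)
USE_SHARED_PORT = True
ALLOW_WRITE = *
"""
cm_config = os.path.join(workdir, "condor_config.cm")
wk_config = os.path.join(workdir, "condor_config.worker")
with open(cm_config, "w") as f:
    f.write(common + "DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD\n")
with open(wk_config, "w") as f:
    f.write(common + "DAEMON_LIST = MASTER, STARTD\n")

# Central-manager daemons on this (first) node.
subprocess.Popen(["condor_master", "-f"],
                 env=dict(os.environ, CONDOR_CONFIG=cm_config))

# One startd per remaining node, kept alive by srun for the life of the job.
if workers:
    subprocess.Popen(
        ["srun", f"--nodes={len(workers)}", f"--ntasks={len(workers)}",
         "--ntasks-per-node=1", f"--nodelist={','.join(workers)}",
         f"--export=ALL,CONDOR_CONFIG={wk_config}",
         "condor_master", "-f"])

The matching "finish" step is then essentially condor_off -master run with the same CONDOR_CONFIG, and a "wait" step can poll condor_q (or use condor_wait on a job log) until the queue drains.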

I'd be glad to share more information about any of the above.
Feel free to send me an email.
Cheers,
Marco Mambelli



> On Sep 30, 2021, at 02:26, Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
> 
> Maybe a different scale, but CMS is running their global HTCondor pool on top of the various LRMSes in the Grid (probably not well suited to isolated HPCs, I guess).
> https://iopscience.iop.org/article/10.1088/1742-6596/898/5/052031
> 
> On 29/09/2021 22.21, Michael Pelletier via HTCondor-users wrote:
>> I'm interested in this idea as well - the bulk of the HPC in the combined companies is now Slurm, and that's intended to be the supported standard (alas), so having a bridge capability would be useful.
>> It seems like it'd be some sort of variant of HTCondor-CE with "pilot" jobs.
>> Michael V Pelletier
>> Principal Engineer
>> Raytheon Technologies
>> Digital Technology
>> HPC Support Team
>> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Leslie Hart - NOAA Federal via HTCondor-users
>> Sent: Wednesday, September 29, 2021 3:49 PM
>> To: htcondor-users@xxxxxxxxxxx
>> Cc: Leslie Hart - NOAA Federal <leslie.b.hart@xxxxxxxx>
>> Subject: [External] [HTCondor-users] HTCondor Cluster within Slurm Job
>> Hi,
>> Is it possible (and is there an existing recipe) to start up a "private" HTCondor cluster within a Slurm job? We have users who would like to allocate a number of nodes and then use those nodes as an HTCondor cluster for the duration of the job. Ideally, we could supply a few commands that they would use at the beginning and end of their Slurm batch job to start and shut down the cluster (the middle would be a series of HTCondor jobs, of course), e.g. HTCondorStart (would figure out the nodes that Slurm has allocated and create the cluster), HTCondorWait (would wait until all HTCondor jobs complete) and HTCondorFinish (would gracefully shut down HTCondor).
>> Thanks,
>> Leslie Hart
> 