
Re: [HTCondor-users] Using BLAH to submit/monitor/handle jobs to different slurm clusters (Cross-Cluster Operations)



Hi,

This is also covered here:

https://opensciencegrid.org/docs/compute-element/job-router-recipes/#setting-batch-system-directives

However, as you mention in your original email, this only affects batch submission -- not monitoring or canceling jobs.

Brian

On Jun 11, 2019, at 3:46 PM, George Papadimitriou <georgpap@xxxxxxx> wrote:

Hi Steve,

In Pegasus we deal with this by adding a parameter to the CERequirements and parsing it on the remote site in slurm_local_submit_attributes.sh.
I have attached a .sub file, and this is the link to the Pegasus slurm_local_submit_attributes.sh: https://github.com/pegasus-isi/pegasus/blob/master/share/pegasus/htcondor/glite/slurm_local_submit_attributes.sh

In the remote_cerequirements section of the .sub file there is a parameter called "EXTRA_ARGUMENTS", which carries multiple Slurm batch directives. If you look at Pegasus' slurm_local_submit_attributes.sh, the section that parses it is towards the end of the script.
You can check out the Pegasus online docs for more information: https://pegasus.isi.edu/documentation/glite.php
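
For anyone who doesn't want to open the attachment, the relevant stanza of such a submit file looks roughly like the sketch below. The host and option values are made up for illustration, and the exact quoting of remote_cerequirements is best taken from the attached .sub file or the Pegasus docs:

    universe               = grid
    grid_resource          = batch slurm user@cori.nersc.gov
    +remote_cerequirements = EXTRA_ARGUMENTS == "-M escori --qos=xfer"

On the remote side, the blahp hands the EXTRA_ARGUMENTS value to slurm_local_submit_attributes.sh, which turns it into #SBATCH lines in the generated submission file.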

Regards,
George


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Steven C Timm <timm@xxxxxxxx>
Sent: Tuesday, June 11, 2019 10:06:26 AM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Using BLAH to submit/monitor/handle jobs to different slurm clusters (Cross-Cluster Operations)
 
We (Fermilab) have been doing this in an ad-hoc way at NERSC for a while, by installing multiple bosco clients in different subdirectories on the NERSC side and passing an extra argument to tell bosco which subdirectory to use. That is how we kept Cori and Edison separate, and also how we differentiate between the "KNL" and "Haswell" nodes of Cori.

We would much prefer a less hacky way to do things. The most general approach would be the ability to push arbitrary lines of SLURM batch directives into the final Slurm submission file.
Bosco already has some features for passing through certain Slurm parameters (particularly memory and node count), but we haven't had time to test them yet.
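
For what it's worth, as I understand the mechanism George describes above, the slurm_local_submit_attributes.sh hook can already do roughly this: the blahp appends whatever the script prints to the generated sbatch file, and the keys named in remote_cerequirements are available to it as shell variables. A stripped-down sketch of the idea -- not the actual Pegasus script:

    #!/bin/bash
    # Runs on the remote side; stdout is appended to the generated
    # Slurm submission file. EXTRA_ARGUMENTS comes from the submit
    # file's remote_cerequirements expression.
    if [ -n "$EXTRA_ARGUMENTS" ]; then
        echo "#SBATCH $EXTRA_ARGUMENTS"
    fi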

Steve Timm




From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Jaime Frey <jfrey@xxxxxxxxxxx>
Sent: Tuesday, June 11, 2019 11:52:22 AM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Using BLAH to submit/monitor/handle jobs to different slurm clusters (Cross-Cluster Operations)
 
On Jun 7, 2019, at 7:52 AM, Jaime Frey <jfrey@xxxxxxxxxxx> wrote:

On Jun 6, 2019, at 3:31 PM, George Papadimitriou <georgpap@xxxxxxx> wrote:

These days there are execution sites (e.g. NERSC) that have multiple job management clusters to facilitate specific user needs, such as data transfers, computations, etc.
For example, Cori at NERSC has two Slurm clusters, "cori" and "escori": the first is used for compute and the second for other tasks, like transferring data.

I'm currently trying to set up a BOSCO submit node that uses ssh to submit/monitor/modify jobs at NERSC, and also takes advantage of both Slurm clusters.
However, I've stumbled upon an issue with monitoring and modifying the submitted jobs.
Even though I was able to specify the cluster I want to submit the job to with Slurm's #SBATCH -M argument, I couldn't find a way to pass this to the rest of the operations (e.g. status, cancel, etc.).
As a result, I cannot correctly interact with jobs submitted to the "escori" cluster (the non-default one).
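
To illustrate (the job id and script name below are made up): Slurm's multi-cluster mode needs the -M/--clusters flag on every command, not only at submission time, so the monitoring and cancel operations would have to run something like:

    sbatch  -M escori transfer.sbatch   # submit to the non-default cluster
    squeue  -M escori -j 1234567        # status also needs the cluster name
    scancel -M escori 1234567           # ...and so does cancel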

Is there a way to handle this?

BOSCO can submit to multiple partitions within Slurm, but submitting to multiple clusters is new for us. We will have to research what would be involved to support this setup. It looks like modifying BOSCO's slurm scripts to add -M all may work. I'll follow up with you off the list once I investigate some more.
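
For anyone who wants to experiment in the meantime, that change would presumably amount to adding the flag to the status and cancel helpers of the remote bosco/blahp install. This is an untested sketch, and the script names/paths are assumptions -- check your own install:

    # e.g. in ~/bosco/glite/bin/slurm_status.sh and slurm_cancel.sh,
    # change invocations of the form
    #     squeue  ...   ->   squeue  -M all ...
    #     scancel ...   ->   scancel -M all ...
    # so that jobs on every configured cluster are visible and cancellable.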

It looks like we can add support for multiple Slurm clusters fairly easily. We are beginning work on this to be included in an upcoming release. If anyone on this list is interested in this feature, let us know.

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project

<namd2_ID0000001.sub>

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/