
Re: [HTCondor-users] HTCondor - Slurm integration



OK, I will try all of this outside of a GlideinWMS factory, make sure I have it reproducible, and then will probably have more questions.

Steve Timm


From: Carl Edquist <edquist@xxxxxxxxxxx>
Sent: Tuesday, March 3, 2020 3:17 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Steven C Timm <timm@xxxxxxxx>; Brian Lin <blin@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor - Slurm integration
 
So, Jaime has brought to my attention that the preferred way to specify
the "Queue" attribute in the submit file is apparently "batch_queue", which
condor recognizes and translates to the "BatchQueue" attribute in the
classad for remote batch systems.

As for the other attributes, you can specify them with a leading "+", like
"+NodeNumber = 8", and the leading "remote_" is not needed for any of
these.
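As a sketch, a grid universe submit file using both forms might look like
this (the executable name and values here are placeholders, not from this
thread):

```
universe      = grid
grid_resource = batch slurm
executable    = my_job.sh

# Preferred way to select the remote queue/partition
batch_queue   = regular

# Other BLAHP attributes go straight into the classad with a leading "+"
+NodeNumber   = 8

queue
```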

Carl

On Tue, 3 Mar 2020, Brian Lin wrote:

> Hi Steve,
>
> Since this is for a CE, you'll want to use `set_remote_queue` or
> `eval_set_remote_queue` in your job router configuration. Carl's going
> to double-check that prefixing `remote_` is applicable to the other
> attributes in question.
>
> As for the remote CE requirements, HTCondor-CE 4 with HTCondor 8.8 has a
> simpler format
> (https://htcondor-ce.readthedocs.io/en/latest/releases/#400), so you
> could set something like the following in your job route:
>
>     set_Container = "cmssw/cms:rhel7";
>     set_default_CERequirements = "Container";
>
> And use the $Container variable in your slurm_local_submit_attributes.sh.
> It's not substituted at submit time per se but rather at the time that
> Bosco/BLAHP generates the submit file.
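> For instance, slurm_local_submit_attributes.sh could then contain a line
> like this (a sketch; $Container here is the shell variable the BLAHP
> exports from the route's set_Container value):
>
> ```
> echo "#SBATCH --image=$Container"
> ```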
>
> Reviewing your local submit attributes further, you can simplify some of
> those lines (this is assuming the need for the "remote_" prefix):
>
> echo "#SBATCH --account=m2612"
> --> 'set_remote_BatchProject = "m2612";' in your job route
>
> echo "#SBATCH -N 1"
> --> this is hardcoded so you can eliminate this line
>
> echo "#SBATCH -t 48:00:00"
> --> 'set_remote_BatchRuntime = 172800;' in your job route (the attribute
> is in seconds; slurm_submit.sh converts it to minutes for "-t")
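> Putting those together, the relevant piece of a job route might look like
> this (the route name and values are illustrative; check them against your
> own configuration):
>
> ```
> JOB_ROUTER_ENTRIES @=jre
> [
>   name = "NERSC_Slurm";
>   GridResource = "batch slurm";
>   set_remote_BatchProject = "m2612";    # -> #SBATCH -A m2612
>   set_remote_BatchRuntime = 172800;     # 48 h, in seconds
>   set_Container = "cmssw/cms:rhel7";
>   set_default_CERequirements = "Container";
> ]
> @jre
> ```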
>
> Let us know if you have any additional questions!
>
> - Brian
>
> On 3/2/20 4:19 PM, Carl Edquist wrote:
>> Hi Steve,
>>
>>> I am now using htcondor 8.9.5 and the newest bosco/blahp on the remote end
>>> (bosco 1.3.0).
>>
>> Ok, as far as I can tell the only significant addition to slurm_submit.sh
>> between condor 8.8.4 and 8.9.5 was the ability to specify a job cluster,
>> which translates to a line with "#SBATCH -M $cluster_name".  I don't see
>> that any of the parameters have gone away though.
>>
>>
>>> I tried all 5 of the parameters Carl listed here and none of them made
>>> it through into the slurm job that got submitted.
>>
>> On the condor side, I think you may need to prefix those attribute names
>> with "+remote_", if I understand correctly what I see in the manual here:
>>
>> https://htcondor.readthedocs.io/en/stable/grid-computing/grid-universe.html#htcondor-c-job-submission
>>
>>
>>> Brian also pointed out that in 8.9 and the newer versions of htcondor-ce
>>> there is a variable substitution feature via the
>>> set_default_remotece_requirements.
>>
>>> and then modified my condor submit file to have
>>>
>>> set_default_remote_cerequirements = strcat(Container == cmssw/cms:rhel7)
>>
>> So, a couple details that catch my attention are,
>>
>> - you mention "set_default_remotece_requirements" -- maybe just a typo in
>> the email; it's "remote_cerequirements" not "remotece_requirements"
>>
>> - and, from my read of the "Setting batch system directives" section in the
>> manual that you linked, "set_default_remote_cerequirements" goes in the
>> "JOB_ROUTER_ENTRIES" configuration (defined in
>> /etc/condor-ce/config.d/02-ce-*.conf and
>> /etc/condor-ce/config.d/99-local.conf), but note that the attribute itself
>> is called "default_remote_cerequirements" (without the "set_" prefix). So,
>> I'm thinking putting "set_default_remote_cerequirements" in the submit file
>> itself might not do the right thing.
>>
>> Brian, can you confirm about whether set_default_remote_cerequirements or
>> default_remote_cerequirements can be used in a submit file?
>>
>> Thanks,
>> Carl
>>
>> On Mon, 24 Feb 2020, Steven Timm wrote:
>>
>>> I am just looking at this again now.
>>>
>>> "Queue" is a reserved word in the condor submit language so it can't
>>> possibly be used to also specify the remote queue, can it? (I got an error
>>> when I tried).
>>>
>>> I am now using htcondor 8.9.5 and the newest bosco/blahp on the remote end
>>> (bosco 1.3.0).
>>>
>>> I tried all 5 of the parameters Carl listed here and none of them made
>>> it through into the slurm job that got submitted. I am still
>>> investigating why.
>>>
>>> Brian also pointed out that in 8.9 and the newer versions of htcondor-ce
>>> there is a variable substitution feature via the
>>> set_default_remotece_requirements.
>>>
>>> https://htcondor-ce.readthedocs.io/en/latest/batch-system-integration/#setting-batch-system-directives
>>>
>>> Below is what our slurm_local_submit_attributes.sh looks like at NERSC
>>> right now.  All of those attributes can and do sometimes change.
>>>
>>>
>>> echo "#SBATCH --account=m2612"
>>> #echo "#SBATCH --reservation=xrootd_debug"
>>> echo "#SBATCH -N 1"
>>> echo "#SBATCH -q regular"
>>> echo "#SBATCH -C knl,cache,quad"
>>> echo "#SBATCH --image=cmssw/cms:rhel7"
>>> echo "#SBATCH -L cscratch1,cvmfs"
>>> echo "#SBATCH --module=cvmfs"
>>> echo "#SBATCH --volume=\"/global/cscratch1/sd/uscms/node_cache:/tmp:perNodeCache=size=680G\""
>>> echo "#SBATCH -t 48:00:00"
>>>
>>>
>>> So do I understand correctly that if
>>> I modified my script to be
>>>
>>> echo "#SBATCH --image=$Container"
>>>
>>> and then modified my condor submit file to have
>>>
>>> set_default_remote_cerequirements = strcat(Container == cmssw/cms:rhel7)
>>>
>>> that the Container variable would be substituted in at submit time?
>>>
>>> If not, then how does it work?
>>>
>>>
>>> Steve Timm
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Mon, 30 Sep 2019, Carl Edquist wrote:
>>>
>>>> Hi Asvija,
>>>>
>>>> Brian asked me to look into this - sorry for the delay getting back to
>>>> you.
>>>>
>>>> The mappings I find based on the condor 8.8.4 version of slurm_submit.sh
>>>> are:
>>>>
>>>>         "BatchProject" ->
>>>>         #SBATCH -A $bls_opt_project
>>>>
>>>>         "BatchRuntime" ->
>>>>         #SBATCH -t $((bls_opt_runtime / 60))
>>>>
>>>>         "RequestMemory" ->
>>>>         #SBATCH --mem=${bls_opt_req_mem}
>>>>
>>>>         "Queue" ->
>>>>         #SBATCH -p $bls_opt_queue
>>>>
>>>>         "NodeNumber" ->
>>>>         #SBATCH -N $bls_opt_mpinodes
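>>>>
>>>> As a sketch, a grid universe submit file exercising these mappings
>>>> might contain (values are placeholders; depending on the setup, the
>>>> attribute names may need a leading "+remote_" prefix, as discussed
>>>> elsewhere in this thread):
>>>>
>>>> ```
>>>> +BatchProject  = "m2612"    # -> #SBATCH -A m2612
>>>> +BatchRuntime  = 172800     # seconds -> #SBATCH -t 2880
>>>> +Queue         = "regular"  # -> #SBATCH -p regular
>>>> +NodeNumber    = 2          # -> #SBATCH -N 2
>>>> request_memory = 2048       # -> #SBATCH --mem=2048
>>>> ```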
>>>>
>>>> Carl
>>>>
>>>> On Thu, 5 Sep 2019, Asvija B wrote:
>>>>
>>>>> Hi Brian,
>>>>>
>>>>> Condor version is 8.8.4
>>>>>
>>>>>
>>>>> Thanks and regards,
>>>>>
>>>>> Asvija
>>>>>
>>>>> On 9/5/2019 2:33 AM, Brian Lin wrote:
>>>>>> Hi Asvija,
>>>>>>
>>>>>> Unfortunately, there isn't much in terms of documentation, but I could
>>>>>> give you a mapping if you give me the version of HTCondor you're
>>>>>> running.
>>>>>>
>>>>>> Thanks,
>>>>>> Brian
>>>>>>
>>>>>> On 8/19/19 12:12 AM, Asvija B wrote:
>>>>>>> Thanks a lot Brian... I am able to see the +remote_NodeNumber getting
>>>>>>> translated properly.
>>>>>>>
>>>>>>> Can you also please indicate the corresponding directives for other
>>>>>>> SLURM-related attributes as well (like --nodes, --ntasks, etc.)?
>>>>>>>
>>>>>>> It would be great if you can point me to some documentation related
>>>>>>> to this.
>>>>>>>
>>>>>>> Additionally, the slurm_submit.sh file from BLAH's github directory (
>>>>>>> https://github.com/prelz/BLAH/blob/master/src/scripts/slurm_submit.sh
>>>>>>> ) has additional capabilities for GPU and MIC support.  Do we have
>>>>>>> any documentation which points to the corresponding Condor directives
>>>>>>> for these?
>>>>>>>
>>>>>>> Thanks again for the information.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Asvija
>>>>>>>
>>>>>>>
>>>>>>> On 8/16/2019 8:53 PM, Brian Lin wrote:
>>>>>>>> Hi Asvija,
>>>>>>>>
>>>>>>>> You'll want to specify '+remote_NodeNumber' in your original grid job
>>>>>>>> submit file. However, you should note that the Slurm directives we
>>>>>>>> set
>>>>>>>> will be changing in future releases of HTCondor 8.9 to the following:
>>>>>>>>
>>>>>>>> "#SBATCH --nodes=1"
>>>>>>>> "#SBATCH --ntasks=1"
>>>>>>>> "#SBATCH --cpus-per-task=$bls_opt_mpinodes"
>>>>>>>>
>>>>>>>> - Brian
>>>>>>>>
>>>>>>>> On 8/13/19 12:32 AM, Asvija B wrote:
>>>>>>>>> Dear Condor users,
>>>>>>>>>
>>>>>>>>> We are planning to use HTCondor for submitting jobs to some of our
>>>>>>>>> SLURM-managed clusters.  As I dug into the documentation, I
>>>>>>>>> understood that HTCondor uses the BLAH GAHP to support job
>>>>>>>>> submission to SLURM.
>>>>>>>>>
>>>>>>>>> We are interested in submitting MPI jobs to SLURM through HTCondor.
>>>>>>>>> In this regard, I am unable to find the configuration parameters in
>>>>>>>>> the condor submission script for indicating MPI-related information
>>>>>>>>> (e.g. number of nodes).
>>>>>>>>>
>>>>>>>>> I have seen the script file
>>>>>>>>> $CONDOR_HOME/libexec/glite/bin/slurm_submit.sh.  It does include
>>>>>>>>> statements with $bls_opt_mpinodes which translate to "#SBATCH -N"
>>>>>>>>> directives.  However, I am not clear about the equivalent condor
>>>>>>>>> directives that will result in the proper SLURM directives, so it
>>>>>>>>> would be great if any of the SLURM users could comment on this.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks and regards,
>>>>>>>>>
>>>>>>>>> Asvija B
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>>
>>>>>>>>> ------------------------------------------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> [ C-DAC is on Social-Media too. Kindly follow us at:
>>>>>>>>> Facebook: https://www.facebook.com/CDACINDIA & Twitter: @cdacindia ]
>>>>>>>>>
>>>>>>>>> This e-mail is for the sole use of the intended recipient(s) and may
>>>>>>>>> contain confidential and privileged information. If you are not the
>>>>>>>>> intended recipient, please contact the sender by reply e-mail and
>>>>>>>>> destroy all copies and the original message. Any unauthorized
>>>>>>>>> review, use, disclosure, dissemination, forwarding, printing or
>>>>>>>>> copying of this email is strictly prohibited and appropriate legal
>>>>>>>>> action will be taken.
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> HTCondor-users mailing list
>>>>>>>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>>>>>>>>> with a subject: Unsubscribe
>>>>>>>>> You can also unsubscribe by visiting
>>>>>>>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>>>>>>> The archives can be found at:
>>>>>>>>> https://lists.cs.wisc.edu/archive/htcondor-users/
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>
>>> ------------------------------------------------------------------
>>> Steven C. Timm, Ph.D  (630) 840-8525
>>> timm@xxxxxxxx  http://home.fnal.gov/~timm/
>>> Office: Feynman Computing Center 243
>>> Fermilab Scientific Computing Division,
>>> Scientific Computing Facilities Quadrant,
>>> Experimental Computing Facilities Dept.,
>>> Grid and Cloud Operations Group
>>>
>