Re: [HTCondor-users] HTCondor-CE on slurm



> On Jun 17, 2019, at 10:17 AM, Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx> wrote:
> 
> I'm trying to set up a HTC-CE instance on top of a slurm batch system.
> 
> [sdalpra0@r000u11l06-fe condor]$ rpm -qa | grep htcondor
> htcondor-ce-3.2.2-1.el7.noarch
> htcondor-ce-slurm-3.2.2-1.el7.noarch
> htcondor-ce-client-3.2.2-1.el7.noarch
> 
> I am testing it as a dteam VO member, and the following Job Router rules:
> 
> The JOB_ROUTER_ENTRIES @=jre
> [
>         name = "condor_pool_dteam";
>         GridResource = "batch slurm";
>         TargetUniverse = 9;
>         Requirements = (regexp("dteam", TARGET.x509UserProxyVoName));
>         MaxJobs = 100;
>         MaxIdleJobs = 100;
> ]
> [
>         name = "condor_pool_cms";
>         GridResource = "batch slurm";
>         TargetUniverse = 9;
>         Requirements = target.x509UserProxyVOName =?= "cms";
>         MaxJobs = 1280;
>         MaxIdleJobs = 1280;
> ]
>   @jre
> 
> A job submitted to the CE seems to be routed up to submission, where it... Disappears:
> 
> JobRouterLog says:
> 
> 06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (src=18.0,dest=19.0,route=condor_pool_dteam): submitted job
> 06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (src=18.0,dest=19.0,route=condor_pool_dteam): submitted job has not yet appeared in job queue mirror or was removed (submitted 0 seconds ago)
> 
> (I copy below the complete chunk of this transaction from JobRouterLog).
> 
> AFAIK current HTCondor-CE version does no more depend on a blahp rpm.
> 
> I've found this shell script: /usr/libexec/condor/glite/bin/slurm_submit.sh
> from the condor-8.8.2 rpm
> but i'm not sure this is actually invoked somewhere or somehow;

In the RPMs distributed from UW-Madison, the blahp is included in the 'condor' package.
This includes the slurm_submit.sh script you found.

> I would need some enlightenment on how to troubleshoot this:
> 
> - How can i see what slurm submission command is generated?
> (I added a cp $bls_tmp_file /tmp/copia_${bls_tmp_file} to see the slurm submit file but no file is created,
> thus i doubt this script is actually executed at all).

Adding this in the right place in slurm_submit.sh should work. Since no file is created, this suggests the machinery isn't getting this far.
The Job Router log shows that job 19.0 was created to do the submission into slurm. Does that job appear in condor_ce_q or condor_ce_history? If so, what's its status?

Is there a /var/log/condor-ce/GridmanagerLog.<user> file? That is the log file for the daemon that invokes the blahp.
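
As a rough sketch, the checks above could be run on the CE host like this. The function name inspect_routed_job and its arguments are only illustrative; the job id 19.0 and the log directory come from the messages above, and the grep patterns assume the usual JobStatus/HoldReason/RemoveReason attribute names.

```shell
# Sketch: look for the routed job and for the gridmanager log on the CE host.
# inspect_routed_job is a hypothetical helper, not part of HTCondor-CE.
inspect_routed_job() {
    job_id="${1:-19.0}"                 # routed job id from the JobRouterLog
    log_dir="${2:-/var/log/condor-ce}"  # CE log directory

    if ! command -v condor_ce_q >/dev/null 2>&1; then
        echo "condor_ce_q not found: run this on the HTCondor-CE host" >&2
        return 127
    fi

    # Still in the CE queue? Show its status and any hold reason.
    condor_ce_q -l "$job_id" | grep -E '^(JobStatus|HoldReason) '

    # Already gone from the queue? Check the history instead.
    condor_ce_history -l "$job_id" | grep -E '^(JobStatus|RemoveReason) '

    # The gridmanager, which invokes the blahp, writes one log per user.
    ls -l "$log_dir"/GridmanagerLog.* 2>/dev/null \
        || echo "no GridmanagerLog.<user> file in $log_dir"
}
```

Calling inspect_routed_job 19.0 prints whatever the queue or history knows about the job; if neither shows it and there is no GridmanagerLog file, the job most likely never reached the gridmanager.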


> - How do i specify in the submit file the partition name? (and a few most common slurm options, i would say;
> do you have a simple example submit file for slurm?)


To specify the slurm partition, you can add this to your Condor submit file:

batch_queue = mypartition
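
For a simple test outside the CE, a minimal submit file might look like the sketch below. It assumes you submit directly with condor_submit from a host that has the blahp installed, using the same "batch slurm" grid resource as your routes; the file names and /bin/hostname as the test executable are just placeholders, and mypartition stands for your real partition name.

```shell
# Write a minimal grid-universe submit file (file names are hypothetical).
cat > slurm_test.sub <<'EOF'
universe      = grid
grid_resource = batch slurm
batch_queue   = mypartition
executable    = /bin/hostname
output        = slurm_test.out
error         = slurm_test.err
log           = slurm_test.log
queue
EOF

# Then, on a host where condor_submit and the blahp are available:
# condor_submit slurm_test.sub
```

If this job reaches slurm while the routed one does not, the problem is more likely on the CE side than in the blahp.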

Some common slurm options are supported out-of-the-box, and support for additional options can be added by customizing your blahp configuration and using the CERequirements job attribute. 

 - Jaime