
[HTCondor-users] HTCondor-CE on slurm



Hello,
I'm trying to set up an HTCondor-CE instance on top of a Slurm batch system.

[sdalpra0@r000u11l06-fe condor]$ rpm -qa | grep htcondor
htcondor-ce-3.2.2-1.el7.noarch
htcondor-ce-slurm-3.2.2-1.el7.noarch
htcondor-ce-client-3.2.2-1.el7.noarch

I am testing it as a dteam VO member, with the following Job Router rules:

JOB_ROUTER_ENTRIES @=jre
[
        name = "condor_pool_dteam";
        GridResource = "batch slurm";
        TargetUniverse = 9;
        Requirements = (regexp("dteam", TARGET.x509UserProxyVoName));
        MaxJobs = 100;
        MaxIdleJobs = 100;
]
[
        name = "condor_pool_cms";
        GridResource = "batch slurm";
        TargetUniverse = 9;
        Requirements = target.x509UserProxyVOName =?= "cms";
        MaxJobs = 1280;
        MaxIdleJobs = 1280;
]
@jre

A job submitted to the CE seems to be routed up to the point of submission, where it disappears:

JobRouterLog says:

06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (src=18.0,dest=19.0,route=condor_pool_dteam): submitted job
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (src=18.0,dest=19.0,route=condor_pool_dteam): submitted job has not yet appeared in job queue mirror or was removed (submitted 0 seconds ago)

(I copy below the complete chunk of this transaction from JobRouterLog).
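
To follow one job through a JobRouterLog excerpt, something like the following Python sketch could help. It is only an illustration: the function names split_entries and job_events are hypothetical, and the entry layout (a MM/DD/YY HH:MM:SS timestamp, a (D_...) category, then "JobRouter (src=...,dest=...,route=...): message") is assumed from the lines quoted in this message, not taken from any HTCondor specification.

```python
import re

# Assumed layout: every entry starts with "MM/DD/YY HH:MM:SS (D_...)".
# Splitting on a zero-width lookahead requires Python 3.7 or newer.
ENTRY_START = re.compile(r"(?=\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} \(D_)")

# Per-job messages: "JobRouter (src=18.0,dest=19.0,route=NAME): text".
# The dest field is optional because it only appears after submission.
JOB_MSG = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(D_[^)]*\) "
    r"JobRouter \(src=(?P<src>[\d.]+)(?:,dest=(?P<dest>[\d.]+))?"
    r",route=(?P<route>[^)]+)\): (?P<msg>.*)$"
)


def split_entries(text):
    """Split log text into entries, even when several share one line."""
    # Collapse internal whitespace so wrapped entries become one line each.
    return [" ".join(e.split()) for e in ENTRY_START.split(text) if e.strip()]


def job_events(text, src):
    """Return (timestamp, dest, route, message) tuples for one source job id."""
    events = []
    for entry in split_entries(text):
        m = JOB_MSG.match(entry)
        if m and m.group("src") == src:
            events.append(
                (m.group("stamp"), m.group("dest"), m.group("route"), m.group("msg"))
            )
    return events
```

For example, job_events(open("JobRouterLog").read(), "18.0") would list every per-job message for source job 18.0; the file path here is a placeholder.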

AFAIK the current HTCondor-CE version no longer depends on a blahp rpm.

I've found this shell script, /usr/libexec/condor/glite/bin/slurm_submit.sh,
from the condor-8.8.2 rpm,
but I'm not sure whether it is actually invoked anywhere.

I would appreciate some guidance on how to troubleshoot this:

- How can I see what Slurm submission command is generated?
(I added a cp $bls_tmp_file /tmp/copia_${bls_tmp_file} to the script to capture the Slurm submit file, but no file is created,
so I doubt this script is executed at all.)

- How do I specify the partition name in the submit file? (And a few of the most common Slurm options, I would say;
do you have a simple example submit file for Slurm?)

My submit file is:

[sdalpra@ui-htc slurm_cn]$ cat testp308.sub
# Required for local HTCondor-CE submission
universe = vanilla
use_x509userproxy = true
+Owner = undefined

# Files
executable = p308/htcp308
output = htcp308.out
error = htcp308.err
log = htcp308.log
arguments = "0 0 1 1001"
# File transfer behavior
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
transfer_output_files = htcp308.err, htcp308.out
queue

#########

Thanks,
Stefano



06/17/19 15:23:56 (D_ALWAYS:2) === Current Probing Information ===
06/17/19 15:23:56 (D_ALWAYS:2) fsize: 5111              mtime: 1560777826
06/17/19 15:23:56 (D_ALWAYS:2) first log entry: 7 CreationTimestamp 1559046390
06/17/19 15:23:56 (D_ALWAYS) JobRouter: Checking for candidate jobs. routing table is:
Route Name             Submitted/Max        Idle/Max     Throttle Recent: Started Succeeded Failed
condor_pool_cms                0/   1280       0/   1280 none               0         0      0
condor_pool_dteam              0/    100       0/    100 none               0         0      0
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter: Umbrella constraint: ((target.x509userproxysubject =!= UNDEFINED) && (target.x509UserProxyExpiration =!= UNDEFINED) && (time() < target.x509UserProxyExpiration) && (target.JobUniverse =?= 5 || target.JobUniverse =?= 1)) && ( (target.x509UserProxyVOName is "cms") || ((regexp("dteam",TARGET.x509UserProxyVoName))) ) && (target.ProcId >= 0 && target.JobStatus == 1 && (target.StageInStart is undefined || target.StageInFinish isnt undefined) && target.Managed isnt "ScheddDone" && target.Managed isnt "External" && target.Owner isnt Undefined && target.RoutedBy isnt "htcondor-ce")
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter: Found candidate job src=18.0,route=condor_pool_dteam
06/17/19 15:23:56 (D_ALWAYS:2) SharedPortClient: sent connection request to schedd at <130.186.17.136:9619> for shared port id 723505_074b_3
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (src=18.0,route=condor_pool_dteam): claimed job
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Copying attribute RequestCpus to orig_RequestCpus
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Copying attribute environment to orig_environment
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Copying attribute OnExitHoldSubCode to orig_OnExitHoldSubCode
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Copying attribute OnExitHold to orig_OnExitHold
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Copying attribute OnExitHoldReason to orig_OnExitHoldReason
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Deleting attribute TotalSubmitProcs
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Deleting attribute CondorCE
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Deleting attribute PeriodicRemove
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute JobMemory
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute RequestMemory
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute JOB_GLIDEIN_Memory
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute osg_environment
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute requirements
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute GlideinCpusIsGood
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute OnExitHoldReason
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute OnExitHold
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute JobIsRunning
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute OnExitHoldSubCode
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute RoutedJob
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute RequestCpus
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute JobCpus
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute CondorCECollectorHost
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute JOB_GLIDEIN_Cpus
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute remote_queue to an evaluated expression
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute remote_OriginalMemory to an evaluated expression
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute OriginalMemory to an evaluated expression
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute environment to an evaluated expression
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.00191ms] Owner --> a07cms04
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.09513ms] userHome(Owner,"/") --> /marconi/home/usera07cms/a07cms04
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.00310ms] CondorCECollectorHost --> r000u11l06-fe.marconi.cineca.it:9619
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.00215ms] orig_environment -->
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.00215ms] osg_environment -->
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.00215ms] orig_environment -->
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.04911ms] strcat(osg_environment," ",orig_environment) -->
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.09990ms] ifThenElse(orig_environment is undefined,osg_environment,strcat(osg_environment," ",orig_environment)) -->
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.30994ms] strcat("HOME=",userHome(Owner,"/")," CONDORCE_COLLECTOR_HOST=",CondorCECollectorHost," ",ifThenElse(orig_environment is undefined,osg_environment,strcat(osg_environment," ",orig_environment))) --> HOME=/marconi/home/usera07cms/a07cms04 CONDORCE_COLLECTOR_HOST=r000u11l06-fe.marconi.cineca.it:9619
06/17/19 15:23:56 (D_ALWAYS:2) Classad debug: [0.35000ms] strcat("HOME=",userHome(Owner,"/")," CONDORCE_COLLECTOR_HOST=",CondorCECollectorHost," ",ifThenElse(orig_environment is undefined,osg_environment,strcat(osg_environment," ",orig_environment))) --> HOME=/marconi/home/usera07cms/a07cms04 CONDORCE_COLLECTOR_HOST=r000u11l06-fe.marconi.cineca.it:9619
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute remote_SMPGranularity to an evaluated expression
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute remote_NodeNumber to an evaluated expression
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute remote_cerequirements to an evaluated expression
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (route=condor_pool_dteam): Setting attribute OriginalCpus to an evaluated expression
06/17/19 15:23:56 (D_ALWAYS:2) SharedPortClient: sent connection request to schedd at <130.186.17.136:9619> for shared port id 723505_074b_3
06/17/19 15:23:56 (D_ALWAYS:2) SharedPortClient: sent connection request to local schedd for shared port id 723505_074b_3
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (src=18.0,dest=19.0,route=condor_pool_dteam): submitted job
06/17/19 15:23:56 (D_ALWAYS:2) JobRouter (src=18.0,dest=19.0,route=condor_pool_dteam): submitted job has not yet appeared in job queue mirror or was removed (submitted 0 seconds ago)
06/17/19 15:24:06 (D_ALWAYS:2) JobRouter: polling state of (1) managed jobs.
06/17/19 15:24:06 (D_ALWAYS:2) TimerHandler_JobLogPolling() called
06/17/19 15:24:06 (D_ALWAYS:2) === Current Probing Information ===
06/17/19 15:24:06 (D_ALWAYS:2) fsize: 11084             mtime: 1560777844
06/17/19 15:24:06 (D_ALWAYS:2) first log entry: 7 CreationTimestamp 1559046390
06/17/19 15:24:06 (D_ALWAYS) JobRouter: Checking for candidate jobs. routing table is:
Route Name             Submitted/Max        Idle/Max     Throttle Recent: Started Succeeded Failed
condor_pool_cms                0/   1280       0/   1280 none               0         0      0
condor_pool_dteam              1/    100       1/    100 none               1         0      0
06/17/19 15:24:06 (D_ALWAYS:2) JobRouter: Umbrella constraint: ((target.x509userproxysubject =!= UNDEFINED) && (target.x509UserProxyExpiration =!= UNDEFINED) && (time() < target.x509UserProxyExpiration) && (target.JobUniverse =?= 5 || target.JobUniverse =?= 1)) && ( (target.x509UserProxyVOName is "cms") || ((regexp("dteam",TARGET.x509UserProxyVoName))) ) && (target.ProcId >= 0 && target.JobStatus == 1 && (target.StageInStart is undefined || target.StageInFinish isnt undefined) && target.Managed isnt "ScheddDone" && target.Managed isnt "External" && target.Owner isnt Undefined && target.RoutedBy isnt "htcondor-ce")
06/17/19 15:24:06 (D_ALWAYS:2) SharedPortClient: sent connection request to schedd at <130.186.17.136:9619> for shared port id 723505_074b_3
06/17/19 15:24:06 (D_ALWAYS:2) Setting RoutedToJobId = "19.0"
06/17/19 15:24:06 (D_ALWAYS:2) JobRouter (src=18.0,dest=19.0,route=condor_pool_dteam): updated job status
06/17/19 15:24:07 (D_ALWAYS:2) JobRouter: Evaluating all managed jobs periodic job policy expressions.
06/17/19 15:24:07 (D_ALWAYS:2) JobRouter: Evaluated all managed jobs periodic expressions.
06/17/19 15:24:16 (D_ALWAYS:2) JobRouter: polling state of (1) managed jobs.
06/17/19 15:24:16 (D_ALWAYS:2) TimerHandler_JobLogPolling() called