
Re: [HTCondor-users] requiring gpu through a HTCondor-CE



Hi Stefano,

Are you asking to pass arbitrary requirements from the incoming CE job 
to the routed job? I don't think that's really an appropriate thing to 
do, since jobs submitted to CEs are generally pilots that shouldn't have 
to be aware of the site policy or the underlying resources. I'd let 
whatever pilot system you support report back with the resources it 
finds, so the end users can then specify their own requirements (e.g., 
which CUDA version they need).
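
That is, the end-user job (not the pilot) would carry the GPU constraints 
itself, e.g. something along these lines in its submit file (just an 
illustration, using the attributes your condor_gpu_discovery output advertises):

    request_GPUs = 1
    requirements = (CUDADriverVersion >= 10.0)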

But if you really want to do it, you could add this snippet to your job 
router entries:

    copy_requirements = "original_requirements"
    eval_set_requirements = original_requirements

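For context, here's roughly where those knobs would sit in a full route 
entry (an untested sketch modeled on the dteam route you posted below; 
original_requirements is just a scratch attribute name):

    [
      name = "condor_pool_dteam";
      TargetUniverse = 5;
      Requirements = (regexp("dteam", TARGET.x509UserProxyVoName));
      copy_requirements = "original_requirements";
      eval_set_requirements = original_requirements;
      MaxJobs = 100;
      MaxIdleJobs = 100;
    ]
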
If, say, you wanted all of a single VO's pilots to require a GPU, your 
job router entry could look like this:

    [
      requirements = (x509UserProxyVOName =?= "ligo");
      set_requirements = (CUDACapability >= 1.2) && (CUDADeviceName =?= "Tesla K40m");
      ...
    ]
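
If you want both behaviors at once (preserving whatever the pilot asked 
for while appending the GPU clause), something like this (untested) 
should do it:

    copy_requirements = "original_requirements";
    eval_set_requirements = original_requirements && (CUDACapability >= 1.2) && (CUDADeviceName =?= "Tesla K40m");

You can then look at the routed job with condor_history -af Requirements 
(as you did below) to confirm the original clauses survived.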

- Brian

On 3/25/19 12:35 PM, Stefano Dal Pra wrote:
> Hello,
>
> I have an exec node equipped with two GPUs:
> [root@hpc-200-06-07 ~]# /usr/libexec/condor/condor_gpu_discovery 
> -properties
> DetectedGPUs="CUDA0, CUDA1"
> CUDACapability=3.5
> CUDADeviceName="Tesla K40m"
> CUDADriverVersion=10.0
> CUDAECCEnabled=true
> CUDAGlobalMemoryMb=11441
> CUDA0DevicePciBusId="0000:***"
> CUDA0DeviceUuid="0caa****"
> CUDA1DevicePciBusId="0000:***"
> CUDA1DeviceUuid="158****"
>
> The host can be identified through requirements:
>
> [root@ce02-htc ~]# condor_status -constraint '((CUDACapability >= 1.2) 
> && (CUDADeviceName =?= "Tesla K40m")) && (Arch == "X86_64") && (OpSys 
> == "LINUX")'
> Name                  OpSys  Arch   State     Activity LoadAv Mem    ActvtyTime
>
> slot1@wn-01-02-03**** LINUX  X86_64 Unclaimed Idle     0.000  128737 3+01:23:46
>
>               Machines Owner Claimed Unclaimed Matched Preempting Drain
>
>  X86_64/LINUX        1     0       0         1       0          0     0
>
>         Total        1     0       0         1       0          0     0
>
>
> Direct submission to condor from the CE host works, using the 
> following submit file:
>
> [sdalpra@ce02-htc htjobs]$ cat ce_testp308_gpu.sub
> universe = vanilla
>
> request_GPUs = 1
> requirements = (CUDACapability >= 1.2) && (CUDADeviceName =?= "Tesla 
> K40m") && $(requirements:True)
>
> executable = parrec_K40/parrec
> output = parrec.out
> error = parrec.err
> log = parrec.log
> arguments = "400 400 16 32 16"
>
> ShouldTransferFiles = YES
> WhenToTransferOutput = ON_EXIT
> transfer_input_files = parrec_K40/sinos_400.sdt, 
> parrec_K40/sinos_400.spr, parrec_K40/sinos.sct
> transfer_output_files = sinos_400.sdt, sinos_400.spr, sinos.sct
>
> queue
>
> ###########################
>
> Submission to the HTCondor-CE succeeds:
> [sdalpra@ui-htc htjobs]$ condor_submit -pool 
> ce02-htc.cr.cnaf.infn.it:9619 -remote ce02-htc.cr.cnaf.infn.it -spool 
> ce_testp308_gpu.sub
> Submitting job(s).
> 1 job(s) submitted to cluster 2953.
>
> using this submit file:
>
> [sdalpra@ui-htc htjobs]$ cat ce_testp308_gpu.sub
> # Required for local HTCondor-CE submission
> universe = vanilla
> use_x509userproxy = true
> +Owner = undefined
>
> request_GPUs = 1
> requirements = (TARGET.CUDACapability >= 1.2) && 
> (TARGET.CUDADeviceName =?= "Tesla K40m")
> [.... the rest is the same...]
>
>
> However, the requirements are overridden by the set_requirements entry 
> in the routing table:
>
> JOB_ROUTER_ENTRIES @=jre
> [
>         name = "condor_pool_dteam";
>         TargetUniverse = 5;
>         Requirements = (regexp("dteam", TARGET.x509UserProxyVoName));
>         set_requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX");
>         MaxJobs = 100;
>         MaxIdleJobs = 100;
> ]
>
> By inspecting JOB_ROUTER_DEFAULTS it seems that the original 
> requirements are being overwritten anyway:
> [...] set_requirements = True [...]
>
>
> Tracking a job submitted to the CE:
> [root@ce02-htc ~]# condor_ce_q -l 2917. -af RoutedToJobId requirements
> ClusterId = 2917
> ProcId = 0
> requirements = ((TARGET.CUDACapability >= 1.2) && 
> (TARGET.CUDADeviceName =?= "Tesla K40m")) && (TARGET.Arch == "X86_64") 
> && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && 
> (TARGET.Memory >= RequestMemory) && (TARGET.GPUs >= RequestGPUs) && 
> (TARGET.HasFileTransfer)
> RoutedToJobId = "2548.0"
>
>
> [root@ce02-htc ~]# condor_history -l 2548.0 -af RoutedFromJobId 
> requirements
> RoutedFromJobId = "2917.0"
> Requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX")
>
>
> I made several attempts to have the requirements in the submit file 
> routed from the CE to condor, but have found no successful way so far.
> Is it at all possible?
> Any inspiring example?
>
> Thank You
> Stefano
>
>
>