
[HTCondor-users] requiring gpu through a HTCondor-CE



Hello,

I have an exec node equipped with two GPUs:
[root@hpc-200-06-07 ~]# /usr/libexec/condor/condor_gpu_discovery -properties
DetectedGPUs="CUDA0, CUDA1"
CUDACapability=3.5
CUDADeviceName="Tesla K40m"
CUDADriverVersion=10.0
CUDAECCEnabled=true
CUDAGlobalMemoryMb=11441
CUDA0DevicePciBusId="0000:***"
CUDA0DeviceUuid="0caa****"
CUDA1DevicePciBusId="0000:***"
CUDA1DeviceUuid="158****"

The host can be identified through requirements:

[root@ce02-htc ~]# condor_status -constraint '((CUDACapability >= 1.2) && (CUDADeviceName =?= "Tesla K40m")) && (Arch == "X86_64") && (OpSys == "LINUX")'
Name                  OpSys      Arch   State     Activity LoadAv Mem     ActvtyTime

slot1@wn-01-02-03**** LINUX      X86_64 Unclaimed Idle      0.000 128737  3+01:23:46

               Machines Owner Claimed Unclaimed Matched Preempting Drain

  X86_64/LINUX        1     0       0         1       0          0     0

         Total        1     0       0         1       0          0     0


Direct submission to condor from the CE host works, using the following submit file:

[sdalpra@ce02-htc htjobs]$ cat ce_testp308_gpu.sub
universe = vanilla

request_GPUs = 1
requirements = (CUDACapability >= 1.2) && (CUDADeviceName =?= "Tesla K40m") && $(requirements:True)

executable = parrec_K40/parrec
output = parrec.out
error = parrec.err
log = parrec.log
arguments = "400 400 16 32 16"

ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
transfer_input_files = parrec_K40/sinos_400.sdt, parrec_K40/sinos_400.spr, parrec_K40/sinos.sct
transfer_output_files = sinos_400.sdt, sinos_400.spr, sinos.sct

queue

###########################

Submission to the HTCondor-CE succeeds:
[sdalpra@ui-htc htjobs]$ condor_submit -pool ce02-htc.cr.cnaf.infn.it:9619 -remote ce02-htc.cr.cnaf.infn.it -spool ce_testp308_gpu.sub
Submitting job(s).
1 job(s) submitted to cluster 2953.

using this submit file:

[sdalpra@ui-htc htjobs]$ cat ce_testp308_gpu.sub
# Required for local HTCondor-CE submission
universe = vanilla
use_x509userproxy = true
+Owner = undefined

request_GPUs = 1
requirements = (TARGET.CUDACapability >= 1.2) && (TARGET.CUDADeviceName =?= "Tesla K40m")
[.... the rest is the same...]


However, the requirements are overridden by the set_requirements entry in the routing table:

JOB_ROUTER_ENTRIES @=jre
[
        name = "condor_pool_dteam";
        TargetUniverse = 5;
        Requirements = (regexp("dteam", TARGET.x509UserProxyVoName));
        set_requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX");
        MaxJobs = 100;
        MaxIdleJobs = 100;
]

Inspecting JOB_ROUTER_DEFAULTS suggests that the original requirements would be overwritten even without the set_requirements line in my route:
[...] set_requirements = True [...]


Tracking a job submitted to the CE shows that the GPU requirements are present in the CE job but missing from the routed job:
[root@ce02-htc ~]# condor_ce_q -l 2917. -af RoutedToJobId requirements
ClusterId = 2917
ProcId = 0
requirements = ((TARGET.CUDACapability >= 1.2) && (TARGET.CUDADeviceName =?= "Tesla K40m")) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.GPUs >= RequestGPUs) && (TARGET.HasFileTransfer)
RoutedToJobId = "2548.0"


[root@ce02-htc ~]# condor_history -l 2548.0 -af RoutedFromJobId requirements
RoutedFromJobId = "2917.0"
Requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX")
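
For anyone who wants to repeat this comparison on other jobs, here is a minimal Python sketch of the check I did by eye. It only parses "Attr = value" lines like the ones above and reports which requirement clauses are present in the CE job but absent from the routed job. The function names (parse_classad_lines, missing_clauses) are my own and hypothetical; nothing here calls HTCondor, and the input strings are simply pasted from the two outputs above.

```python
def parse_classad_lines(text):
    """Parse 'Attr = value' lines into a dict keyed by lowercase attribute name."""
    attrs = {}
    for line in text.splitlines():
        # Split on the first ' = ' only; operators such as '>=', '==' and '=?='
        # inside the value are left untouched.
        name, sep, value = line.partition(" = ")
        if sep:
            attrs[name.strip().lower()] = value.strip()
    return attrs


def missing_clauses(ce_ad, routed_ad, clauses):
    """Return the clauses found in the CE job requirements but not in the routed job."""
    ce_req = ce_ad.get("requirements", "")
    routed_req = routed_ad.get("requirements", "")
    return [c for c in clauses if c in ce_req and c not in routed_req]


# Output pasted from the two commands above (CE job, then routed job).
ce_output = '''ClusterId = 2917
ProcId = 0
requirements = ((TARGET.CUDACapability >= 1.2) && (TARGET.CUDADeviceName =?= "Tesla K40m")) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.GPUs >= RequestGPUs) && (TARGET.HasFileTransfer)
RoutedToJobId = "2548.0"'''

routed_output = '''RoutedFromJobId = "2917.0"
Requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX")'''

# The two GPU clauses from the submit file; plain substring matching is a
# simplification and assumes the clauses are not rewritten in transit.
gpu_clauses = [
    'TARGET.CUDACapability >= 1.2',
    'TARGET.CUDADeviceName =?= "Tesla K40m"',
]

ce_ad = parse_classad_lines(ce_output)
routed_ad = parse_classad_lines(routed_output)
lost = missing_clauses(ce_ad, routed_ad, gpu_clauses)

for clause in lost:
    print("lost in routing:", clause)
```

With the two outputs above, both GPU clauses are reported as lost, which matches what I see: only the Arch and OpSys clauses from set_requirements reach the routed job.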


I have made several attempts to get the requirements from the submit file passed from the CE through to the routed condor job, but so far none has worked.
Is this possible at all?
Is there an example I could follow?

Thank you,
Stefano