
Re: [HTCondor-users] Job access to own Job and Machine ClassAds?



Hi Steffen,

HTCondor sets the _CONDOR_JOB_AD and _CONDOR_MACHINE_AD environment variables,
which point to files containing the respective ClassAd dumps.
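Since those files are plain-text ClassAd dumps with one "Attr = value" pair per line, ordinary shell tools are enough, no Python bindings needed. A minimal sketch (the helper name classad_attr is just an illustration, not an HTCondor command):

```shell
#!/bin/sh
# Read one attribute from a ClassAd dump file ("Attr = value" per line).
# classad_attr is a hypothetical helper name, not part of HTCondor.
classad_attr() {
    # $1: attribute name, $2: path to the ad file
    sed -n "s/^$1 = //p" "$2"
}

# Inside a job, for example:
#   classad_attr RequestGpus  "$_CONDOR_JOB_AD"
#   classad_attr AssignedGPUs "$_CONDOR_MACHINE_AD"
```

Note that string-valued attributes come back with their quotes, which is usually fine for eyeballing and easy to strip if needed.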

Regarding the GPU-related question:
HTCondor also sets the environment variables CUDA_VISIBLE_DEVICES, 
GPU_DEVICE_ORDINAL, and _CONDOR_AssignedGPUs in each job. They contain a 
comma-separated list of device identifiers for the GPUs assigned to the job.
At least with Nvidia GPUs, these are strings, not actual ordinals: e.g. 
"GPU-24a2cfec".
For some multi-GPU jobs, and for some frameworks that use CUDA under the hood, 
we observed problems when CUDA_VISIBLE_DEVICES was set to these string IDs. 
When users report that, we provide them with a script that translates the 
string identifiers into integral ones. The script is attached. If you run into 
such issues, we recommend sourcing it at the start of the job.
Usage looks like this:

$ echo ${CUDA_VISIBLE_DEVICES}
GPU-9515f130,GPU-fc201e55,GPU-515d339b,GPU-77ac93e6
$ source translate_gpu_ids.sh  
now: CUDA_VISIBLE_DEVICES=0,1,2,3
$ echo ${CUDA_VISIBLE_DEVICES}
0,1,2,3
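For anyone without the attachment at hand, the core of such a translation can be sketched like this. This is an illustrative reconstruction, not necessarily the attached script; it assumes Nvidia's nvidia-smi and its standard --query-gpu=index,uuid output, and the function name translate_gpu_ids is my own:

```shell
#!/bin/sh
# Sketch: map the UUID strings in CUDA_VISIBLE_DEVICES to integer indices.
# Written as a pure function so the index/uuid lookup table is passed in,
# e.g. from: nvidia-smi --query-gpu=index,uuid --format=csv,noheader
translate_gpu_ids() {
    # $1: comma-separated UUID list, $2: "index, uuid" lines
    uuids=$1 map=$2 result=
    old_ifs=$IFS; IFS=,
    for u in $uuids; do
        # look up the integer index for this UUID
        idx=$(printf '%s\n' "$map" | awk -F', ' -v u="$u" '$2 == u { print $1 }')
        result="${result:+$result,}$idx"
    done
    IFS=$old_ifs
    printf '%s\n' "$result"
}

# In a job one would then do something like:
#   map=$(nvidia-smi --query-gpu=index,uuid --format=csv,noheader)
#   CUDA_VISIBLE_DEVICES=$(translate_gpu_ids "$CUDA_VISIBLE_DEVICES" "$map")
#   export CUDA_VISIBLE_DEVICES
```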

Yes, a job can alter these environment variables, and that could be abused, 
but I'd really expect users not to change them: if everybody did, nobody 
would be able to get anything useful done...

Hope this helped!
- Joachim


On Monday, 15 May 2023 at 12:24:07 CEST, Steffen Grunewald wrote:
> Good morning/afternoon/...,
> 
> we're facing a problem with GPU-bound jobs, and while investigating the best
> approach to use a multi-GPU machine (I couldn't find an equivalent to CPU
> sharing - as that is done by the kernel), I was wondering
> 
> - Does a job running in its slot have a means to read its own Job ClassAd,
> and the Machine ClassAd of the slot it's running in?
> - If the answer is yes, how to do it without Python bindings?
> 
> (The background is: If the OSG gets access to some of our GPUs, how do we
> and how do the users make sure there are no collisions? If there's already
> a canonical way to assign and use GPUs known to, and used by, everyone -
> I'd like to join in... If there isn't, how to set up a standard?)
> 
> Thanks,
>  Steffen

Attachment: translate_gpu_ids.sh
Description: application/shellscript