
Re: [HTCondor-users] Job access to own Job and Machine ClassAds?



On Mon, 2023-05-15 at 13:01:25 +0200, Joachim Meyer wrote:
> Hi Steffen,
> 
> HTCondor sets the _CONDOR_JOB_AD and _CONDOR_MACHINE_AD environment variables 
> that point to files containing the respective ClassAd dumps.

Thanks, I must have missed those before, got them now.
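
For reference, a minimal Python sketch of how a job could read those two
files. It assumes the dumps are plain "Attribute = value" lines and keeps every
value as a raw string (quotes included); it does not evaluate ClassAd
expressions. The helper names read_classad_file and load_ad are made up for
this illustration, and AssignedGPUs / RequestGPUs are only example attributes
to look up.

```python
import os


def read_classad_file(path):
    """Parse 'Attr = value' lines into a dict of raw, unevaluated strings."""
    ad = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            # Split on the first '=' only, so expressions such as
            # 'Requirements = (A == 1)' keep their right-hand side intact.
            name, sep, value = line.partition("=")
            if sep:
                ad[name.strip()] = value.strip()
    return ad


def load_ad(env_var):
    """Return the ad named by env_var, or {} if the variable or file is missing."""
    path = os.environ.get(env_var)
    if not path or not os.path.isfile(path):
        return {}
    return read_classad_file(path)


if __name__ == "__main__":
    machine_ad = load_ad("_CONDOR_MACHINE_AD")
    job_ad = load_ad("_CONDOR_JOB_AD")
    # Example lookups; both print None when run outside an HTCondor job.
    print("AssignedGPUs:", machine_ad.get("AssignedGPUs"))
    print("RequestGPUs:", job_ad.get("RequestGPUs"))
```

Outside of a job both variables are unset, so the sketch just returns empty
dictionaries instead of failing.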

> Regarding the GPU related question:
> The HTCondor jobs get the environment variables CUDA_VISIBLE_DEVICES,
> GPU_DEVICE_ORDINAL and _CONDOR_AssignedGPUs set; each contains a
> comma-separated list of device identifiers. These are the GPUs that
> were assigned to the job.

I didn't get CUDA_VISIBLE_DEVICES, probably because I had set
 ENVIRONMENT_FOR_AssignedGPUs    = VISIBLE_GPUS=/^/gpuid:/
in the config (as recommended in the docs) - so I got VISIBLE_GPUS instead,
with the proper GPU IDs.

When I use
 ENVIRONMENT_FOR_AssignedGPUs = CUDA_VISIBLE_DEVICES
instead, only the **digits** from the ID are returned, and for a non-GPU
job, $CUDA_VISIBLE_DEVICES is set to 10000. In no case do I get a proper match
from the translation script you suggested.

> At least with Nvidia GPUs, these are strings, not actual ordinals: e.g. 
> "GPU-24a2cfec".

That's what I get with VISIBLE_GPUS, not with CUDA_VISIBLE_DEVICES (there the
same ID shows up as 242). Something is mangling the ID strings...
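
The observed value can be reproduced by dropping every non-digit character
from the ID string. The snippet below only illustrates that observation; it
is an assumption that something equivalent happens on the HTCondor side, and
the function name digits_only is invented for this example.

```python
import re


def digits_only(gpu_id):
    """Remove every character that is not a decimal digit."""
    return re.sub(r"\D", "", gpu_id)


# 'GPU-24a2cfec' keeps only the digits 2, 4 and 2.
print(digits_only("GPU-24a2cfec"))
```

A value produced this way no longer identifies a device: it is neither an
ordinal nor a UUID prefix, which would explain why a translator that expects
"GPU-..." strings finds no match.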

> For some multi-GPU jobs and some frameworks that use CUDA under the hood, we
> observed that they weren't happy with CUDA_VISIBLE_DEVICES being set to
> the string IDs. When users report that, we provide them with a script that
> translates the string IDs to integer ones. The script is attached. In case
> there are issues, we recommend running this script at the start of the job.
> See the usage:
> 
> $ echo ${CUDA_VISIBLE_DEVICES}
> GPU-9515f130,GPU-fc201e55,GPU-515d339b,GPU-77ac93e6
> $ source translate_gpu_ids.sh  
> now: CUDA_VISIBLE_DEVICES=0,1,2,3
> $ echo ${CUDA_VISIBLE_DEVICES}
> 0,1,2,3

Using your script and injecting _CONDOR_AssignedGPUs instead, I do get
usable CUDA_VISIBLE_DEVICES values (1, 2, ... as GPU 0 is already in use) for
GPU jobs; it has been suggested to set it to -1 for non-GPU jobs.
(Could ENVIRONMENT_VALUE_FOR_UnAssignedGPUs be used for this?)
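
The attached translate_gpu_ids.sh is not reproduced here. As a rough Python
sketch of the same idea, under these assumptions: _CONDOR_AssignedGPUs holds
short IDs such as "GPU-9515f130" that are prefixes of the full UUIDs, and
"nvidia-smi --query-gpu=index,uuid --format=csv,noheader" prints one
"index, uuid" line per device. The function names are invented for this
example, and falling back to -1 follows the suggestion for non-GPU jobs
mentioned above.

```python
import os
import subprocess


def parse_gpu_table(smi_output):
    """Turn 'index, uuid' lines into a list of (index, uuid) string pairs."""
    table = []
    for line in smi_output.splitlines():
        index, sep, uuid = line.partition(",")
        if sep:
            table.append((index.strip(), uuid.strip()))
    return table


def translate(assigned, table):
    """Map short GPU IDs to device indices; return '-1' if nothing matches."""
    indices = []
    for token in assigned.split(","):
        token = token.strip()
        if not token:
            continue
        for index, uuid in table:
            # Assumption: the short ID is a prefix of the full UUID.
            if uuid.startswith(token):
                indices.append(index)
                break
    return ",".join(indices) if indices else "-1"


def query_nvidia_smi():
    """Ask nvidia-smi for the index and UUID of every visible device."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=index,uuid", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout


if __name__ == "__main__":
    assigned = os.environ.get("_CONDOR_AssignedGPUs", "")
    try:
        table = parse_gpu_table(query_nvidia_smi())
    except (OSError, subprocess.CalledProcessError):
        table = []  # no usable nvidia-smi: treat as "no GPUs"
    os.environ["CUDA_VISIBLE_DEVICES"] = translate(assigned, table)
    print("now: CUDA_VISIBLE_DEVICES=" + os.environ["CUDA_VISIBLE_DEVICES"])
```

Setting os.environ only affects the current Python process and its children,
so this would have to run before any CUDA-based framework is initialised.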

> Yes, you can alter these environment variables in the job. That can be
> abused, but I'd really expect all users not to change that variable, since
> if everybody did that, nobody would be able to do anything useful...

At the moment it seems everyone is (ab)using the wrong GPUs because HTCondor
requires the user to do the right thing (use a translator like yours) and
confuses me with mangled GPU IDs. I'll continue my search for a consistent
recipe that takes the burden of finding the right GPU off the (anonymous)
user of our (randomly assigned) resources... without the need for a
user-provided translator.


Thanks so far,
 S


-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~