
Re: [HTCondor-users] Translating GPU device assignments?



GPU_DEVICE_ORDINAL is the equivalent of CUDA_VISIBLE_DEVICES for OpenCL. It would be incorrect for us to renumber it.

It sounds like you are saying that the job shouldn't look at CUDA_VISIBLE_DEVICES at all; it should just look at the number of GPUs it has been assigned and start from 0.

-tj


-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael Pelletier
Sent: Thursday, July 6, 2017 10:04 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Translating GPU device assignments?

A little bit of follow-up as I worked on this over the long weekend.

[Michael Pelletier] 
So it turns out that the CUDA_VISIBLE_DEVICES=2,3 environment variable prompts the CUDA library to renumber the GPU ordinals for those devices to 0,1.

Thus in order to get the correct ordinals, you can't just use CUDA_VISIBLE_DEVICES or GPU_DEVICE_ORDINAL.

So it seems that the GPU_DEVICE_ORDINAL variable is being set incorrectly - when used in combination with CUDA_VISIBLE_DEVICES, it should be set to 0 through however many GPUs are requested.
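To make the renumbering concrete, here is a minimal Python sketch of the mapping described above. The function name renumbered_ordinals is hypothetical and only models the behavior reported in this message; it does not call the CUDA library.

```python
def renumbered_ordinals(visible_devices):
    """Model the renumbering described above.

    Takes the value of CUDA_VISIBLE_DEVICES (e.g. "2,3") and returns a dict
    mapping the ordinal the job sees (0, 1, ...) to the physical device ID
    listed in the variable. Hypothetical helper, for illustration only.
    """
    physical = [d.strip() for d in visible_devices.split(",") if d.strip()]
    return {local: phys for local, phys in enumerate(physical)}


# With CUDA_VISIBLE_DEVICES=2,3 the job addresses the devices as 0 and 1.
mapping = renumbered_ordinals("2,3")
job_ordinals = ",".join(str(local) for local in mapping)
```

In this model the job-side ordinals depend only on how many devices are listed, not on which physical devices they are.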

I've worked around it via:


GPU_ORDINAL = $CHOICE(REQGPU_INT, "error", "0", "0,1", "0,1,2", \
    "0,1,2,3", "0,1,2,3,4", "0,1,2,3,4,5", "0,1,2,3,4,5,6", \
    "0,1,2,3,4,5,6,7", "0,1,2,3,4,5,6,7,8", "too_many_gpus_requested")
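For comparison, the same lookup can be sketched in Python. This is only an illustration of what the table encodes, not part of the HTCondor configuration; the function name gpu_ordinal and the max_gpus parameter are hypothetical, and treating every request above nine GPUs (and any negative request) the same way is an assumption of the sketch.

```python
def gpu_ordinal(requested, max_gpus=9):
    """Return the zero-based ordinal list for a number of requested GPUs.

    Mirrors the $CHOICE table above: 0 yields "error", 1 through 9 yield
    "0", "0,1", and so on, and anything larger yields
    "too_many_gpus_requested". Hypothetical helper, for illustration only.
    """
    if requested <= 0:
        # Assumption: negative requests are treated like a request for 0.
        return "error"
    if requested > max_gpus:
        return "too_many_gpus_requested"
    return ",".join(str(i) for i in range(requested))


# A job that requested two GPUs would address them as 0 and 1.
two_gpus = gpu_ordinal(2)
```

Generating the list from the count avoids maintaining the table by hand, at the cost of moving the logic out of the configuration file.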

And as I mentioned before, it'd be great to have this as a job attribute as well.

	-Michael Pelletier.