Re: [HTCondor-users] Translating GPU device assignments?



> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf
> Of Greg Thain
> Sent: Monday, July 03, 2017 2:10 PM
> 
> On 07/03/2017 12:32 PM, Michael Pelletier wrote:
> > +Arguments = "-c '/usr/bin/env FLAGS_gpu=$(DOLLAR)GPU_DEVICE_ORDINAL
> caffe train ...etc...'"
> 
> 
> Very nice!  Technically, isn't the /usr/bin/env redundant, as /bin/sh
> itself can set the environment on the command line?  i.e.
> 
> 
> +Arguments = "-c 'FLAGS_gpu=$(DOLLAR)GPU_DEVICE_ORDINAL caffe train
> ...etc...'"
[Michael Pelletier] 

Quite correct - and the +Arguments notation is unnecessary too. A plain arguments line works, since the submit description's quote expansion translates the arguments as needed.
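
For anyone who wants to check Greg's point outside of HTCondor, here is a minimal sketch. It assumes GPU_DEVICE_ORDINAL is already exported in the job environment (set by hand here, with an assumed value of 1), and it uses printenv as a hypothetical stand-in for "caffe train ...etc..." so it runs on a machine without Caffe:

```shell
#!/bin/sh
# Stand-in for the variable HTCondor provides to the job (assumed value).
export GPU_DEVICE_ORDINAL=1

# Form 1: /usr/bin/env places FLAGS_gpu in the environment of the command.
with_env=$(/bin/sh -c '/usr/bin/env FLAGS_gpu=$GPU_DEVICE_ORDINAL printenv FLAGS_gpu')

# Form 2: the shell's own assignment prefix does the same job without env.
without_env=$(/bin/sh -c 'FLAGS_gpu=$GPU_DEVICE_ORDINAL printenv FLAGS_gpu')

echo "with env: $with_env, without env: $without_env"
```

In both forms the command sees the same FLAGS_gpu value, so the shorter one should be enough in the submit description.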

I also tracked down another problem: if you run caffe with --gpu=1 while CUDA_VISIBLE_DEVICES is also set to 1, it rejects the --gpu argument as invalid and will only recognize device 0:


[pelletm@hostname dir_32240]$ unset CUDA_VISIBLE_DEVICES
[pelletm@hostname dir_32240]$ ./caffe device_query --gpu=0
I0704 16:57:17.567883 32733 caffe.cpp:112] Querying GPUs 0
I0704 16:57:19.188906 32733 common.cpp:168] Device id:                     0
I0704 16:57:19.188983 32733 common.cpp:169] Major revision number:         3
I0704 16:57:19.188995 32733 common.cpp:170] Minor revision number:         0
I0704 16:57:19.189083 32733 common.cpp:171] Name:                          GRID K1
...
[pelletm@hostname dir_32240]$ ./caffe device_query --gpu=1
I0704 16:57:20.726950 32735 caffe.cpp:112] Querying GPUs 1
I0704 16:57:22.323016 32735 common.cpp:168] Device id:                     1
I0704 16:57:22.323065 32735 common.cpp:169] Major revision number:         3
I0704 16:57:22.323077 32735 common.cpp:170] Minor revision number:         0
I0704 16:57:22.323086 32735 common.cpp:171] Name:                          GRID K1

However:

[pelletm@hostname dir_32240]$ export CUDA_VISIBLE_DEVICES=1
[pelletm@hostname dir_32240]$ ./caffe device_query --gpu=1
I0704 16:58:17.985580 32759 caffe.cpp:112] Querying GPUs 1
F0704 16:58:19.722486 32759 common.cpp:148] Check failed: error == cudaSuccess (10 vs. 0)  invalid device ordinal
...
[pelletm@hostname dir_32240]$ ./caffe device_query --gpu=0
I0704 16:59:17.303726 32783 caffe.cpp:112] Querying GPUs 0
I0704 16:59:19.081667 32783 common.cpp:168] Device id:                     0
I0704 16:59:19.081717 32783 common.cpp:169] Major revision number:         3
I0704 16:59:19.081727 32783 common.cpp:170] Minor revision number:         0
I0704 16:59:19.081733 32783 common.cpp:171] Name:                          GRID K1

It also dumps core, so I think I'll go ahead and open an issue with the Caffe project.
The workaround is to include "unset CUDA_VISIBLE_DEVICES" in the sh -c command line.
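
A minimal sketch of that workaround, under a few assumptions: both variables are set by hand here to an assumed value of 1, and printenv is a hypothetical stand-in for "caffe train ...etc...". As far as I know, the CUDA runtime renumbers the devices listed in CUDA_VISIBLE_DEVICES starting from 0, which would be consistent with the device_query output above; unsetting the variable makes the ordinal refer to the physical device again.

```shell
#!/bin/sh
# Stand-ins for what HTCondor would place in the job environment (assumed values).
export CUDA_VISIBLE_DEVICES=1
export GPU_DEVICE_ORDINAL=1

# In a submit description this might look like the following (untested sketch
# built from the arguments line earlier in the thread):
#   arguments = "-c 'unset CUDA_VISIBLE_DEVICES; FLAGS_gpu=$(DOLLAR)GPU_DEVICE_ORDINAL caffe train ...etc...'"

# Run the same command string locally and report what the child process sees.
seen=$(/bin/sh -c 'unset CUDA_VISIBLE_DEVICES; FLAGS_gpu=$GPU_DEVICE_ORDINAL printenv FLAGS_gpu; printenv CUDA_VISIBLE_DEVICES || echo "CUDA_VISIBLE_DEVICES unset"')

echo "$seen"
```

The unset only affects the sh -c process and its children, so the slot's own environment is left alone.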

	-Michael Pelletier.