
Re: [HTCondor-users] Running multiple jobs on the same GPU



 

I understand that cuInit is not called by Condor. What I was trying to say is that I do not see the error when the same jobs are run outside of Condor.

 

If I change my job to a script that prints out the CUDA_VISIBLE_DEVICES environment variable and then sleeps for over a minute, then all the jobs print "0" but still only one job runs at a time.
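
The test job only needs to print the variable and then sleep; a minimal sketch of such a script (Python used here purely for illustration) would be:

#!/usr/bin/env python
# Minimal test job: report which GPU HTCondor handed us, then idle long
# enough that any other matched jobs would be visible running alongside it.
import os
import time

print(os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))
time.sleep(90)  # sleep for over a minute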

 

As to Michael Pelletier's comment: we are also able to keep our P100 "nice and toasty" (loaded to 100%) by training about 15 machine learning jobs on it simultaneously, but that requires starting them manually, which is of course suboptimal.

 

However, I have just run the first tests of configuring Condor as described here:

 

https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageGpusInSeriesSeven

 

and it seems to be working: I am able to run multiple jobs using the same GPU.

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of John M Knoeller
Sent: Wednesday, April 25, 2018 4:43 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Running multiple jobs on the same GPU

 

condor never calls cuInit, so this message can't be coming from HTCondor.

 

"failed call to cuInit: CUDA_ERROR_NO_DEVICE"

 

If you change your job to a script that prints out the CUDA_VISIBLE_DEVICES environment variable and then sleeps for a while, do multiple jobs start? Do they all print "0"?

 

-tj

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Vaurynovich, Siarhei
Sent: Wednesday, April 25, 2018 3:17 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Running multiple jobs on the same GPU

 

John,

 

Thank you for your reply!

 

The solution does not seem to work, unfortunately. This is what I did:

 

  1. Added to my configuration file:

use feature : GPUs

GPU_DISCOVERY_EXTRA = -extra

MACHINE_RESOURCE_GPUS = CUDA0, CUDA0, CUDA0, CUDA0, CUDA0

CUDA_VISIBLE_DEVICES = 0

  2. "condor_reconfig" on the execute GPU node
  3. "service condor restart" on both the master and execute nodes
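
After the restart, a quick way to sanity-check the configuration is to look at what the startd actually advertises on the execute node, for example:

condor_status -long <slot name> | grep -i gpu

and confirm that the GPUs resource shows up on the slots.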

 

In the submit file I tried to set

 

request_GPUs = 0.199 # only one GPU process starts

request_GPUs = 1  # only one GPU process starts

SlotID>=0 && SlotID<6 # all processes start, but only one gets the GPU and the rest fail with "failed call to cuInit: CUDA_ERROR_NO_DEVICE"

 

I am guessing that only one slot gets assigned a GPU, since if I set a range of SlotIDs that does not contain 0, then all jobs fail with "failed call to cuInit".

 

If I run several of my jobs interactively, they are all able to use the GPU simultaneously, so it is an HTCondor issue.

 

I am using Condor 8.6.9-1

 

If you have any other ideas I could try, or if I did something wrong, please do let me know.

 

Thank you,

Siarhei.

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of John M Knoeller
Sent: Wednesday, April 25, 2018 10:20 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Running multiple jobs on the same GPU

 

I think if you double-advertised CUDA devices, you *would* get multiple jobs running on the same GPU.

 

If

 

MACHINE_RESOURCE_GPUS = CUDA0, CUDA0, CUDA1, CUDA1

 

Then the Startd could hand out the resource CUDA0 twice, and would set CUDA_VISIBLE_DEVICES = 0 both times, because it sets that value just by stripping off "CUDA" and keeping the number.

 

If that is not working, then it's a bug and we should fix it.

-tj

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael Pelletier
Sent: Wednesday, April 25, 2018 9:03 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Running multiple jobs on the same GPU

 

From what I can tell, this isn't possible in a straightforward way.

 

CPU cores are fungible, so if you want to assign half a core to a job you can just set the machine's total CPU count to twice what it actually is, and then have a job request one CPU, which means it will get half of a real core.
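
For example, on a 16-core machine (numbers purely illustrative) the config would advertise double the real count:

NUM_CPUS = 32   # twice the 16 physical cores

and each job would then ask for request_cpus = 1, effectively getting half a real core.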

 

However, because $CUDA_VISIBLE_DEVICES is used to inform the job which GPU to use, the GPUs are not fungible, so if you double-advertised the GPUs you wouldn't get CUDA0, CUDA0, CUDA1, CUDA1, but 0, 1, 2, 3 instead.

 

Perhaps you could do something with a user job wrapper script to remap the visible devices on machines with double-advertised GPUs? Transform CUDA1 to CUDA0, and CUDA0,CUDA1 to CUDA0, etc?
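
As a rough sketch of that idea (hypothetical and untested): point USER_JOB_WRAPPER at a small script that folds the inflated device ordinals back onto the real devices, for instance modulo the physical GPU count, before exec'ing the job:

#!/usr/bin/env python
# Hypothetical wrapper for a node whose GPUs were advertised twice, so that
# assigned ordinals can run past the real device count. It remaps
# CUDA_VISIBLE_DEVICES onto real devices and then execs the actual job.
import os
import sys

PHYSICAL_GPUS = 2  # assumed real device count on this node

assigned = os.environ.get("CUDA_VISIBLE_DEVICES", "")
if assigned:
    real = sorted({int(d) % PHYSICAL_GPUS for d in assigned.split(",")})
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in real)

# HTCondor invokes the wrapper with the job's own command line as arguments;
# the wrapper must exec the real job.
os.execvp(sys.argv[1], sys.argv[1:])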

 

NVIDIA's CUDA 9.1 package introduces a new service that partitions GPUs in the driver, so I think we're starting to get to the point where we'll need to see GPUs as partitionable resources. I've been meaning to experiment with that feature to see how one would go about advertising it to the collector.

 

                -Michael Pelletier.

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Vaurynovich, Siarhei
Sent: Wednesday, April 25, 2018 9:49 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [External] [HTCondor-users] Running multiple jobs on the same GPU

 

 

Hello,

 

Could you please help me figure out how to configure HTCondor to run multiple processes on the same GPU? Is it possible at all? Each process is rather light, using <=20% of the GPU, but there are many of them, so I can certainly run more than one of them in parallel.

 

I restricted my processes to use only 1/3 of the GPU memory and put the following in my submit file:

 

request_GPUs = 0.333

 

But HTCondor still runs only one GPU-using process at a time. Of course, I could restrict the slot numbers and not tell HTCondor that I will be using the GPU, but I was wondering if there is a better solution.

 

Thank you for your help,

Siarhei.

 

 
