Re: [HTCondor-users] Pilot jobs for GPU sharing - Python bindings? Etc?



Hi Michael,

Coming from the LHC/grid world, my first idea would be some kind of
replica of the CMS approach. Each of their pilot jobs starts its own
startd (effectively a Condor node), riding piggyback on top of the grid
batch systems, and joins their global Condor pool, which then supplies
the actual payloads.
Perhaps a simple Singularity container could serve as the pilot: it
would carry the basic Condor pieces, connect only to a small GPU
sub-cluster, and feed that? (But that might be terrible overkill just
to saturate some GPUs...)
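
For comparison, the lightweight end of the spectrum (a pilot that
simply works through a list of payloads inside its single-GPU claim)
could look roughly like the sketch below. This is only an illustration
under stated assumptions: run_pilot, its parameters, and the in-memory
payload list are hypothetical names, standing in for whatever source
(a CSV file, the schedd queue) the pilot would really pull from.
Payloads inherit the pilot's environment; the optional gpu_id override
sets CUDA_VISIBLE_DEVICES and is there mainly for testing.

```python
import os
import subprocess
import sys
import time
from collections import deque


def run_pilot(payloads, max_concurrent=10, gpu_id=None, poll_interval=0.1):
    """Run payload commands, keeping at most max_concurrent alive at once.

    payloads: iterable of argv lists (hypothetical stand-in for a real
    work source such as a CSV file or a queue query).
    Returns a list of (argv, returncode) tuples in completion order.
    """
    pending = deque(payloads)
    running = []   # (argv, Popen) pairs currently executing
    finished = []  # (argv, returncode) pairs that have exited

    # Children inherit the pilot's environment; optionally pin the GPU.
    env = dict(os.environ)
    if gpu_id is not None:
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)

    while pending or running:
        # Top up to the concurrency limit.
        while pending and len(running) < max_concurrent:
            argv = pending.popleft()
            running.append((argv, subprocess.Popen(argv, env=env)))

        # Collect payloads that have exited since the last pass.
        still_running = []
        for argv, proc in running:
            rc = proc.poll()
            if rc is None:
                still_running.append((argv, proc))
            else:
                finished.append((argv, rc))
        running = still_running

        if running:
            time.sleep(poll_interval)

    return finished


if __name__ == "__main__":
    # Three trivial payloads, at most two running at any moment.
    demo = [[sys.executable, "-c", "print('payload done')"] for _ in range(3)]
    for argv, rc in run_pilot(demo, max_concurrent=2, gpu_id=0):
        print(rc)
```

The sketch does no memory or CPU accounting and no retries; the claim
itself would have to be sized for all 10-12 instances, as described in
the original question.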

Cheers,
  Thomas

On 2018-10-31 02:46, Michael Pelletier wrote:
> As some of you may recall, I've been mulling options for getting a single GPU to accept multiple jobs without a static split of the GPU via startd lies, because we have a range of job sizes, some of which actually need the whole GPU but others which only need a fraction of it.
> 
> Indeed, one of the jobs could potentially run 10-12 instances on each GPU, slashing our capital outlay and vastly increasing the users' productivity.
> 
> So the use of a "pilot job" occurred to me - the job would launch into a single-GPU claim with enough memory and CPU to support the 10-12 instances, and then pull its collection of 10-12 jobs and thus keep the GPU nice and toasty. Now, certainly the pilot job would be able to pull job information from a CSV file or some little cobbled-together gizmo, but what I was wondering is whether it would be feasible to have it pull work from the schedd queue, or create its own dummy startd and machine ad to which the jobs could match.
> 
> Anything in the Python bindings that might be able to doll up this approach and integrate more tightly with the queue? Is anyone doing the same sort of thing somewhere?
> 
> Thanks for any suggestions!
> 
> Michael V. Pelletier
> Information Technology
> Digital Transformation & Innovation
> Integrated Defense Systems
> Raytheon Company