
[HTCondor-users] Pilot jobs for GPU sharing - Python bindings? Etc?



As some of you may recall, I've been mulling options for getting a single GPU to accept multiple jobs without statically splitting the GPU via startd lies. We have a range of job sizes: some actually need the whole GPU, while others need only a fraction of it.

Indeed, one of the jobs could potentially run 10-12 instances on each GPU, slashing our capital outlay and vastly increasing the users' productivity.

So the use of a "pilot job" occurred to me - the job would launch into a single-GPU claim with enough memory and CPU to support the 10-12 instances, and then pull its collection of 10-12 jobs and thus keep the GPU nice and toasty. Now, certainly the pilot job would be able to pull job information from a CSV file or some little cobbled-together gizmo, but what I was wondering is whether it would be feasible to have it pull work from the schedd queue, or create its own dummy startd and machine ad to which the jobs could match.
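
To make the CSV-driven variant concrete, here is a minimal sketch of what such a pilot might look like. Everything named here is an assumption for illustration: the `command` column, the function names, and the default of 12 slots are hypothetical, and nothing in it talks to the schedd. It assumes the instances can share the GPU simply by running concurrently, and that the pilot inherits `CUDA_VISIBLE_DEVICES` from its slot unless a GPU id is passed explicitly.

```python
import csv
import os
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def load_tasks(csv_path):
    """Read one task per row; rows need a 'command' column (hypothetical layout)."""
    with open(csv_path, newline="") as fh:
        return [row["command"] for row in csv.DictReader(fh) if row.get("command")]


def run_task(command, gpu_id=None):
    """Run one task and return its exit code.

    If gpu_id is None the task inherits the pilot's environment, including
    any CUDA_VISIBLE_DEVICES already set for the claim.
    """
    env = dict(os.environ)
    if gpu_id is not None:
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return subprocess.run(shlex.split(command), env=env).returncode


def run_pilot(csv_path, gpu_id=None, slots=12):
    """Keep up to `slots` tasks running at once until the list is drained.

    Returns the exit codes in the same order as the rows in the CSV file.
    """
    tasks = load_tasks(csv_path)
    with ThreadPoolExecutor(max_workers=slots) as pool:
        return list(pool.map(lambda cmd: run_task(cmd, gpu_id), tasks))


if __name__ == "__main__":
    import sys

    # Usage: pilot.py tasks.csv
    exit_codes = run_pilot(sys.argv[1])
    sys.exit(max(exit_codes, default=0))
```

The thread pool only throttles how many child processes run at once; it does no accounting of GPU memory, so the slot count would have to be chosen to fit the job mix.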

Is there anything in the Python bindings that could dress up this approach and integrate it more tightly with the queue? Is anyone doing the same sort of thing somewhere?

Thanks for any suggestions.

Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company