
Re: [HTCondor-users] Removing / disabling GPUs without stopping jobs



Hi Joan,

As I recall, the list of available GPU cards in a partitionable slot comes from the DetectedGPUs attribute generated by the condor_gpu_discovery command, which then becomes the AssignedGPUs attribute for the slot. I think that if you manipulate that list, you can control which GPUs are available to be attached to jobs, and in what order.

What I'd suggest is wrapping the condor_gpu_discovery command in a script that filters its output based on your detection of the unusable GPU cards. However, since the detection tool isn't run periodically by default, I think with that approach you might wind up having to either reconfig every time a state change is identified or run the discovery as a periodic startd cron job.
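To make the wrapper idea concrete, here is a rough, untested sketch in Python. It assumes the discovery tool prints a line of the form DetectedGPUs="CUDA0, CUDA1" and that the startd can be pointed at the wrapper through the MACHINE_RESOURCE_INVENTORY_GPUs setting. The path to the real tool and the bad-GPU list file are placeholders I made up for illustration, and any per-device property lines in the output are passed through unfiltered.

```python
import re
import subprocess

# Assumed location of the real discovery tool; adjust for your installation.
REAL_DISCOVERY = "/usr/libexec/condor/condor_gpu_discovery"
# Hypothetical file maintained by your own health check, one GPU id per line.
BAD_GPU_FILE = "/var/run/bad_gpus"

# Matches a line such as: DetectedGPUs="CUDA0, CUDA1"
_DETECTED = re.compile(r'^(\s*DetectedGPUs\s*=\s*")([^"]*)(".*)$')


def filter_detected_gpus(discovery_output, bad_gpus):
    """Drop bad GPU ids from the DetectedGPUs line; pass other lines through."""
    bad = set(bad_gpus)
    result = []
    for line in discovery_output.splitlines():
        match = _DETECTED.match(line)
        if match:
            ids = [g.strip() for g in match.group(2).split(",") if g.strip()]
            kept = [g for g in ids if g not in bad]
            line = match.group(1) + ", ".join(kept) + match.group(3)
        result.append(line)
    return "\n".join(result)


def read_bad_gpus(path=BAD_GPU_FILE):
    """Return the set of GPU ids flagged as unusable (empty if no file exists)."""
    try:
        with open(path) as handle:
            return {line.strip() for line in handle if line.strip()}
    except FileNotFoundError:
        return set()


def run_wrapper(argv):
    """Run the real discovery tool with the same arguments and filter its output."""
    raw = subprocess.run(
        [REAL_DISCOVERY, *argv], check=True, capture_output=True, text=True
    ).stdout
    print(filter_detected_gpus(raw, read_bad_gpus()))
    return 0
```

To use it as an actual wrapper script, you would call run_wrapper(sys.argv[1:]) at the bottom of the file and exit with its return value. The startd would still only see a changed list when discovery runs again, so the reconfig caveat above still applies.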

Hopefully one of the GPU experts at CHTC will chime in.

Michael V Pelletier
Principal Engineer

Raytheon Technologies
Information Technology
Digital Transformation & Innovation
 


-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Joan Josep Piles-Contreras
Sent: Thursday, May 14, 2020 5:41 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [External] [HTCondor-users] Removing / disabling GPUs without stopping jobs

Hi,

In a partitionable slot, is there any way to disable a given GPU without restarting the startd, or in any case to prevent it from being assigned to new jobs? In short, what `OFFLINE_MACHINE_RESOURCE_<name>` already does, just without restarting the startd (i.e. without stopping already running jobs).

We have some nodes with multiple GPUs, and from time to time one of the GPUs crashes and ends up in some unstable state. The node then becomes a "black hole" because it keeps accepting jobs that then just crash.

We can already detect it, but fixing it usually requires a reboot (trying to reset the card doesn't always do the trick, we already tried).

What we would ideally want is to prevent the non-working card from being assigned to new jobs until we can find the right moment to reboot the node, because some jobs need a long time to run, and we want to make it as unintrusive as possible.

With static slots it would be easy to just set START to false in the slot, but would it be possible to do something equivalent with dynamic slots?

I fear the system will be very confused if we make the START expression conditional on the assigned GPU...

Best,

Joan

--
Dr. Joan Josep Piles-Contreras
ZWE Scientific Computing
Max Planck Institute for Intelligent Systems
(p) +49 7071 601 1750