
Re: [HTCondor-users] [External] Additional GPU statistics



Hello,

What you want is a startd cron job. The ClassAd attributes it prints are merged into the machine ClassAd, which makes them queryable with condor_status.
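As a rough illustration of such a job, here is a minimal Python sketch that asks nvidia-smi for per-GPU power draw and prints one "Attribute = value" line per GPU, which is the form a startd cron job writes to stdout. The attribute names (GPUPowerDrawWatts_0 and so on) are made up for this example and are not something HTCondor defines. It assumes nvidia-smi is on the PATH of the user running the job.

```python
#!/usr/bin/env python3
"""Sketch: print GPU power draw as ClassAd attributes for a startd cron job."""
import subprocess
import sys

# Hypothetical attribute prefix; choose names that do not clash with
# attributes HTCondor already advertises for the machine.
ATTR_PREFIX = "GPUPowerDrawWatts"


def parse_power_csv(text):
    """Parse 'index, watts' lines as printed by nvidia-smi in CSV mode.

    Lines whose power field is not a number (for example "[N/A]") are skipped.
    """
    readings = {}
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue
        try:
            readings[int(parts[0])] = float(parts[1])
        except ValueError:
            continue
    return readings


def format_classad(readings):
    """Turn {gpu_index: watts} into 'Attr = value' lines, plus a total."""
    lines = [f"{ATTR_PREFIX}_{idx} = {watts:.2f}"
             for idx, watts in sorted(readings.items())]
    if readings:
        lines.append(f"{ATTR_PREFIX}_Total = {sum(readings.values()):.2f}")
    return lines


def main():
    cmd = ["nvidia-smi", "--query-gpu=index,power.draw",
           "--format=csv,noheader,nounits"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except OSError:
        return 1  # nvidia-smi missing or not executable: publish nothing
    if result.returncode != 0:
        return 1
    for line in format_classad(parse_power_csv(result.stdout)):
        print(line)
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Once the attributes are in the machine ad, something like condor_status -af Machine GPUPowerDrawWatts_Total should show them, again assuming those example names.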

I do something similar with a job that calls ipmitool to check the power and cooling status of the machine and sets a PowerOrCoolingFault Boolean attribute, so the machine can reject jobs when a PSU or fan fault is flagged.

You can set the interval for startd cron jobs in the configuration. Bear in mind that the startd only sends its ad to the collector periodically, so running the cron job more often than that update interval doesn’t gain you anything in condor_status. I think it’s possible to push updates immediately from startd cron, but you’d want to keep an eye on the collector load in that case if you have a lot of machines.
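For reference, a startd cron job and its interval are declared with configuration along these lines. This is only a sketch: the job name GPUPOWER and the script path are placeholders, and the five-minute period is just an example value to tune against your collector update interval.

```
# "GPUPOWER" is an arbitrary job name; the executable path is a placeholder.
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) GPUPOWER
STARTD_CRON_GPUPOWER_EXECUTABLE = /path/to/your_gpu_power_script
STARTD_CRON_GPUPOWER_MODE = Periodic
STARTD_CRON_GPUPOWER_PERIOD = 5m
```

After changing the configuration, a condor_reconfig on the execute machine should be enough for the startd to pick up the new job.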

-Michael Pelletier. 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Benedikt Riedel <briedel@xxxxxxxxxxxxxxxx>
Sent: Wednesday, March 20, 2024 5:08:58 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [External] [HTCondor-users] Additional GPU statistics
 
Hi,

Is there a way to get additional GPU statistics, such as power draw, through HTCondor? Is there a way to increase the query rate for GPU statistics from HTCondor?

Thanks,

Benedikt

--
Benedikt Riedel
Global Computing Coordinator IceCube Neutrino Observatory
Technical Coordinator IceCube Neutrino Observatory
Computing Manager Wisconsin IceCube Particle Astrophysics Center
University of Wisconsin-Madison