
Re: [HTCondor-users] GPUs_MONITOR resource usage



Hey Michael,

I plan on updating these servers next week. It looks like I can pull 396 from the graphics-drivers PPA, so I will update them once I can coordinate with the users, see if that resolves the issue, and report back.

Can you, or someone else on the list, clarify what the following line (which I have commented out) does?
GPU_DISCOVERY_EXTRA = -extra

Per the wiki:
Advertise additional attributes of the GPUs by also setting

I believe I can see all the resources available on the GPUs without that setting, and I'm assuming they carry forward to the ClassAds.
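
One way to see what the flag changes, sketched under assumptions: run the discovery tool by hand with and without -extra and compare the two outputs. The helper name gpu_discovery_extra_only is hypothetical, and the tool's location depends on the install. Which attributes show up will likely vary by HTCondor and driver version.

```shell
# gpu_discovery_extra_only: hypothetical helper, not part of HTCondor.
# $1 is the path to condor_gpu_discovery (location varies by install).
# Prints only the output lines that appear when -extra is added.
gpu_discovery_extra_only() {
    tool="$1"
    base=$(mktemp) || return 1
    extra=$(mktemp) || { rm -f "$base"; return 1; }
    "$tool" -properties | sort > "$base"
    "$tool" -properties -extra | sort > "$extra"
    # comm -13 keeps lines unique to the second (sorted) file
    comm -13 "$base" "$extra"
    rm -f "$base" "$extra"
}

# Example, assuming condor_config_val is on PATH:
#   gpu_discovery_extra_only "$(condor_config_val LIBEXEC)/condor_gpu_discovery"
```

If the function prints nothing, the extra flag added no attributes on that machine.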

Thanks,

Sander



On Fri, Jan 11, 2019 at 12:55 PM Michael Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx> wrote:
Alexander,

That "lsmod" output indicates that you are using the NVIDIA drivers, so that's another good data point.

Your driver version is 390.87, so of course the first thing the NVIDIA tech is going to ask is whether anyone has tried the 396.44 release. Are you in a position where you can upgrade the driver on this machine? If not, I may be able to finagle a testbed here - it appears that one of my machines has 396.37, though I'm not quite ready to dive in to an 8.6->8.8 upgrade on a Friday afternoon.


Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Alexander Antoniades
Sent: Friday, January 11, 2019 12:31 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [External] Re: [HTCondor-users] GPUs_MONITOR resource usage

I was wondering if I was supposed to reply to that. :)

On the machine in question I believe we are running the nvidia drivers. I see the x-org nouveau drivers installed, but I'm not sure if that's an issue or not.

The "used by" counts on the nvidia module are a little eye-opening, but they are also high on other nodes that don't have the condor GPU monitor running (although condor itself is running there, if that matters).

I can provide any other output you want, but here's what I'm looking at:

root@gpu2:~# lsmod | grep nv
nvidia_uvm            757760  4
nvidia_drm             40960  0
nvidia_modeset       1114112  1 nvidia_drm
nvidia              14364672  632 nvidia_modeset,nvidia_uvm
drm_kms_helper        172032  2 ast,nvidia_drm
drm                   401408  5 ast,ttm,nvidia_drm,drm_kms_helper
ipmi_msghandler        53248  4 nvidia,ipmi_ssif,ipmi_devintf,ipmi_si
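
To compare the "used by" counts across nodes, output like the above can be reduced to just the module name and count. This is a minimal sketch; the function name nv_use_counts is hypothetical, and it assumes the usual three leading lsmod columns (module, size, used-by count).

```shell
# nv_use_counts: hypothetical helper. Reads lsmod-style text on stdin and
# prints "<module> <used-by count>" for modules whose name starts with "nvidia".
nv_use_counts() {
    awk '$1 ~ /^nvidia/ { print $1, $3 }'
}

# On a live node:
#   lsmod | nv_use_counts
```

Unlike "grep nv", this matches on the module name only, so drm and ipmi lines that merely list nvidia as a user are left out.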


