
[HTCondor-users] condor_ssh_to_job inside containers with GPU support



Hello,

We are submitting HTCondor jobs that run inside Singularity containers. The startds pass the --nv option to Singularity so that GPU support is available inside the containers for machine learning applications:

SINGULARITY_EXTRA_ARGUMENTS = --nv
SINGULARITY_JOB = !isUndefined(TARGET.SingularityImage)
SINGULARITY_IMAGE_EXPR = TARGET.SingularityImage

This works well for the jobs themselves. However, when we use condor_ssh_to_job, the session lacks the libcuda environment that --nv normally sets up; see [1]. Could it be that HTCondor does not pass --nv when entering the container?

Has anyone tried this?

[1]
[khurtado@camlnd ~]$ condor_ssh_to_job 60.0
Welcome to slot1_2@xxxxxxxxxxxx!
Your condor job is running with pid(s) 63160.
-sh: cannot set terminal process group (-1): Inappropriate ioctl for device
-sh: no job control in this shell
-sh: /root/.profile: Permission denied
-sh-4.2$ cat /etc/redhat-release
CentOS Linux release 7.7.1908 (Core)
-sh-4.2$ python
Python 2.7.5 (default, Aug  7 2019, 00:51:29)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/usr/lib/python2.7/site-packages/tensorflow/__init__.py", line 24, in <module>
  from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
 File "/usr/lib/python2.7/site-packages/tensorflow/python/__init__.py", line 49, in <module>
  from tensorflow.python import pywrap_tensorflow
 File "/usr/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
  raise ImportError(msg)
ImportError: Traceback (most recent call last):
 File "/usr/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
  from tensorflow.python.pywrap_tensorflow_internal import *
 File "/usr/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
  _pywrap_tensorflow_internal = swig_import_helper()
 File "/usr/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
  _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.
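
To compare the job's own environment with the condor_ssh_to_job session, a small check along the following lines could be run in both places. This is only a sketch: it assumes that --nv bind-mounts the host NVIDIA libraries under /.singularity.d/libs and adds that directory to LD_LIBRARY_PATH, and the function name diagnose_cuda is made up for illustration.

```python
import ctypes
import os

# Assumption: Singularity's --nv option exposes the host NVIDIA libraries
# in this directory and appends it to LD_LIBRARY_PATH.
NV_LIB_DIR = "/.singularity.d/libs"


def diagnose_cuda(env=None, loader=ctypes.CDLL):
    """Report whether libcuda.so.1 loads and whether the --nv dir is visible.

    `env` and `loader` are parameters only so the check can be exercised
    without a GPU; by default the real environment and ctypes.CDLL are used.
    """
    env = os.environ if env is None else env
    search_path = [p for p in env.get("LD_LIBRARY_PATH", "").split(":") if p]
    try:
        loader("libcuda.so.1")
        loadable = True
    except OSError:
        # Same failure mode as the ImportError in the TensorFlow traceback.
        loadable = False
    return {
        "libcuda_loadable": loadable,
        "nv_dir_on_ld_library_path": NV_LIB_DIR in search_path,
        "nv_dir_exists": os.path.isdir(NV_LIB_DIR),
    }


if __name__ == "__main__":
    for key, value in sorted(diagnose_cuda().items()):
        print("%s: %s" % (key, value))
```

If the directory exists in the ssh session but is missing from LD_LIBRARY_PATH, the problem would be the environment rather than the bind mounts; if the directory is absent, --nv was probably not applied when entering the container.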
