
Re: [HTCondor-users] condor_ssh_to_job inside containers with GPU support



Hi Oliver,

/proc/PID/environ might work - however, as far as I know it only contains the initial environment of the process. That might be a problem if the process updates its environment later on, so that it diverges from what a sideloaded condor_ssh_to_job picks up.

Cheers,
  Thomas

On 22/04/2020 21.06, Oliver Freyermuth wrote:
Hi Kenyi,

On 22.04.20 at 20:50, Gregory Thain wrote:
On 4/22/20 10:07 AM, Kenyi Hurtado Anampa wrote:

Hello,

We are submitting HTCondor jobs that use Singularity containers. The startds use the --nv feature to bring GPU support inside the containers for Machine Learning applications:

SINGULARITY_EXTRA_ARGUMENTS = --nv
SINGULARITY_JOB = !isUndefined(TARGET.SingularityImage)
SINGULARITY_IMAGE_EXPR = TARGET.SingularityImage

This works great; however, when we use condor_ssh_to_job, we lose the environment related to libcuda (which is what --nv sets up), see [1]. Could it be that condor does not use --nv when entering the container?


Hi Kenyi:


When condor_ssh_to_job lands on a singularity job, it ends up calling /usr/bin/nsenter to enter the container. This is because singularity provides no good way for a random process to enter an existing container using just the singularity tools. nsenter enters the mount namespace of the singularity container, which is what I thought --nv set up.
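For illustration, a bare namespace entry looks roughly like this (12345 is a placeholder for the PID of a process inside the container; the exact flags HTCondor passes may differ):

      # Join the mount and PID namespaces of a process running inside the
      # Singularity container and start a shell there - note that this shell
      # gets nsenter's environment, not the container process's environment:
      nsenter --target 12345 --mount --pid /bin/sh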

The namespaces are not sufficient in this case, sadly. Here's the catch: you inherit the namespaces, but you lose the shell environment.
I'm not sure what the best way is to work around this programmatically in HTCondor. One way might be to steal /proc/PID/environ from a process in the container and set that as the environment when firing up nsenter (see the sketch below).
Greg, what do you think?
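To make that idea a bit more concrete, a minimal bash sketch (12345 is again a placeholder PID; a proper implementation inside HTCondor would of course parse environ itself rather than shelling out):

      PID=12345
      # /proc/PID/environ is NUL-separated; read it into an array ...
      mapfile -d '' -t JOBENV < "/proc/$PID/environ"
      # ... and start a shell in the job's namespaces with exactly that
      # (initial) environment:
      nsenter --target "$PID" --mount --pid -- env -i "${JOBENV[@]}" /bin/bash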

Here is what we are doing at the moment with the most recent HTCondor 8.8.8, which works around all known issues related to the environment, the missing PTY, and the CUDA library setup
(this needs to be done *after* condor_ssh_to_job):

      # Work around missing PTY
      script /dev/null
      # Re-set home directory (of course, this needs to be adapted):
      export HOME=/jwd
      # Re-source /etc/profile:
      source /etc/profile
      # Fixup TERM
      export TERM=linux
      # And here's the magic trick for CUDA:
      export LD_LIBRARY_PATH=/.singularity.d/libs/

The explanation for the last line is: Singularity binds the CUDA libraries (and also other things such as AMD libraries, if you use those) to
/.singularity.d/libs/ inside the container. Hence, you need to adjust LD_LIBRARY_PATH to find them. That's injected into the environment
when Singularity starts up processes inside the container, but nsenter has no way of extracting it.
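
Inside a condor_ssh_to_job session you can easily check that the bind mount is there (the exact library names depend on the driver on the host):

      # The --nv bind mount contains the host's driver libraries,
      # e.g. libcuda.so and friends:
      ls /.singularity.d/libs/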

Now I guess you are asking as a cluster admin, and not as a user, right?
As a cluster admin, what we did was to put all of the "export" fixups into a file in /etc/profile.d/ (a sketch of such a file follows below). This means our users only need to do:
  script /dev/null
  source /etc/profile
and they are good to go.
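
For reference, a minimal sketch of what such a file could look like (the file name and the check for /.singularity.d are just one possible way to do it, not something HTCondor mandates):

  # /etc/profile.d/condor_ssh_to_job.sh (the name is a site-specific choice)
  # Apply the fixups only when running inside a Singularity container:
  if [ -d /.singularity.d ]; then
      export HOME=/jwd                  # adapt to your site's job directory
      export TERM=linux
      export LD_LIBRARY_PATH=/.singularity.d/libs/
  fi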

Cheers,
	Oliver



-greg


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

