
Re: [HTCondor-users] condor_ssh_to_job inside containers with GPU support



Hi Oliver,

/proc/PID/environ might work - however, as far as I know it only contains the initial environment of the process. That might be a problem if the process updates its environment later on, so that it diverges from what a sideloaded condor_ssh_to_job picks up.

Cheers,
  Thomas

On 22/04/2020 21.06, Oliver Freyermuth wrote:
Hi Kenyi,

On 22.04.20 at 20:50, Gregory Thain wrote:
On 4/22/20 10:07 AM, Kenyi Hurtado Anampa wrote:

Hello,

We are submitting HTCondor jobs that use Singularity containers. The startds use the --nv feature to bring GPU support inside the containers for Machine Learning applications:

SINGULARITY_EXTRA_ARGUMENTS = --nv
SINGULARITY_JOB = !isUndefined(TARGET.SingularityImage)
SINGULARITY_IMAGE_EXPR = TARGET.SingularityImage

This works great; however, when we use condor_ssh_to_job, we lose the environment related to libcuda (which is what --nv sets up), see [1]. Could it be that condor does not use --nv when entering the container?


Hi Kenyi:


When condor_ssh_to_job lands on a singularity job, it ends up calling /usr/bin/nsenter to enter the container. This is because singularity provides no good way for a random process to enter an existing container using just the singularity tools. nsenter enters the mount namespace of the singularity container, which is what I thought --nv set up.
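For illustration, a bare namespace entry looks roughly like this (12345 is a placeholder for the PID of a process inside the container; the exact flags HTCondor passes may differ):

      # Join the mount and PID namespaces of a process running inside the
      # Singularity container and start a shell there - note that this shell
      # gets nsenter's environment, not the container process's environment:
      nsenter --target 12345 --mount --pid /bin/sh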

The namespaces are not sufficient in this case, sadly. Here's the catch: you inherit the namespaces, but you lose the shell environment.
I'm not sure what the best way is to work around this programmatically in HTCondor. One way might be to steal /proc/PID/environ from a process in the container and set that as the environment when firing up nsenter (see the sketch below).
Greg, what do you think?
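To make that idea a bit more concrete, a minimal bash sketch (12345 is again a placeholder PID; a proper implementation inside HTCondor would of course parse environ itself rather than shelling out):

      PID=12345
      # /proc/PID/environ is NUL-separated; read it into an array ...
      mapfile -d '' -t JOBENV < "/proc/$PID/environ"
      # ... and start a shell in the job's namespaces with exactly that
      # (initial) environment:
      nsenter --target "$PID" --mount --pid -- env -i "${JOBENV[@]}" /bin/bash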

Here is what we are doing at the moment with the most recent HTCondor 8.8.8, which works around all known issues related to the environment, the missing PTY, and the CUDA library setup
(this needs to be done *after* condor_ssh_to_job):

      # Work around missing PTY
      script /dev/null
      # Re-set home directory (of course, this needs to be adapted):
      export HOME=/jwd
      # Re-source /etc/profile:
      source /etc/profile
      # Fixup TERM
      export TERM=linux
      # And here's the magic trick for CUDA:
      export LD_LIBRARY_PATH=/.singularity.d/libs/

The explanation for the last line is: Singularity binds the CUDA libraries (and also other things such as AMD libraries, if you use those) to
/.singularity.d/libs/ inside the container. Hence, you need to adjust LD_LIBRARY_PATH to find them. That's injected into the environment
when Singularity starts up processes inside the container, but nsenter has no way of extracting it.
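
Inside a condor_ssh_to_job session you can easily check that the bind mount is there (the exact library names depend on the driver on the host):

      # The --nv bind mount contains the host's driver libraries,
      # e.g. libcuda.so and friends:
      ls /.singularity.d/libs/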

Now I guess you are asking as a cluster admin, and not as a user, right?
As a cluster admin, what we did was to put all of the "export" fixups into a file in /etc/profile.d/ (a sketch of such a file follows below). This means our users only need to do:
  script /dev/null
  source /etc/profile
and they are good to go.
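
For reference, a minimal sketch of what such a file could look like (the file name and the check for /.singularity.d are just one possible way to do it, not something HTCondor mandates):

  # /etc/profile.d/condor_ssh_to_job.sh (the name is a site-specific choice)
  # Apply the fixups only when running inside a Singularity container:
  if [ -d /.singularity.d ]; then
      export HOME=/jwd                  # adapt to your site's job directory
      export TERM=linux
      export LD_LIBRARY_PATH=/.singularity.d/libs/
  fi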

Cheers,
	Oliver



-greg


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

