
Re: [HTCondor-users] condor_ssh_to_job inside containers with GPU support



Hi Kenyi,

On 22.04.20 at 20:50, Gregory Thain wrote:
> On 4/22/20 10:07 AM, Kenyi Hurtado Anampa wrote:
> 
>> Hello,
>>
>> We are submitting condor jobs that use singularity containers. The startds use the --nv feature, in order to bring GPU support inside the containers for Machine Learning applications:
>>
>> SINGULARITY_EXTRA_ARGUMENTS = --nv
>> SINGULARITY_JOB = !isUndefined(TARGET.SingularityImage)
>> SINGULARITY_IMAGE_EXPR = TARGET.SingularityImage
>>
>> This works great, however, when we use condor_ssh_to_job, we lose the environment related to libcuda (what --nv does), see [1]. Could it be that condor does not use --nv when entering the container?
> 
> 
> Hi Kenyi:
> 
> 
> When condor_ssh_to_job lands on a singularity job, it ends up calling /usr/bin/nsenter to enter the container. This is because singularity provides no good way for a random process to enter another container using just the singularity tools. nsenter enters the mount namespace of the singularity container, which is what I thought --nv set up.

the namespaces are not sufficient in this case, sadly. Here's the catch: you inherit the namespaces, but you lose the shell environment.
I'm not sure what the best way to work around this programmatically in HTCondor would be. One way might be to steal /proc/PID/environ from a process in the container and set that as the environment when firing up nsenter.
Greg, what do you think?
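
A rough sketch of that idea (purely illustrative, not what HTCondor does today; CONTAINER_PID is a placeholder for a process known to run inside the container, and this typically needs root, just like the nsenter call itself):

     # Read the NUL-separated environment of a process inside the container
     # into an array of KEY=VALUE pairs (bash 4.4+ for mapfile -d '').
     CONTAINER_PID=12345
     mapfile -d '' -t CONTAINER_ENV < "/proc/${CONTAINER_PID}/environ"
     # Start a shell in the container's mount namespace with exactly that
     # environment instead of the host shell's one.
     env -i "${CONTAINER_ENV[@]}" \
         /usr/bin/nsenter --target "${CONTAINER_PID}" --mount -- /bin/bash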

Here is what we are doing at the moment with the most recent HTCondor 8.8.8; it works around all known issues related to the environment, the missing PTY, and the CUDA library setup
(this needs to be done *after* condor_ssh_to_job):

     # Work around missing PTY
     script /dev/null
     # Re-set home directory (of course, this needs to be adapted):
     export HOME=/jwd
     # Re-source /etc/profile:
     source /etc/profile
     # Fixup TERM
     export TERM=linux
     # And here's the magic trick for CUDA: 
     export LD_LIBRARY_PATH=/.singularity.d/libs/

The explanation for the last line is: singularity binds the CUDA libraries (and also other things, such as AMD libraries if you use those) to
/.singularity.d/libs/ inside the container, so you need to adjust LD_LIBRARY_PATH to find them. That path is injected into the environment
when singularity starts up processes inside the container, but nsenter has no way to extract it.
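
A quick way to check that this worked (just an illustrative test; it assumes libcuda.so.1 is among the libraries bound by --nv and that python3 exists in the image):

     # The bound host libraries should show up here:
     ls /.singularity.d/libs/
     # After the export above, dlopen should find libcuda via LD_LIBRARY_PATH:
     python3 -c 'import ctypes; ctypes.CDLL("libcuda.so.1"); print("libcuda.so.1 found")'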

Now I guess you are asking as a cluster admin, and not as a user, right?
As cluster admins, what we did was to put all the "export" fixups into /etc/profile.d/somefile (sketch below). This means our users only need to do:
     script /dev/null
     source /etc/profile
and they are good to go.
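
As an illustration, such a profile.d snippet could look roughly like this (file name, HOME, and the container check are just examples and need to be adapted):

     # /etc/profile.d/singularity-ssh-fixup.sh (example name)
     # Apply the fixups only when we are inside a singularity container,
     # detected here via the /.singularity.d directory.
     if [ -d /.singularity.d ]; then
         export HOME=/jwd
         export TERM=linux
         export LD_LIBRARY_PATH=/.singularity.d/libs/
     fi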

Cheers,
	Oliver


> 
> -greg
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/


-- 
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--
