
Re: [HTCondor-users] condor_ssh_to_job inside containers with GPU support



Hi Thomas,

On 23.04.20 12:23, Thomas Hartmann wrote:
> Hi Oliver,
> 
> /proc/PID/environ might work - however, it just contains the initial environment of the process, as far as I know. That might be a problem if the process updates its environment later on and it diverges from a side-loaded condor_ssh_to_job session.

indeed, thanks for pointing that out!
I guess that using the "environ" of the first child of the process spawned by the starter should work fine in this use case, since that process is essentially
the job executable (which is a shell executing a "sleep" loop for the interactive job), so this would be the "inside the container" environment.
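
For illustration, that environment can be inspected from the host with a one-liner (a sketch; $JOB_PID is a placeholder for the PID of that first child, found e.g. via ps --forest or condor_who):

     # /proc/PID/environ holds the environment as NUL-separated entries;
     # print it one variable per line:
     tr '\0' '\n' < /proc/$JOB_PID/environ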

So an idea would be to adapt the HTCondor code that fires up nsenter, so that it steals the "environ" from that process and uses it here:
 https://github.com/htcondor/htcondor/blob/422cf6cb36de091c7624d3b69a0834124efaaba6/src/condor_starter.V6.1/os_proc.cpp#L1203
instead of the starter's own environment.
The nice thing about this approach is that it would work with any container runtime out there (not just Singularity). 
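
A rough shell sketch of the mechanism (untested; $JOB_PID is again a placeholder, the exact set of nsenter flags is illustrative, and this needs root or equivalent privileges — the real change would of course live in the C++ starter code linked above):

     # Read the NUL-separated environment of the in-container process
     # into a bash array (requires bash >= 4.4; handles values with whitespace):
     mapfile -d '' -t JOBENV < /proc/$JOB_PID/environ
     # Enter the job's namespaces and start a shell with exactly that environment:
     nsenter -t $JOB_PID -m -u -i -p env -i "${JOBENV[@]}" /bin/bash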

@Greg, what do you think?

Cheers,
	Oliver

> 
> Cheers,
>  Thomas
> 
> On 22/04/2020 21.06, Oliver Freyermuth wrote:
>> Hi Kenyi,
>>
>>> On 22.04.20 20:50, Gregory Thain wrote:
>>> On 4/22/20 10:07 AM, Kenyi Hurtado Anampa wrote:
>>>
>>>> Hello,
>>>>
>>>> We are submitting condor jobs that use Singularity containers. The startds use the --nv feature in order to bring GPU support inside the containers for machine learning applications:
>>>>
>>>> SINGULARITY_EXTRA_ARGUMENTS = --nv
>>>> SINGULARITY_JOB = !isUndefined(TARGET.SingularityImage)
>>>> SINGULARITY_IMAGE_EXPR = TARGET.SingularityImage
>>>>
>>>> This works great; however, when we use condor_ssh_to_job, we lose the environment related to libcuda (which is what --nv sets up), see [1]. Could it be that condor does not use --nv when entering the container?
>>>
>>>
>>> Hi Kenyi:
>>>
>>>
>>> When condor_ssh_to_job lands on a singularity job, it ends up calling /usr/bin/nsenter to enter the container. This is because singularity provides no good way for a random process to enter another container using just the singularity tools. nsenter enters the mount namespace of the singularity container, which is what I thought --nv set up.
>>
>> the namespaces are not sufficient in this case, sadly. Here's the catch: you inherit the namespaces, but you lose the shell environment.
>> I'm not sure what the best way to work around this programmatically in HTCondor would be. One way might be to steal /proc/PID/environ from a process in the container and set that as the environment when firing up nsenter.
>> Greg, what do you think?
>>
>> Here is what we are doing at the moment with the most recent HTCondor 8.8.8, which works around all known issues related to the environment, the missing PTY, and the CUDA library setup
>> (this needs to be done *after* condor_ssh_to_job):
>>
>>      # Work around missing PTY
>>      script /dev/null
>>      # Re-set home directory (of course, this needs to be adapted):
>>      export HOME=/jwd
>>      # Re-source /etc/profile:
>>      source /etc/profile
>>      # Fixup TERM
>>      export TERM=linux
>>      # And here's the magic trick for CUDA:
>>      export LD_LIBRARY_PATH=/.singularity.d/libs/
>>
>> The explanation for the last line is: Singularity binds the CUDA libraries (and also other things, such as AMD libraries, if you use those) to
>> /.singularity.d/libs/ inside the container. Hence, you need to adjust LD_LIBRARY_PATH to find them. That path is injected into the environment
>> when singularity starts up processes inside the container, but nsenter has no way to extract it.
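>>
>> A quick sanity check from inside the container (illustrative, not part of the fix): listing that directory should show libcuda.so and the other bound-in libraries:
>>
>>      ls /.singularity.d/libs/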
>>
>> Now I guess you are asking as a cluster admin, and not as a user, right?
>> As a cluster admin, what we did was to put all the "export" fixups into /etc/profile.d/somefile. This means our users only need to do:
>>  script /dev/null
>>  source /etc/profile
>> and they are good to go.
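>>
>> For reference, a minimal sketch of what such a file could contain (the filename is hypothetical; the values are the ones from above and need adapting to your site):
>>
>>      # /etc/profile.d/zz-condor-ssh-to-job.sh
>>      export HOME=/jwd
>>      export TERM=linux
>>      export LD_LIBRARY_PATH=/.singularity.d/libs/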
>>
>> Cheers,
>>     Oliver
>>
>>
>>>
>>> -greg
> 


-- 
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--