
Re: [HTCondor-users] condor_ssh_to_job broken with 8.8 on CentOS 7



On Wed, 2019-02-27 at 09:57:12 +0100, Oliver Freyermuth wrote:
> Good morning,
> 
> On 27.02.19 at 09:43, Steffen Grunewald wrote:
> >> - The argument "-a" to nsenter not being present on CentOS 7
> > 
> > also not in Debian Stretch (and now that I check it with Jessie,
> > it hasn't been there either - why did nobody notice?)
> 
> It's present in Ubuntu 18.04 LTS, maybe all container users (apart from us) have that on their servers,
> or have not upgraded yet. 

I have yet to find a man page that lists -a as a valid option;
https://www.systutorials.com/docs/linux/man/1-nsenter/ doesn't have it.
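Rather than relying on online man pages, one could check what a given host's nsenter accepts by scanning the output of "nsenter --help". The following is only a sketch of that idea: the helper name help_lists_option is made up for illustration, and it assumes the usual util-linux help layout in which each option line starts with the short flag (for example " -a, --all ..."). Capturing the help text from the command is left out.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper (not part of HTCondor or util-linux).
// Returns true if some line of help_text begins, after leading blanks,
// with the short option `flag` followed by ',', whitespace, or end of line.
// Assumes the usual "  -a, --all   description" help layout.
bool help_lists_option(const std::string& help_text, const std::string& flag) {
    assert(!flag.empty());
    std::istringstream lines(help_text);
    std::string line;
    while (std::getline(lines, line)) {
        const std::size_t start = line.find_first_not_of(" \t");
        if (start == std::string::npos) {
            continue;  // blank line
        }
        if (line.compare(start, flag.size(), flag) != 0) {
            continue;  // line does not start with the flag
        }
        const std::size_t next = start + flag.size();
        if (next >= line.size() || line[next] == ',' ||
            line[next] == ' ' || line[next] == '\t') {
            return true;  // exact short option, not a longer word
        }
    }
    return false;
}
```

Feeding it the captured help text of the local nsenter together with "-a" would then tell whether the shortcut is available before it is put on the command line.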

The problem was introduced with 8.8.0, as I could find by comparing
src/condor_starter.V6.1/os_proc.cpp of 8.6 and 8.8 versions.
Since some of the biggest Condor users refuse to run x.y.0 releases, this wasn't
discovered earlier...

> > I'm lacking a test case at the moment, but I'm fearing the worst.
> > Still nobody has attempted to run containers, it seems (or they failed
> > and failed to report it?)
> 
> I think Greg is correct and "-U" only fails with setuid root Singularity
> (the code "only" affects Singularity users in any case). Probably "-a" would do the correct thing,
> since Singularity with setuid root does not create a new user namespace so there's no need to attach. 

According to both the man page referred to above and the Debian one,
replacing "-a" with "-m -u -i -n -p -U" should do the trick. Whether
all of those are needed remains to be discussed, but if "-a" is
supposed to work in the given context, then its "full expansion"
should work as well.

Here's the diff:

condor-8.8.1# diff -u src/condor_starter.V6.1/os_proc.cpp{.ORIG,}
--- src/condor_starter.V6.1/os_proc.cpp.ORIG    2019-02-19 05:08:49.000000000 +0100
+++ src/condor_starter.V6.1/os_proc.cpp 2019-02-27 10:09:43.513715435 +0100
@@ -1106,7 +1106,13 @@
        }
        ArgList args;
        args.AppendArg("/usr/bin/nsenter");
-       args.AppendArg("-a"); // all namespaces
+       //args.AppendArg("-a"); // all namespaces ("-a" is unknown to older nsenter)
+       args.AppendArg("-m");
+       args.AppendArg("-u");
+       args.AppendArg("-i");
+       args.AppendArg("-n");
+       args.AppendArg("-p");
+       args.AppendArg("-U");
        args.AppendArg("-t"); // target pid
        char buf[32];
        sprintf(buf,"%d", pid);
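For anyone who wants to look at the resulting command line without rebuilding Condor, here is a standalone sketch of the same argument assembly. It is not the code from os_proc.cpp: it uses std::vector<std::string> in place of HTCondor's ArgList, std::to_string in place of the sprintf into buf, and the function name build_nsenter_args is made up. Whatever the real file appends after the pid is not shown in the diff and is not reproduced here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the patched block: builds the nsenter
// argument vector with "-a" expanded into explicit namespace flags.
std::vector<std::string> build_nsenter_args(int pid) {
    assert(pid > 0);  // a target pid must be a positive process id

    // Explicit replacement for "-a": mount, UTS, IPC, network, PID, user.
    static const char* const kNamespaceFlags[] = {
        "-m", "-u", "-i", "-n", "-p", "-U"
    };

    std::vector<std::string> args;
    args.emplace_back("/usr/bin/nsenter");
    for (const char* flag : kNamespaceFlags) {
        args.emplace_back(flag);
    }
    args.emplace_back("-t");                 // target pid follows
    args.emplace_back(std::to_string(pid));
    return args;
}
```

Joining the returned vector with spaces for a pid of 1234 gives "/usr/bin/nsenter -m -u -i -n -p -U -t 1234", which can be tried by hand on CentOS 7 or Debian to see whether every flag is accepted there.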

Greg, did I overlook something?

I'll make this a Debian patch, and rebuild, if there's no veto...

> I'll try to find out the best working combination in the next days. Potentially, disabling the automatic killing of the "sleep" job
> before nsenter attaches, wrapping nsenter correctly for CentOS 7 and setting some environment variables to have a well-defined PATH and working
> bash initialization when attaching could work around all discovered issues. 

Since adding the reaper seems to be the second change made to the same source file,
perhaps we can learn about the rationale behind that?

> At least, now our users are out of the game and there's less stress from people flooding me with mails since their interactive jobs don't start,
> and we have a good way to test out such things before rolling them out to the full cluster ;-). 

Thanks (to you and them) for being the guinea pigs...

- S


-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~