
Re: [HTCondor-users] condor_ssh_to_job broken with 8.8 on CentOS 7



Dear Greg,

On 26.02.19 at 18:18, Greg Thain wrote:
On 2/26/19 11:09 AM, Oliver Freyermuth wrote:
Dear HTCondor experts, dear Greg,

trying a dirty hack to replace "-a" with "-m -u -i -n -p -U" still makes things fail miserably,
since Singularity has somehow already exited when nsenter is called:


How has Singularity exited? It should still be running the job at that time?

I'm also rather stupefied by this.
Here's what I see in process tree snapshots taken every 10 milliseconds.

First, all is well:

freyermu 18402  2.0  0.0  20000   832 ?        SNs  18:22   0:00          \_ /usr/libexec/singularity/bin/action-suid /bin/sleep 180
freyermu 18414  0.0  0.0  27288   856 ?        SN   18:22   0:00              \_ shim-init                                /bin/sleep 180
freyermu 18415  0.0  0.0   4116   312 ?        SN   18:22   0:00                  \_ /bin/sleep 180

Then, condor_ssh_to_job is prepared:
freyermu 18402  2.0  0.0  20000   832 ?        SNs  18:22   0:00          \_ /usr/libexec/singularity/bin/action-suid /bin/sleep 180
freyermu 18414  0.0  0.0  27288   856 ?        SN   18:22   0:00          |   \_ shim-init                                /bin/sleep 180
freyermu 18415  0.0  0.0   4116   312 ?        SN   18:22   0:00          |       \_ /bin/sleep 180
freyermu 18503  0.0  0.0  21980  1536 ?        S    18:23   0:00          \_ /bin/sh /usr/libexec/condor/condor_ssh_to_job_sshd_setup /pool/condor/dir_18316 /usr/libexec/condor/condor_ssh_to_job_shell_setup /etc/condor/condor_ssh_to_job_sshd_config_template "/usr/bin/ssh-keygen" "-N" "" "-C" "" "-q" "-f" "%f" "-t" "rsa"

Finally, SSH is started outside of the container:
freyermu 18402  1.0  0.0  20000   832 ?        SNs  18:22   0:00          \_ /usr/libexec/singularity/bin/action-suid /bin/sleep 180
freyermu 18414  0.0  0.0  27288   856 ?        SN   18:22   0:00          |   \_ shim-init                                /bin/sleep 180
freyermu 18415  0.0  0.0   4116   312 ?        SN   18:22   0:00          |       \_ /bin/sleep 180
freyermu 18518 22.0  0.0 125228  4616 ?        SNs  18:23   0:00          \_ sshd: freyermu [priv]

And then, I see this:
root     18544  0.0  0.0 112728   976 pts/0    S+   18:23   0:00          \_ grep --color=auto freyermu
freyermu 18402  1.0  0.0      0     0 ?        ZNs  18:22   0:00          \_ [action-suid] <defunct>
freyermu 18518 23.0  0.0 125228  4676 ?        SNs  18:23   0:00          \_ sshd: freyermu [priv]
freyermu 18539  0.0  0.0 125228  1796 ?        SN   18:23   0:00              \_ sshd: freyermu@pts/2
freyermu 18540  0.0  0.0  56000  4584 pts/2    SNs+ 18:23   0:00                  \_ /usr/bin/condor_docker_enter
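
For reference, snapshots like the ones above can be collected with a small polling loop. This is only a sketch: snapshot_tree is a hypothetical helper name, the output shape above is assumed to come from "ps auxf" filtered by user name, and fractional "sleep" and "date +%N" assume GNU coreutils. Since ps itself takes time to run, the real spacing between snapshots will be somewhat longer than the requested interval.

```sh
#!/bin/sh
# snapshot_tree USER [COUNT] [INTERVAL]
# Print COUNT process-tree snapshots for USER, INTERVAL seconds apart.
# Hypothetical helper, not part of HTCondor.
snapshot_tree() {
    user_name=$1
    count=${2:-100}
    interval=${3:-0.01}
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "=== snapshot $i at $(date +%T.%N) ==="
        # "ps auxf" prints the process forest; grep keeps the user's lines.
        # "|| true" keeps the loop going when nothing matches.
        ps auxf | grep -- "$user_name" || true
        sleep "$interval"
        i=$((i + 1))
    done
}
```

Running something like "snapshot_tree freyermu 500 0.01 > snapshots.txt" on the worker node right before calling condor_ssh_to_job should capture the transition.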

In the logs, I only find:

Feb 26 18:23:01 wn022 condor_starter[18316]: Create_Process succeeded, pid=18518
Feb 26 18:23:01 wn022 condor_starter[18316]: Limiting (soft) memory usage to 0 bytes
Feb 26 18:23:01 wn022 condor_starter[18316]: Limiting memsw usage to 9223372036854775807 bytes
Feb 26 18:23:01 wn022 condor_starter[18316]: Limiting (hard) memory usage to 104857600 bytes
Feb 26 18:23:01 wn022 condor_starter[18316]: Limiting memsw usage to 267144892416 bytes
Feb 26 18:23:01 wn022 condor_starter[18316]: Process exited, pid=18503, status=0
Feb 26 18:23:01 wn022 condor_starter[18316]: unhandled job exit: pid=18503, status=0
Feb 26 18:23:01 wn022 condor_starter[18316]: Accepted new connection from ssh client for container job
Feb 26 18:23:01 wn022 condor_starter[18316]: singularity enter_ns returned pid 18546
Feb 26 18:23:01 wn022 condor_starter[18316]: Process exited, pid=18402, status=255
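
To line the exits up with the process snapshots, the "Process exited" events can be pulled out of the starter messages. A minimal sketch, assuming the line format shown above; list_exits is a hypothetical helper name and reads from the given files or from stdin:

```sh
#!/bin/sh
# list_exits [FILE...]
# Print "PID STATUS" for every "Process exited" line in the starter log.
# Hypothetical helper; the pattern is taken from the log lines quoted above.
list_exits() {
    sed -n 's/.*Process exited, pid=\([0-9]*\), status=\([0-9]*\).*/\1 \2/p' "$@"
}
```

Applied to the excerpt above, this lists pid 18503 (the sshd setup script) with status 0 and pid 18402 (action-suid) with status 255.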


Checking /usr/libexec/condor/condor_ssh_to_job_shell_setup, though, I find the code:

# kill the dummy sleep job if this is an interactive job
if grep -q '^InteractiveJob = true' "${_CONDOR_SCRATCH_DIR}/.job.ad"; then
  if [ "${_CONDOR_JOB_PIDS}" != "" ]; then
    kill "${_CONDOR_JOB_PIDS}" 2>/dev/null
    _CONDOR_JOB_PIDS=""
  fi
fi

So this probably only fails for interactive jobs, since the dummy sleep is killed before we attach, and Singularity exits along with it?
I can't test with a batch job right now, since I am already in the middle of the downgrade (and we still lack a proper test setup), but I'll try.
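
To tell in advance whether a given job would take that branch, the same check can be run against the job ad in the scratch directory. A sketch only: is_interactive_job is a hypothetical helper name, and the grep pattern is copied from the script above.

```sh
#!/bin/sh
# is_interactive_job JOB_AD_FILE
# Return 0 if the job ad marks the job as interactive, using the same
# pattern as condor_ssh_to_job_shell_setup. Hypothetical helper.
is_interactive_job() {
    grep -q '^InteractiveJob = true' "$1"
}
```

For example, 'is_interactive_job "${_CONDOR_SCRATCH_DIR}/.job.ad" && echo "dummy sleep will be killed"' would flag the affected jobs.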

Cheers,
	Oliver




-greg



--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--
