
[HTCondor-users] condor_ssh_to_job to a flocked job



Hi all,
We have two HTCondor pools and flock jobs from one to the other. The submit node runs HTCondor 9.1.2, while the worker nodes we flock to run 9.0.13. I am trying condor_ssh_to_job to a running flocked job in the other pool. The jobs run inside a Docker container as user nobody.
When I run condor_ssh_to_job as root on the submit machine, it works fine and I end up inside the Docker container, independent of who submitted the job.
When an ordinary user tries to ssh into a flocked job, they get "Failed to connect to starter" after a while. condor_ssh_to_job works fine within the pool the job was submitted from.
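For reference, this is roughly how we invoke it (a sketch; the job ID is illustrative, and -debug just turns on the tool's client-side logging, which also shows the security negotiation with the remote daemons):

  # run as the submitting user on the submit node
  condor_ssh_to_job -debug 1540489.0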

I looked at the StarterLog (see below), and it seems the starter gets stuck for ordinary users: after "Created security session for job owner", it keeps querying Docker regularly, but nothing else happens. When root runs condor_ssh_to_job, the starter runs a "docker exec -it ..." right after "Created security session for job owner".
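To get more detail from the starter at that point, one option (a sketch using standard HTCondor config knobs) would be to raise the starter's debug level on the execute nodes we flock to and reproduce the hang:

  # in the execute node's HTCondor configuration, followed by condor_reconfig
  STARTER_DEBUG = D_SECURITY D_FULLDEBUG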

Could this be a problem with authentication? I did not find any security message in the logs that looks problematic.
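One way to probe that theory (a sketch; the pool and machine names below are placeholders) would be to check, as the ordinary user on the submit node, whether the relevant authorization levels succeed against the remote startd:

  condor_ping -verbose -pool remotepool.example.org \
      -name slot1@worker.example.org -type STARTD READ WRITE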

Best regards,
Matthias


Ordinary user (mschnepf):

08/09/22 15:53:03 (pid:3549512) Created security session for job owner (mschnepf@xxxxxxxxxxx).
08/09/22 15:53:06 (pid:3549512) condor_read(): Socket closed when trying to read 1 bytes from Docker Socket
08/09/22 15:53:06 (pid:3549512) sendDockerAPIRequest(GET /containers/HTCJob1540489_0_slot1_2_PID3549512/stats?stream=0 HTTP/1.0

) = HTTP/1.0 200 OK
Api-Version: 1.41
Content-Type: application/json
Docker-Experimental: false
Ostype: linux
Server: Docker/20.10.12 (linux)
Date: Tue, 09 Aug 2022 13:53:06 GMT

{"read":"2022-08-09T13:53:06.189708447Z","preread":"2022-08-09T13:53:05.184780604Z","pids_stats":{"current":2},"blkio_stats":{"io_service_bytes_recursive":[{"major":259,"minor":2,"op":"Read","value":3321856},{"major":259,"minor":2,"op":"Write","value":0},{"major":259,"minor":2,"op":"Sync","value":0},{"major":259,"m
inor":2,"op":"Async","value":3321856},{"major":259,"minor":2,"op":"Total","value":3321856},{"major":253,"minor":2,"op":"Read","value":3321856},{"major":253,"minor":2,"op":"Write","value":0},{"major":253,"minor":2,"op":"Sync","value":0},{"major":253,"minor":2,"op":"Async","value":3321856},{"major":253,"minor":2,"op"
:"Total","value":3321856}],"io_serviced_recursive":[{"major":259,"minor":2,"op":"Read","value":70},{"major":259,"minor":2,"op":"Write","value":0},{"major":259,"minor":2,"op":"Sync","value":0},{"major":259,"minor":2,"op":"Async","value":70},{"major":259,"minor":2,"op":"Total","value":70},{"major":253,"minor":2,"op":
"Read","value":70},{"major":253,"minor":2,"op":"Write","value":0},{"major":253,"minor":2,"op":"Sync","value":0},{"major":253,"minor":2,"op":"Async","value":70},{"major":253,"minor":2,"op":"Total","value":70}],"io_queue_recursive":[],"io_service_time_recursive":[],"io_wait_time_recursive":[],"io_merged_recursive":[]
,"io_time_recursive":[],"sectors_recursive":[]},"num_procs":0,"storage_stats":{},"cpu_stats":{"cpu_usage":{"total_usage":95595265,"percpu_usage":[0,1167765,636512,2089669,0,603600,0,0,0,0,0,0,120723,3280308,2509666,885261,114234,2095969,223605,108435,153023,0,145928,3664141,0,68348505,1848270,0,0,0,0,0,0,0,0,0,0,58
24522,475131,1218407,81591,0,0,0,0,0,0,0],"usage_in_kernelmode":40000000,"usage_in_usermode":50000000},"system_cpu_usage":54830782350000000,"online_cpus":48,"throttling_data":{"periods":0,"throttled_periods":0,"throttled_time":0}},"precpu_stats":{"cpu_usage":{"total_usage":95595265,"percpu_usage":[0,1167765,636512,
2089669,0,603600,0,0,0,0,0,0,120723,3280308,2509666,885261,114234,2095969,223605,108435,153023,0,145928,3664141,0,68348505,1848270,0,0,0,0,0,0,0,0,0,0,5824522,475131,1218407,81591,0,0,0,0,0,0,0],"usage_in_kernelmode":40000000,"usage_in_usermode":50000000},"system_cpu_usage":54830734180000000,"online_cpus":48,"throt
tling_data":{"periods":0,"throttled_periods":0,"throttled_time":0}},"memory_stats":{"usage":3321856,"max_usage":10293248,"stats":{"active_anon":286720,"active_file":823296,"cache":3035136,"dirty":0,"hierarchical_memory_limit":3145728000,"hierarchical_memsw_limit":6291456000,"inactive_anon":0,"inactive_file":2211840
,"mapped_file":1232896,"pgfault":3968,"pgmajfault":28,"pgpgin":1868,"pgpgout":1057,"rss":286720,"rss_huge":0,"total_active_anon":286720,"total_active_file":823296,"total_cache":3035136,"total_dirty":0,"total_inactive_anon":0,"total_inactive_file":2211840,"total_mapped_file":1232896,"total_pgfault":0,"total_pgmajfau
lt":0,"total_pgpgin":0,"total_pgpgout":0,"total_rss":286720,"total_rss_huge":0,"total_unevictable":0,"total_writeback":0,"unevictable":0,"writeback":0},"limit":3145728000},"name":"/HTCJob1540489_0_slot1_2_PID3549512","id":"386bfb25118e13fe30ae1e629705cb64903e866138a8fc6e756b063e388cf183","networks":{"eth0":{"rx_byt
es":746,"rx_packets":7,"rx_errors":0,"rx_dropped":0,"tx_bytes":656,"tx_packets":8,"tx_errors":0,"tx_dropped":0}}}

08/09/22 15:53:06 (pid:3549512) docker stats reports max_usage is 286720 rx_bytes is 746 tx_bytes is 656 usage_in_usermode is 50000000 usage_in-sysmode is 40000000
08/09/22 15:53:12 (pid:3549512) condor_read(): Socket closed when trying to read 1 bytes from Docker Socket
08/09/22 15:53:12 (pid:3549512) sendDockerAPIRequest(GET /containers/HTCJob1540489_0_slot1_2_PID3549512/stats?stream=0 HTTP/1.0


User condor:

08/09/22 15:42:16 (pid:3545932) Created security session for job owner (condor@xxxxxxxxxxx).
08/09/22 15:42:16 (pid:3545932) DockerProc::PublishToEnv()
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment proto 'GPU_DEVICE_ORDINAL=/(CUDA|OCL)// CUDA_VISIBLE_DEVICES=/CUDA//'
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment 'GPU_DEVICE_ORDINAL' pattern: /(CUDA|OCL)// CUDA_VISIBLE_DEVICES=/CUDA//
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment 'GPU_DEVICE_ORDINAL' no-match of pattern: (CUDA|OCL)
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment proto 'CUDA_VISIBLE_DEVICES=/CUDA//'
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment 'CUDA_VISIBLE_DEVICES' pattern: /CUDA//
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment 'CUDA_VISIBLE_DEVICES' no-match of pattern: CUDA
08/09/22 15:42:16 (pid:3545932) Checking preferred shells: /bin/bash
08/09/22 15:42:16 (pid:3545932) Will use shell /bin/bash
08/09/22 15:42:16 (pid:3545932) StartSSHD: session_dir='/var/lib/condor/execute/dir_3545932/.condor_ssh_to_job_1'
08/09/22 15:42:16 (pid:3545932) Setting LD_PRELOAD=/usr/lib64/condor/libgetpwnam.so for sshd
08/09/22 15:42:16 (pid:3545932) In OsProc::OsProc()
08/09/22 15:42:16 (pid:3545932) Main job KillSignal: 15 (SIGTERM)
08/09/22 15:42:16 (pid:3545932) Main job RmKillSignal: 15 (SIGTERM)
08/09/22 15:42:16 (pid:3545932) Main job HoldKillSignal: 15 (SIGTERM)
08/09/22 15:42:16 (pid:3545932) in SSHDProc::StartJob()
08/09/22 15:42:16 (pid:3545932) in VanillaProc::StartJob()
08/09/22 15:42:16 (pid:3545932) Requesting cgroup htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxx/sshd for job.
08/09/22 15:42:16 (pid:3545932) Value of RequestedChroot is unset.
08/09/22 15:42:16 (pid:3545932) Adding mapping: /var/lib/condor/execute/dir_3545932/tmp/ -> /tmp.
08/09/22 15:42:16 (pid:3545932) Checking the mapping of mount point /tmp.
08/09/22 15:42:16 (pid:3545932) Current mount, /, is shared.
08/09/22 15:42:16 (pid:3545932) Adding mapping: /var/lib/condor/execute/dir_3545932/var/tmp/ -> /var/tmp.
08/09/22 15:42:16 (pid:3545932) Checking the mapping of mount point /var/tmp.
08/09/22 15:42:16 (pid:3545932) Current mount, /var, is shared.
08/09/22 15:42:16 (pid:3545932) PID namespace option: false
08/09/22 15:42:16 (pid:3545932) in OsProc::StartJob()
08/09/22 15:42:16 (pid:3545932) IWD: /var/lib/condor/execute/dir_3545932
08/09/22 15:42:16 (pid:3545932) DockerProc::PublishToEnv()
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment proto 'GPU_DEVICE_ORDINAL=/(CUDA|OCL)// CUDA_VISIBLE_DEVICES=/CUDA//'
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment 'GPU_DEVICE_ORDINAL' pattern: /(CUDA|OCL)// CUDA_VISIBLE_DEVICES=/CUDA//
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment 'GPU_DEVICE_ORDINAL' no-match of pattern: (CUDA|OCL)
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment proto 'CUDA_VISIBLE_DEVICES=/CUDA//'
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment 'CUDA_VISIBLE_DEVICES' pattern: /CUDA//
08/09/22 15:42:16 (pid:3545932) AssignedGPUs environment 'CUDA_VISIBLE_DEVICES' no-match of pattern: CUDA
08/09/22 15:42:16 (pid:3545932) Error file: /var/lib/condor/execute/dir_3545932/.condor_ssh_to_job_1/sshd.log
08/09/22 15:42:16 (pid:3545932) Renice expr "10" evaluated to 10
08/09/22 15:42:16 (pid:3545932) Env = _CONDOR_JOB_IWD=/var/lib/condor/execute/dir_3545932 CUDA_VISIBLE_DEVICES=10000 _CONDOR_SHELL=/bin/bash _CONDOR_SLOT=slot1_1 OPENBLAS_NUM_THREADS=1 TF_LOOP_PARALLEL_ITERATIONS=1 NUMEXPR_NUM_THREADS=1 TMPDIR=/tmp TEMP=/tmp GPU_DEVICE_ORDINAL=10000 _CHIRP_DELAYED_UPDATE_PREFIX=Chirp* _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_3545932 CUBACORES=1 BATCH_SYSTEM=HTCondor _CONDOR_AssignedGPUs=10000 GOMAXPROCS=1 OMP_THREAD_LIMIT=1 TMP=/tmp _CONDOR_WRAPPER_ERROR_FILE=/var/lib/condor/execute/dir_3545932/.job_wrapper_failure _CONDOR_SLOT_NAME=slot1@xxxxxxxxxxxxxxxxxxxxxxx JULIA_NUM_THREADS=1 _CONDOR_BIN=/usr/bin MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 TF_NUM_THREADS=1 _CONDOR_JOB_PIDS=3545960 _CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_3545932/.chirp.config LD_PRELOAD=/usr/lib64/condor/libgetpwnam.so _CONDOR_JOB_AD=/var/lib/condor/execute/dir_3545932/.job.ad _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_3545932/.machine.ad
08/09/22 15:42:16 (pid:3545932) ENFORCE_CPU_AFFINITY not true, not setting affinity
08/09/22 15:42:16 (pid:3545932) Running job as user nobody
08/09/22 15:42:16 (pid:3545932) Using wrapper /usr/libexec/condor/jobwrapper.sh to exec /usr/sbin/sshd -i -e -f /var/lib/condor/execute/dir_3545932/.condor_ssh_to_job_1/sshd_config
08/09/22 15:42:16 (pid:3546902) track_family_via_cgroup: Tracking PID 3546902 via cgroup htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxx/sshd.
08/09/22 15:42:16 (pid:3546902) About to tell ProcD to track family with root 3546902 via cgroup htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxxxxxxxxxxx/sshd
08/09/22 15:42:16 (pid:3546902) Mounting /dev/shm as a private mount successful.
08/09/22 15:42:17 (pid:3545932) Create_Process succeeded, pid=3546902
08/09/22 15:42:17 (pid:3545932) Initializing cgroup library.
08/09/22 15:42:17 (pid:3545932) Limiting (soft) memory usage to 0 bytes
08/09/22 15:42:17 (pid:3545932) Limiting memsw usage to 9223372036854775807 bytes
08/09/22 15:42:17 (pid:3545932) Limiting (hard) memory usage to 404154744832 bytes
08/09/22 15:42:17 (pid:3545932) Limiting (soft) memory usage to 3145728000 bytes
08/09/22 15:42:17 (pid:3545932) Subscribed the starter to OOM notification for this cgroup; jobs triggering an OOM will be put on hold.
08/09/22 15:42:17 (pid:3545932) Process exited, pid=3546890, status=0
08/09/22 15:42:17 (pid:3545932) Reaper: all=2 handled=0 ShuttingDown=0
08/09/22 15:42:17 (pid:3545932) unhandled job exit: pid=3546890, status=0
08/09/22 15:42:17 (pid:3545932) Accepted new connection from ssh client for docker job
08/09/22 15:42:17 (pid:3545932) DockerProc::PublishToEnv()
08/09/22 15:42:17 (pid:3545932) AssignedGPUs environment proto 'GPU_DEVICE_ORDINAL=/(CUDA|OCL)// CUDA_VISIBLE_DEVICES=/CUDA//'
08/09/22 15:42:17 (pid:3545932) AssignedGPUs environment 'GPU_DEVICE_ORDINAL' pattern: /(CUDA|OCL)// CUDA_VISIBLE_DEVICES=/CUDA//
08/09/22 15:42:17 (pid:3545932) AssignedGPUs environment 'GPU_DEVICE_ORDINAL' no-match of pattern: (CUDA|OCL)
08/09/22 15:42:17 (pid:3545932) AssignedGPUs environment proto 'CUDA_VISIBLE_DEVICES=/CUDA//'
08/09/22 15:42:17 (pid:3545932) AssignedGPUs environment 'CUDA_VISIBLE_DEVICES' pattern: /CUDA//
08/09/22 15:42:17 (pid:3545932) AssignedGPUs environment 'CUDA_VISIBLE_DEVICES' no-match of pattern: CUDA
08/09/22 15:42:17 (pid:3545932) adding 27 environment vars to docker args
08/09/22 15:42:17 (pid:3545932) execing: /etc/condor/scripts/docker_wrapper.py exec -ti -e _CONDOR_JOB_IWD=/var/lib/condor/execute/dir_3545932 -e CUDA_VISIBLE_DEVICES=10000 -e _CONDOR_SLOT=slot1_1 -e OPENBLAS_NUM_THREADS=1 -e TF_LOOP_PARALLEL_ITERATIONS=1 -e NUMEXPR_NUM_THREADS=1 -e TMPDIR=/tmp -e TEMP=/tmp -e GPU_DEVICE_ORDINAL=10000 -e _CHIRP_DELAYED_UPDATE_PREFIX=Chirp* -e _CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_3545932 -e CUBACORES=1 -e BATCH_SYSTEM=HTCondor -e _CONDOR_AssignedGPUs=10000 -e GOMAXPROCS=1 -e OMP_THREAD_LIMIT=1 -e TMP=/tmp -e _CONDOR_WRAPPER_ERROR_FILE=/var/lib/condor/execute/dir_3545932/.job_wrapper_failure -e JULIA_NUM_THREADS=1 -e _CONDOR_BIN=/usr/bin -e MKL_NUM_THREADS=1 -e OMP_NUM_THREADS=1 -e TF_NUM_THREADS=1 -e _CONDOR_JOB_PIDS=3545960\ 3546902 -e _CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_3545932/.chirp.config -e _CONDOR_JOB_AD=/var/lib/condor/execute/dir_3545932/.job.ad -e _CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_3545932/.machine.ad HTCJob1540488_0_slot1_1_PID3545932 /bin/bash -i
08/09/22 15:42:17 (pid:3545932) docker exec returned 0 for pid 3546935