
Re: [HTCondor-users] vanilla jobs not starting under docker: condor 8.7.10



Hi Kristian,

Are you sure the UID of the job submitter is actually valid (mapped) inside the Docker container?
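
A quick way to check this from inside the worker container is sketched below in Python. It only asks the container's passwd database whether an account resolves. The helper name `describe_account` and the sample uid 1000 are illustrative assumptions, not anything taken from the setup in this thread; `nobody` is included because the StarterLog reports "Running job as user nobody".

```python
import pwd


def describe_account(name_or_uid):
    """Return the passwd entry for a user name or numeric uid, or None if unmapped."""
    try:
        if isinstance(name_or_uid, int):
            return pwd.getpwuid(name_or_uid)
        return pwd.getpwnam(name_or_uid)
    except KeyError:
        # No passwd entry: the account is not known inside this container.
        return None


if __name__ == "__main__":
    # "nobody" comes from the StarterLog; 1000 is a hypothetical submitter uid.
    for account in ("nobody", 1000):
        entry = describe_account(account)
        if entry is None:
            print(f"{account!r}: not mapped in this container")
        else:
            print(f"{account!r}: uid={entry.pw_uid} gid={entry.pw_gid} home={entry.pw_dir}")
```

If an account comes back as not mapped, the container has no entry for it and the job cannot be started under that identity there.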

Best
Christoph 


-- 
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


----- Original Message -----
From: Kristian Kvilekval <kris@xxxxxxxx>
To: htcondor-users@xxxxxxxxxxx
Sent: Wed, 08 May 2019 21:21:04 +0200 (CEST)
Subject: [HTCondor-users] vanilla jobs not starting under docker: condor 8.7.10

I am getting a strange error when starting simple jobs on workers running in
Docker containers. For reference, universe=docker does work.

Note the line "Failed to unshare the mount namespace errno" in the log below.
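
One way to narrow this down is to try the same system call by hand inside the worker container. The Python sketch below calls unshare(2) with CLONE_NEWNS in a forked child and reports the resulting errno. On Linux that call needs CAP_SYS_ADMIN, so errno 1 (EPERM) here would be consistent with the "Operation not permitted" in the log. Whether missing privileges are really the cause in this setup is an assumption, and the function name `can_unshare_mount_namespace` is made up for this example.

```python
import ctypes
import errno
import os

CLONE_NEWNS = 0x00020000  # value of CLONE_NEWNS in <sched.h> on Linux


def can_unshare_mount_namespace():
    """Try unshare(CLONE_NEWNS) in a child process; return (ok, errno_value)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: attempt the call so the parent's namespace is never changed.
        os.close(read_fd)
        libc = ctypes.CDLL(None, use_errno=True)
        rc = libc.unshare(CLONE_NEWNS)
        err = ctypes.get_errno() if rc != 0 else 0
        os.write(write_fd, str(err).encode())
        os._exit(0)
    # Parent: collect the child's result and reap it.
    os.close(write_fd)
    data = os.read(read_fd, 32)
    os.close(read_fd)
    os.waitpid(pid, 0)
    err = int(data.decode() or "-1")  # -1 means the child reported nothing
    return err == 0, err


if __name__ == "__main__":
    ok, err = can_unshare_mount_namespace()
    if ok:
        print("unshare(CLONE_NEWNS) succeeded")
    else:
        print(f"unshare(CLONE_NEWNS) failed: errno={err} ({errno.errorcode.get(err, '?')})")
```

Running it as root inside the container, the same way the condor daemons run, gives the most comparable result.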

The config is also below.

I've spent a day looking at this and am losing hope.
Thanks,
Kris



root@condor-worker-43:/var/log/condor# cat StarterLog.slot1_1
05/08/19 19:09:38 (pid:166)
******************************************************
05/08/19 19:09:38 (pid:166) ** condor_starter (CONDOR_STARTER) STARTING UP
05/08/19 19:09:38 (pid:166) ** /usr/sbin/condor_starter
05/08/19 19:09:38 (pid:166) ** SubsystemInfo: name=STARTER type=STARTER(8)
class=DAEMON(1)
05/08/19 19:09:38 (pid:166) ** Configuration: subsystem:STARTER
local:<NONE> class:DAEMON
05/08/19 19:09:38 (pid:166) ** $CondorVersion: 8.7.10 Oct 31 2018 BuildID:
Debian-8.7.10-1 Debian-8.7.10-1 $
05/08/19 19:09:38 (pid:166) ** $CondorPlatform: X86_64-Debian_9 $
05/08/19 19:09:38 (pid:166) ** PID = 166
05/08/19 19:09:38 (pid:166) ** Log last touched time unavailable (No such
file or directory)
05/08/19 19:09:38 (pid:166)
******************************************************
05/08/19 19:09:38 (pid:166) Using config source: /etc/condor/condor_config
05/08/19 19:09:38 (pid:166) Using local config sources:
05/08/19 19:09:38 (pid:166)    /etc/condor/condor_config.local
05/08/19 19:09:38 (pid:166) config Macros = 79, Sorted = 78, StringBytes =
2184, TablesBytes = 2892
05/08/19 19:09:38 (pid:166) CLASSAD_CACHING is OFF
05/08/19 19:09:38 (pid:166) Daemon Log is logging: D_ALWAYS D_ERROR
05/08/19 19:09:38 (pid:166) SharedPortEndpoint: waiting for connections to
named socket 113_5e5e_3
05/08/19 19:09:38 (pid:166) DaemonCore: command socket at <
10.42.79.108:9886?addrs=10.42.79.108-9886&noUDP&sock=113_5e5e_3>
05/08/19 19:09:38 (pid:166) DaemonCore: private command socket at <
10.42.79.108:9886?addrs=10.42.79.108-9886&noUDP&sock=113_5e5e_3>
05/08/19 19:09:38 (pid:166) Communicating with shadow <
10.42.129.175:9886?addrs=10.42.129.175-9886&noUDP&sock=107_241d_1>
05/08/19 19:09:38 (pid:166) Submitting machine is
"ip-10-42-129-175.us-west-2.compute.internal"
05/08/19 19:09:38 (pid:166) setting the orig job name in starter
05/08/19 19:09:38 (pid:166) setting the orig job iwd in starter
05/08/19 19:09:38 (pid:166) Chirp config summary: IO false, Updates false,
Delayed updates true.
05/08/19 19:09:38 (pid:166) Initialized IO Proxy.
05/08/19 19:09:38 (pid:166) Done setting resource limits
05/08/19 19:09:39 (pid:166) File transfer completed successfully.
05/08/19 19:09:40 (pid:166) Job 1.0 set to execute immediately
05/08/19 19:09:40 (pid:166) Starting a VANILLA universe job with ID: 1.0
05/08/19 19:09:40 (pid:166) IWD: /var/lib/condor/execute/dir_166
05/08/19 19:09:40 (pid:166) Output file:
/var/lib/condor/execute/dir_166/_condor_stdout
05/08/19 19:09:40 (pid:166) Error file:
/var/lib/condor/execute/dir_166/_condor_stderr
05/08/19 19:09:40 (pid:166) Renice expr "0" evaluated to 0
05/08/19 19:09:40 (pid:166) About to exec
/var/lib/condor/execute/dir_166/condor_exec.exe
05/08/19 19:09:40 (pid:166) Running job as user nobody
05/08/19 19:09:40 (pid:170) Failed to unshare the mount namespace errno
05/08/19 19:09:40 (pid:166) Warning: Create_Process: failed to read child process failure code
05/08/19 19:09:40 (pid:166) Create_Process(/var/lib/condor/execute/dir_166/condor_exec.exe): child failed with errno1 (Operation not permitted) before exec()
05/08/19 19:09:40 (pid:166) Create_Process(/var/lib/condor/execute/dir_166/condor_exec.exe,, ...) failed: (errno=1: 'Operation not permitted')
05/08/19 19:09:40 (pid:166) Failed to start job, exiting
05/08/19 19:09:40 (pid:166) ShutdownFast all jobs.
05/08/19 19:09:40 (pid:166) Failed to open '.update.ad' to read update ad:
No such file or directory (2).
05/08/19 19:09:40 (pid:166) condor_read() failed: recv(fd=8) returned -1,
errno = 104 Connection reset by peer, reading 5 bytes from <
10.42.129.175:33495>.
05/08/19 19:09:40 (pid:166) IO: Failed to read packet header
05/08/19 19:09:40 (pid:166) Lost connection to shadow, waiting 2400 secs
for reconnect
05/08/19 19:09:40 (pid:166) All jobs have exited... starter exiting
05/08/19 19:09:40 (pid:166) **** condor_starter (condor_STARTER) pid 166
EXITING WITH STATUS 0
root@condor-worker-43:/var/log/condor# apt-cache search libcgroup
libcgroup-dev - control and monitor control groups (development)
libcgroup1 - control and monitor control groups (library)
root@condor-worker-43:/var/log/condor# apt-cache policy  libcgroup1




CONDOR_HOST = master
#CONDOR_HOST = master
COLLECTOR_NAME = GRID
COLLECTOR_HOST = $(CONDOR_HOST):9886?sock=collector
DAEMON_LIST = MASTER,STARTD,SHARED_PORT
# DAEMON_LIST = MASTER, SCHEDD, STARTD
# DAEMON_LIST = MASTER, SCHEDD
##  When something goes wrong with condor at your site, who should get
##  the email?

CONDOR_ADMIN          = admins@xxxxxxxx
#UID_DOMAIN            = viqi.org
#TRUST_UID_DOMAIN      = True
#SOFT_UID_DOMAIN       = TRUE
#FILESYSTEM_DOMAIN     = viqi.org
##  Do you want to use NFS for file access instead of remote system calls
ALLOW_READ  = $(ALLOW_READ), 172.*, 10.*,
ALLOW_WRITE = $(ALLOW_WRITE), 172.*, 10.*,
ALLOW_NEGOTIATOR      = 172.*, 10.*,
#ALLOW_ADMINISTRATOR   = 172.*, 10.*,
#ALLOW_CONFIG          = 172.*,10.*,
#ALLOW_DAEMON          = 172.*,10.*,

# Use CCB with shared port so outside units can talk to
USE_SHARED_PORT = True
SHARED_PORT_ARGS = -p 9886
UPDATE_COLLECTOR_WITH_TCP = True
CCB_ADDRESS = $(COLLECTOR_HOST)
PRIVATE_NETWORK_NAME = VIQI
BIND_ALL_INTERFACES = True

SEC_DEFAULT_NEGOTIATION = NEVER
SEC_DEFAULT_AUTHENTICATION = NEVER
DISCARD_SESSION_KEYRING_ON_STARTUP = false
BASE_CGROUP =

#PER_JOB_NAMESPACES=False
#USE_PID_NAMESPACES=False
#USE_PROCD = false

# Slots for multi-cpu machines
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = true

START = True
PREEMPT = False
SUSPEND = False
KILL = False
WANT_SUSPEND = False
WANT_VACATE= False
CONTINUE= True


-- 
Kris Kvilekval, Ph.D.
ViQi Inc
(805)-699-6081