
[HTCondor-users] Failed DISCARD_SESSION_KEYRING_ON_STARTUP error when running under docker 1.10



Hi there!

I've been working on getting HTCondor (8.4.9) components running inside docker containers. What I have is the Negotiator, Schedd and Collector daemons running in a single container on one VM, the Startd on another VM, and a client container on another VM with just HTCondor binaries and configuration that allows it to submit jobs to my test cluster.

Everything was working fine until I updated from docker version 1.08 to docker 1.10. On docker 1.10, any of the HTCondor images that launch the condor_master fail at startup with the following error:

---
12/13/16 18:04:18 (fd:7) (pid:69) (D_ALWAYS|D_FAILURE) ERROR "Failed DISCARD_SESSION_KEYRING_ON_STARTUP=True errno=1" at line 437 in file /slots/09/dir_3266446/userdir/.tmpiYeM9h/BUILD/condor-8.4.10/src/condor_master.V6/master.cpp
---

The 3 lines prior to the last ERROR line are normal; they also appear in successful starts on docker 1.08 and are provided only for a bit of context. I should also mention that the base image used for these docker images is RHEL 7.2, and that this issue occurs whether the image was built on 1.08 or 1.10.
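
As a side note on reading the ERROR line: errno=1 on Linux is EPERM ("Operation not permitted"), so the master was denied permission for whatever call this startup step makes. Below is a minimal sketch for pulling the code out of such a log line and decoding it with Python's standard library. The helper name decode_errno is hypothetical (it is not part of HTCondor), and the sample string is only the quoted message portion of the line above.

```python
import errno
import os
import re

# Quoted message portion of the failing condor_master log line.
LOG_MESSAGE = 'ERROR "Failed DISCARD_SESSION_KEYRING_ON_STARTUP=True errno=1"'


def decode_errno(line):
    """Return (number, symbolic name, description) for the first errno=N in line.

    Returns None when the line carries no errno=N field.
    """
    match = re.search(r"errno=(\d+)", line)
    if match is None:
        return None
    code = int(match.group(1))
    # errno.errorcode maps numbers to names such as 'EPERM';
    # os.strerror gives the platform's human-readable description.
    return code, errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


if __name__ == "__main__":
    print(decode_errno(LOG_MESSAGE))
```

The description string comes from the local platform, so it is best to run this on the same kind of system that produced the log.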

I couldn't find many recent threads on this mailing list about HTCondor and docker that weren't about running jobs in the docker universe, but if someone can provide some insight into the error or point me in the right direction, it would be much appreciated.

Thank you!


And in case it helps any, the complete output of the condor_master before the failure is below.


12/13/16 18:04:17 (fd:3) (pid:69) (D_SECURITY) KEYCACHE: created: 0xc189b0
12/13/16 18:04:17 (fd:3) (pid:69) (D_CONFIG) config: using subsystem 'MASTER', local ''
12/13/16 18:04:17 (fd:3) (pid:69) (D_LOAD) Reading from /proc/cpuinfo
12/13/16 18:04:17 (fd:3) (pid:69) (D_LOAD) Found: Physical-IDs:True; Core-IDs:True
12/13/16 18:04:17 (fd:3) (pid:69) (D_HOSTNAME) NETWORK_INTERFACE=* matches lo 127.0.0.1, eth0 172.17.0.1, choosing IP 172.17.0.1
12/13/16 18:04:17 (fd:3) (pid:69) (D_HOSTNAME)    I like it.
12/13/16 18:04:17 (fd:3) (pid:69) (D_HOSTNAME) hostname: cb19431d5c3e (score 4) new winner
12/13/16 18:04:17 (fd:3) (pid:69) (D_HOSTNAME) I am: hostname: cb19431d5c3e, fully qualified doman name: cb19431d5c3e, IP: 172.17.0.1, IPv4: 172.17.0.1, IPv6:
12/13/16 18:04:17 (fd:3) (pid:69) (D_HOSTNAME) Trying to getting network interface informations (after reading config)
12/13/16 18:04:17 (fd:3) (pid:69) (D_HOSTNAME) NETWORK_INTERFACE=* matches lo 127.0.0.1, eth0 172.17.0.1, choosing IP 172.17.0.1
12/13/16 18:04:17 (fd:3) (pid:69) (D_HOSTNAME) NETWORK_HOSTNAME says we are nova-docker02.anim.dreamworks.com
12/13/16 18:04:17 (fd:3) (pid:69) (D_HOSTNAME) NETWORK_INTERFACE=* matches lo 127.0.0.1, eth0 172.17.0.1, choosing IP 172.17.0.1
12/13/16 18:04:17 (fd:3) (pid:69) (D_HOSTNAME)    I like it.
12/13/16 18:04:17 (fd:3) (pid:69) (D_HOSTNAME) hostname: nova-docker02.anim.dreamworks.com (score 4) new winner
12/13/16 18:04:17 (fd:3) (pid:69) (D_HOSTNAME) I am: hostname: nova-docker02, fully qualified doman name: nova-docker02.anim.dreamworks.com, IP: 172.17.0.1, IPv4: 172.17.0.1, IPv6:
12/13/16 18:04:17 (fd:3) (pid:69) (D_PRIV) PRIV_UNKNOWN --> PRIV_CONDOR at /slots/09/dir_3266446/userdir/.tmpiYeM9h/BUILD/condor-8.4.10/src/condor_daemon_core.V6/daemon_core_main.cpp:2204
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS) ******************************************************
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS) ** condor_master (CONDOR_MASTER) STARTING UP
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS) ** /usr/sbin/condor_master
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS) ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS) ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS) ** $CondorVersion: 8.4.10 Dec 13 2016 BuildID: 390598 $
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS) ** $CondorPlatform: x86_64_RedHat7 $
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS) ** PID = 69
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS) ** Log last touched time unavailable (Success)
12/13/16 18:04:17 (fd:3) (pid:69) (D_PRIV) ** Running as root: Privilege switching in effect
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS) ******************************************************
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS) Using config source: /etc/condor/condor_config
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS) Using local config sources:
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS)    /etc/condor/config.d/00_defines
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS)    /etc/condor/config.d/01_general
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS)    /etc/condor/config.d/02_security
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS)    /etc/condor/config.d/03_dockerized
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS)    /etc/condor/config.d/15_startd
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS)    /etc/condor/config.d/15_startd_slots
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS)    /etc/condor/config.d/16_starter
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS)    /etc/condor/config.d/99_env
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS)    /etc/condor/condor_config.local
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS) config Macros = 98, Sorted = 98, StringBytes = 2851, TablesBytes = 3632
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS) CLASSAD_CACHING is OFF
12/13/16 18:04:17 (fd:3) (pid:69) (D_ALWAYS) Daemon Log is logging: D_ALL
12/13/16 18:04:17 (fd:6) (pid:69) (D_PRIV) PRIV_CONDOR --> PRIV_ROOT at /slots/09/dir_3266446/userdir/.tmpiYeM9h/BUILD/condor-8.4.10/src/condor_master.V6/master.cpp:259
12/13/16 18:04:17 (fd:7) (pid:69) (D_PRIV) PRIV_ROOT --> PRIV_CONDOR at /slots/09/dir_3266446/userdir/.tmpiYeM9h/BUILD/condor-8.4.10/src/condor_master.V6/master.cpp:261
12/13/16 18:04:17 (fd:6) (pid:70) (D_PRIV) PRIV_CONDOR --> PRIV_ROOT at /slots/09/dir_3266446/userdir/.tmpiYeM9h/BUILD/condor-8.4.10/src/condor_master.V6/master.cpp:301
12/13/16 18:04:18 (fd:6) (pid:69) (D_DAEMONCORE) Setting up command socket
12/13/16 18:04:18 (fd:6) (pid:69) (D_DAEMONCORE) CONDOR_INHERIT: is NULL
12/13/16 18:04:18 (fd:7) (pid:69) (D_DAEMONCORE) in DaemonCore NewTimer()
12/13/16 18:04:18 (fd:7) (pid:69) (D_DAEMONCORE) leaving DaemonCore NewTimer, id=0
12/13/16 18:04:18 (fd:7) (pid:69) (D_ALWAYS) SharedPortEndpoint: waiting for connections to named socket 69_e81f
12/13/16 18:04:18 (fd:7) (pid:69) (D_ALWAYS) SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
12/13/16 18:04:18 (fd:7) (pid:69) (D_ALWAYS) SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
12/13/16 18:04:18 (fd:7) (pid:69) (D_DAEMONCORE) in DaemonCore NewTimer()
12/13/16 18:04:18 (fd:7) (pid:69) (D_DAEMONCORE) leaving DaemonCore NewTimer, id=1
12/13/16 18:04:18 (fd:7) (pid:69) (D_ALWAYS) DaemonCore: private command socket at <172.17.0.1:0?sock=69_e81f>
12/13/16 18:04:18 (fd:7) (pid:69) (D_DAEMONCORE) Cancel_Signal: signal 17 not found
12/13/16 18:04:18 (fd:7) (pid:69) (D_DAEMONCORE) in DaemonCore NewTimer()
12/13/16 18:04:18 (fd:7) (pid:69) (D_DAEMONCORE) leaving DaemonCore NewTimer, id=2
12/13/16 18:04:18 (fd:7) (pid:69) (D_DAEMONCORE) in DaemonCore NewTimer()
12/13/16 18:04:18 (fd:7) (pid:69) (D_DAEMONCORE) leaving DaemonCore NewTimer, id=3
12/13/16 18:04:18 (fd:7) (pid:69) (D_DAEMONCORE) in DaemonCore NewTimer()
12/13/16 18:04:18 (fd:7) (pid:69) (D_DAEMONCORE) leaving DaemonCore NewTimer, id=4
12/13/16 18:04:18 (fd:7) (pid:69) (D_DAEMONCORE) in DaemonCore NewTimer()
12/13/16 18:04:18 (fd:7) (pid:69) (D_DAEMONCORE) leaving DaemonCore NewTimer, id=5
12/13/16 18:04:18 (fd:7) (pid:69) (D_DAEMONCORE) in DaemonCore NewTimer()
12/13/16 18:04:18 (fd:7) (pid:69) (D_DAEMONCORE) leaving DaemonCore NewTimer, id=6
12/13/16 18:04:18 (fd:7) (pid:69) (D_DAEMONCORE) in DaemonCore NewTimer()
12/13/16 18:04:18 (fd:7) (pid:69) (D_DAEMONCORE) leaving DaemonCore NewTimer, id=7
12/13/16 18:04:18 (fd:7) (pid:69) (D_HOSTNAME) COLLECTOR_HOST is set to "nova-docker02.anim.dreamworks.com"
12/13/16 18:04:18 (fd:7) (pid:69) (D_DAEMONCORE) *** TIMEOUT_MULTIPLIER :: 0
12/13/16 18:04:18 (fd:7) (pid:69) (D_HOSTNAME) Checking if nova-docker02.anim.dreamworks.com is a sinful address
12/13/16 18:04:18 (fd:7) (pid:69) (D_HOSTNAME) nova-docker02.anim.dreamworks.com is not a sinful address: does not begin with "<"
12/13/16 18:04:18 (fd:7) (pid:69) (D_HOSTNAME) New Daemon obj (collector) name: "nova-docker02.anim.dreamworks.com", pool: "NULL", addr: "NULL"
12/13/16 18:04:18 (fd:7) (pid:69) (D_HOSTNAME) Using name "nova-docker02.anim.dreamworks.com" to find daemon
12/13/16 18:04:18 (fd:7) (pid:69) (D_HOSTNAME) Port not specified, using default (9618)
12/13/16 18:04:18 (fd:7) (pid:69) (D_HOSTNAME) Host info "nova-docker02.anim.dreamworks.com" is a hostname, finding IP address
12/13/16 18:04:18 (fd:7) (pid:69) (D_HOSTNAME) Found IP address and port <192.168.162.224:9618>
12/13/16 18:04:18 (fd:7) (pid:69) (D_HOSTNAME) Daemon client (collector) address determined: name: "nova-docker02.anim.dreamworks.com", pool: "nova-docker02.anim.dreamworks.com", alias: "nova-docker02.anim.dreamworks.com", addr: "<192.168.162.224:9618>"
12/13/16 18:04:18 (fd:7) (pid:69) (D_ALWAYS|D_FAILURE) ERROR "Failed DISCARD_SESSION_KEYRING_ON_STARTUP=True errno=1" at line 437 in file /slots/09/dir_3266446/userdir/.tmpiYeM9h/BUILD/condor-8.4.10/src/condor_master.V6/master.cpp