
[HTCondor-users] HTCondor 8.6.8, 8.6.9 and 8.7.5 Job Run Error: Create_Process failed to register the job with the ProcD



Hi,
I'm building a testbed with Docker and HTCondor. I set up 2 nodes, 1 Master/Submit and 1 Execute. The whole installation is run as the root user (inside the container), and a submit user is created afterwards. No errors are shown during the installation or by condor_submit, but when the jobs start executing, I get this error in the job's log file:
001 (001.000.000) 02/22 18:30:34 Job executing on host: <172.17.0.3:9619?addrs=172.17.0.3-9619&noUDP&sock=4704_71ff_3>
...
007 (001.000.000) 02/22 18:30:34 Shadow exception!
Error from slot1_1@xxxxxxxxxxxx: Create_Process failed to register the job with the ProcD
0  -  Run Bytes Sent By Job
1037713  -  Run Bytes Received By Job

It's strange, because with HTCondor releases 8.4.8 and 8.4.12 everything works great: no errors, and jobs run and finish. But I tried releases from 8.6.8 up to 8.7.5, and all of them return exactly that same error.

The pool's config is:
Base Docker Container: Ubuntu 16.04
1 Master/Submit node in Docker container IP: 172.17.0.2
1 Execute node in Docker container IP: 172.17.0.3

Both containers share this /etc/hosts file for name resolution:
127.0.0.1     localhost
172.17.0.2    htm.htc.dev htm
172.17.0.3    htn1.htc.dev htn1

I checked the ShadowLog and found this error:
02/22/18 18:30:32 ******************************************************
02/22/18 18:30:32 ** condor_shadow (CONDOR_SHADOW) STARTING UP
02/22/18 18:30:32 ** /usr/sbin/condor_shadow
02/22/18 18:30:32 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
02/22/18 18:30:32 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
02/22/18 18:30:32 ** $CondorVersion: 8.6.9 Jan 03 2018 BuildID: 428149 $
02/22/18 18:30:32 ** $CondorPlatform: x86_64_Ubuntu14 $
02/22/18 18:30:32 ** PID = 4832
02/22/18 18:30:32 ** Log last touched 2/22 18:30:18
02/22/18 18:30:32 ******************************************************
02/22/18 18:30:32 Using config source: /etc/condor/condor_config
02/22/18 18:30:32 Using local config sources:
02/22/18 18:30:32    /etc/condor/condor_config.local
02/22/18 18:30:32 config Macros = 68, Sorted = 68, StringBytes = 1758, TablesBytes = 1136
02/22/18 18:30:32 CLASSAD_CACHING is OFF
02/22/18 18:30:32 Daemon Log is logging: D_ALWAYS D_ERROR
02/22/18 18:30:32 SharedPortEndpoint: waiting for connections to named socket 4765_1f9b_14
02/22/18 18:30:32 DaemonCore: command socket at <172.17.0.2:9618?addrs=172.17.0.2-9618&noUDP&sock=4765_1f9b_14>
02/22/18 18:30:32 DaemonCore: private command socket at <172.17.0.2:9618?addrs=172.17.0.2-9618&noUDP&sock=4765_1f9b_14>
02/22/18 18:30:32 Initializing a VANILLA shadow for job 1.0
02/22/18 18:30:32 (1.0) (4832): Request to run on slot1_1@xxxxxxxxxxxx <172.17.0.3:9619?addrs=172.17.0.3-9619&noUDP&sock=4704_71ff_3> was ACCEPTED
02/22/18 18:30:33 (1.0) (4832): File transfer completed successfully.
02/22/18 18:30:34 (1.0) (4832): ERROR "Error from slot1_1@xxxxxxxxxxxx: Create_Process failed to register the job with the ProcD" at line 608 in file /slots/01/dir_1624282/sources/src/condor_shadow.V6.1/pseudo_ops.cpp

And StarterLog.slot1_1 shows this:
02/22/18 18:30:32 (pid:4819) ******************************************************
02/22/18 18:30:32 (pid:4819) ** condor_starter (CONDOR_STARTER) STARTING UP
02/22/18 18:30:32 (pid:4819) ** /usr/sbin/condor_starter
02/22/18 18:30:32 (pid:4819) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
02/22/18 18:30:32 (pid:4819) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
02/22/18 18:30:32 (pid:4819) ** $CondorVersion: 8.6.9 Jan 03 2018 BuildID: 428149 $
02/22/18 18:30:32 (pid:4819) ** $CondorPlatform: x86_64_Ubuntu14 $
02/22/18 18:30:32 (pid:4819) ** PID = 4819
02/22/18 18:30:32 (pid:4819) ** Log last touched 2/22 18:30:18
02/22/18 18:30:32 (pid:4819) ******************************************************
02/22/18 18:30:32 (pid:4819) Using config source: /etc/condor/condor_config
02/22/18 18:30:32 (pid:4819) Using local config sources:
02/22/18 18:30:32 (pid:4819)    /etc/condor/condor_config.local
02/22/18 18:30:32 (pid:4819) config Macros = 87, Sorted = 86, StringBytes = 2810, TablesBytes = 3180
02/22/18 18:30:32 (pid:4819) CLASSAD_CACHING is OFF
02/22/18 18:30:32 (pid:4819) Daemon Log is logging: D_ALWAYS D_ERROR
02/22/18 18:30:32 (pid:4819) SharedPortEndpoint: waiting for connections to named socket 4739_cdda_14
02/22/18 18:30:32 (pid:4819) DaemonCore: command socket at <172.17.0.3:9619?addrs=172.17.0.3-9619&noUDP&sock=4739_cdda_14>
02/22/18 18:30:32 (pid:4819) DaemonCore: private command socket at <172.17.0.3:9619?addrs=172.17.0.3-9619&noUDP&sock=4739_cdda_14>
02/22/18 18:30:32 (pid:4819) Communicating with shadow <172.17.0.2:9618?addrs=172.17.0.2-9618&noUDP&sock=4765_1f9b_14>
02/22/18 18:30:32 (pid:4819) Submitting machine is "htm.htc.dev"
02/22/18 18:30:32 (pid:4819) setting the orig job name in starter
02/22/18 18:30:32 (pid:4819) setting the orig job iwd in starter
02/22/18 18:30:32 (pid:4819) passwd_cache::cache_uid(): getpwnam("condor1") failed: user not found
02/22/18 18:30:32 (pid:4819) Chirp config summary: IO false, Updates false, Delayed updates true.
02/22/18 18:30:32 (pid:4819) Initialized IO Proxy.
02/22/18 18:30:32 (pid:4819) Done setting resource limits
02/22/18 18:30:33 (pid:4819) File transfer completed successfully.
02/22/18 18:30:34 (pid:4819) Job 1.0 set to execute immediately
02/22/18 18:30:34 (pid:4819) Starting a VANILLA universe job with ID: 1.0
02/22/18 18:30:34 (pid:4819) IWD: /var/lib/condor/execute/dir_4819
02/22/18 18:30:34 (pid:4819) Output file: /var/lib/condor/execute/dir_4819/_condor_stdout
02/22/18 18:30:34 (pid:4819) Error file: /var/lib/condor/execute/dir_4819/_condor_stderr
02/22/18 18:30:34 (pid:4819) Renice expr "0" evaluated to 0
02/22/18 18:30:34 (pid:4819) About to exec /var/lib/condor/execute/dir_4819/condor_exec.exe test.bash 61
02/22/18 18:30:34 (pid:4819) Running job as user same uid as parent: personal condor
02/22/18 18:30:34 (pid:4823) Result of "track_family_via_cgroup" operation from ProcD: ERROR: No cgroup available for tracking
02/22/18 18:30:34 (pid:4823) Create_Process: error tracking family with root 4823 via cgroup htcondor/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxx
02/22/18 18:30:34 (pid:4819) Create_Process(/var/lib/condor/execute/dir_4819/condor_exec.exe): child failed because it failed to register itself with the ProcD
02/22/18 18:30:34 (pid:4819) ERROR "Create_Process failed to register the job with the ProcD" at line 632 in file /slots/01/dir_1624282/sources/src/condor_starter.V6.1/os_proc.cpp
02/22/18 18:30:34 (pid:4819) ShutdownFast all jobs.
02/22/18 18:30:34 (pid:4819) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <172.17.0.2:34841>.
02/22/18 18:30:34 (pid:4819) IO: Failed to read packet header
02/22/18 18:30:34 (pid:4819) Lost connection to shadow, waiting 2400 secs for reconnect
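
To pick the relevant lines out of a longer StarterLog, a small filter like the sketch below could be used. The function name procd_lines and the keyword pattern are my own choices for illustration, not anything defined by HTCondor, and the sample text is just three lines copied from the log above.

```python
import re

# Keywords that mark the failure in the StarterLog excerpt above.
# This pattern is an assumption about what is relevant, not an HTCondor format.
PROCD_PATTERN = re.compile(r"ProcD|cgroup|ERROR")


def procd_lines(log_text):
    """Return the log lines that mention the ProcD, cgroups, or an ERROR."""
    return [line for line in log_text.splitlines() if PROCD_PATTERN.search(line)]


# Three lines copied from the StarterLog above, used as sample input.
SAMPLE = """\
02/22/18 18:30:34 (pid:4819) Job 1.0 set to execute immediately
02/22/18 18:30:34 (pid:4823) Result of "track_family_via_cgroup" operation from ProcD: ERROR: No cgroup available for tracking
02/22/18 18:30:34 (pid:4819) ShutdownFast all jobs.
"""

for line in procd_lines(SAMPLE):
    print(line)
```

On the full log this keeps the track_family_via_cgroup result, the two Create_Process lines and the final ERROR line, which is where the failure seems to start.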

This is the condor_config.local file for MasterNode:
##### VALUES ADDED BY htconfig_v2.py on: 13/02/2018 18:38:58 #####
# Condor Master
CONDOR_HOST = $(FULL_HOSTNAME)

# Type: Condor Master & Schedd
DAEMON_LIST = MASTER,COLLECTOR,NEGOTIATOR,SCHEDD,SHARED_PORT

# Contact's email / email de contacto
CONDOR_ADMIN = root@$(FULL_HOSTNAME)

# User ID Domain
UID_DOMAIN = htc.dev

# Filesystem Domain
FILESYSTEM_DOMAIN = htc.dev

# Deshabilitar uso de Swap / Disable Swap use.
RESERVED_SWAP = 0

# Allowed computers / Equipos permitidos
ALLOW_WRITE = *.htc.dev,172.17.*

# Enable use a Shared port / Habilitar uso de un Shared Port
USE_SHARED_PORT = True

#Fix for docker
DISCARD_SESSION_KEYRING_ON_STARTUP=False

And this is condor_config.local for ExecuteNode:
##### VALUES ADDED BY htconfig_v2.py on: 13/02/2018 18:43:32 #####
# Condor Master
CONDOR_HOST = htm.htc.dev

# Type: Condor Worker
DAEMON_LIST = MASTER,STARTD,SHARED_PORT

# Contact's email / email de contacto
CONDOR_ADMIN = root@$(FULL_HOSTNAME)

# User ID Domain
UID_DOMAIN = htc.dev

# Filesystem Domain
FILESYSTEM_DOMAIN = htc.dev

# Deshabilitar uso de Swap / Disable Swap use.
RESERVED_SWAP = 0

# Allowed computers / Equipos permitidos
ALLOW_WRITE = *.htc.dev,172.17.*

# Enable use a Shared port / Habilitar uso de un Shared Port
USE_SHARED_PORT = True

# Processes different to Collector use port 9619
# Procesos diferentes a Collector usar puerto 9619
SHARED_PORT_ARGS = -p 9619

# Create required Slots / Crear Slots requeridos
NUM_SLOTS = 1

# Dynamic Slot / Slot Dinamico
# Use only available resources for the Slot / usar solo los recursos disponibles para el Slot
SLOT_TYPE_1 = cpu=auto, ram=auto
# Enable dynamic resources in this Slot / Habilitar recursos dinamicos en este Slot
SLOT_TYPE_1_PARTITIONABLE = True
# Create Slot / Crear Slot
NUM_SLOTS_TYPE_1 = 1
# Always run jobs in this slot / Siempre ejecutar tareas en este slot
SLOT_TYPE_1_START = True

# Uncomment for Debug / Descomente para depuracion
#STARTD_DEBUG = D_FULLDEBUG

# Enable unexistent user jobs
# Permitir tareas de usuarios no existentes
SHADOW_RUN_UNKNOWN_USER_JOBS = True
SOFT_UID_DOMAIN = True

#Fix for docker
DISCARD_SESSION_KEYRING_ON_STARTUP=False
----------------------

These config files were used to test releases 8.4.8, 8.4.12, 8.6.8, 8.6.9 and 8.7.5, and only the 8.4.x releases did not fail.
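
Since the StarterLog reports "No cgroup available for tracking", it may be worth checking which cgroup file systems the Execute container actually sees. The sketch below parses text in the /proc/mounts format and lists the cgroup mounts together with a read-only flag. The function names and the sample mount lines are made up for illustration; I am only assuming the standard /proc/mounts field order (device, mount point, file system type, options).

```python
def cgroup_mounts(mounts_text):
    """Return (mount_point, fs_type, read_only) for each cgroup mount in the text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        options = fields[3].split(",")
        if fs_type in ("cgroup", "cgroup2"):
            found.append((mount_point, fs_type, "ro" in options))
    return found


def report(mounts_text):
    """Build a short human-readable summary of the cgroup mounts."""
    mounts = cgroup_mounts(mounts_text)
    if not mounts:
        return "no cgroup file systems mounted"
    return "\n".join(
        "{} ({}, {})".format(point, fs_type, "read-only" if ro else "read-write")
        for point, fs_type, ro in mounts
    )


# Made-up sample input, NOT taken from my containers.
SAMPLE_MOUNTS = """\
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/memory cgroup ro,nosuid,nodev,noexec,relatime,memory 0 0
"""

print(report(SAMPLE_MOUNTS))
# Inside the container the real data would come from:
#   print(report(open("/proc/mounts").read()))
```

If this lists no cgroup mounts, or only read-only ones, that would at least be consistent with the ProcD message, although I have not confirmed that this is the cause.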

Any idea why I'm getting these errors and how I can fix them?

Thanks in advance.
--