
Re: [HTCondor-users] startd doesn't start



I've attached the diff of the condor_config_val -dump output from the two testbeds in case it helps.

On 17/03/2017 09:28, Alessandra Forti wrote:
Hi,

I'm in a bit of a pickle and can't understand what I'm doing wrong. I have two small testbeds which should have the same configuration; one works and the other doesn't. Both are configured with puppet.

The one that doesn't work is condor-8.6.1; the one that works is condor-8.4.11.

Both are started by root. On both, the UID domain is set to the same value on the head node and on the pool node (as a matter of fact, startd doesn't start on the head node either), and they both have the same pool_password, but there are some differences. For example, in 8.6.1 condor_shared_p starts automatically, while in 8.4.11 it doesn't. We don't [...] The pool_password files are created differently, which is why I stuck with the one that worked on at least one testbed. I can see startd starting for a few seconds and then dying or, according to the logs, getting killed.

In the StartLog files I have this error:

03/17/17 08:20:35 ERROR: Attempt to initialize user_priv with root privileges rejected
03/17/17 08:20:35 ERROR "Programmer Error: attempted switch to user privilege, but user ids are not initialized" at line 1500 in file

In the MasterLog, meanwhile, I have an endless series of these messages:

03/17/17 03:20:33 restarting /usr/sbin/condor_startd in 3600 seconds
03/17/17 04:20:33 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 2717119
03/17/17 04:20:34 DefaultReaper unexpectedly called on pid 2717119, status 1024.
03/17/17 04:20:34 The STARTD (pid 2717119) exited with status 4
03/17/17 04:20:34 restarting /usr/sbin/condor_startd in 3600 seconds
03/17/17 05:20:34 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 2723991
03/17/17 05:20:35 DefaultReaper unexpectedly called on pid 2723991, status 1024.
03/17/17 05:20:35 The STARTD (pid 2723991) exited with status 4
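For reference, "status 1024" and "exited with status 4" in the lines above appear to be the same value seen two ways: in the traditional Linux wait-status layout, the exit code sits in bits 8-15 and the terminating signal in the low 7 bits. A minimal sketch of that decoding follows; the function name is only illustrative and is not part of HTCondor.

```python
def decode_wait_status(status: int) -> dict:
    """Split a raw POSIX-style wait status into exit code and signal.

    Assumes the traditional Linux layout: low 7 bits hold the
    terminating signal (0 if the process exited normally), and
    bits 8-15 hold the exit code.
    """
    signal = status & 0x7F
    exit_code = (status >> 8) & 0xFF
    return {"exit_code": exit_code, "signal": signal}


# The raw status reported by DefaultReaper in the MasterLog.
print(decode_wait_status(1024))
```

Under that assumption, 1024 decodes to exit code 4 with no signal, which would suggest the startd is exiting on its own after the error in the StartLog rather than being killed by a signal.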

I can only find references to these errors that are pretty old or not applicable.

thanks for any help

cheers
alessandra

-- 
Respect is a rational process. \\//
Fatti non foste a viver come bruti, ma per seguir virtute e canoscenza(Dante)
For Ur-Fascism, disagreement is treason. (U. Eco)
But but but her emails... (Anonymous)



diff condor-8.4.11-dump condor-8.6.1-dump
1c1
< # Configuration from machine: vm29.tier2.hep.manchester.ac.uk
---
> # Configuration from machine: vm3.in.tier2.hep.manchester.ac.uk
10a11
> ADVERTISE_IPV4_FIRST = $(PREFER_IPV4)
26a28
> ALWAYS_REUSEADDR = true
28a31,33
> ANNEXD = $(SBIN)/condor_annexd
> ANNEXD_LOG = $(LOG)/AnnexdLog
> ANNEXD_NAME = Annex Daemon@$(FULL_HOSTNAME)
40c45
< BASE_CGROUP = 
---
> BASE_CGROUP = htcondor
53a59
> C_GAHP_DEBUG = D_STATS
68,69c74,76
< CES = localhost, vm29.tier2.hep.manchester.ac.uk
< CGROUP_MEMORY_LIMIT_POLICY = soft
---
> CERTIFICATE_MAPFILE_ASSUME_HASH_KEYS = false
> CES = vm3.in.tier2.hep.manchester.ac.uk
> CGROUP_MEMORY_LIMIT_POLICY = none
72a80
> CHOWN_JOB_SPOOL_FILES = False
96c104
< CMS = localhost, vm29.tier2.hep.manchester.ac.uk
---
> CMS = vm3.in.tier2.hep.manchester.ac.uk
106c114
< COLLECTOR_HOST = localhost, vm29.tier2.hep.manchester.ac.uk
---
> COLLECTOR_HOST = vm3.in.tier2.hep.manchester.ac.uk
124c132
< CONDOR_ADMIN = ops@xxxxxxxxxxxxxxxxxxxxxxxxxx
---
> CONDOR_ADMIN = aforti@xxxxxxxxxxxxxxxxxxxxxxxxxx
131a140,141
> CONDOR_Q_DASH_BATCH_IS_DEFAULT = true
> CONDOR_Q_ONLY_MY_JOBS = true
134c144
< CONDOR_VERSION = 8.4.11
---
> CONDOR_VERSION = 8.6.1
172c182
< DAGMAN_ALWAYS_RUN_POST = true
---
> DAGMAN_ALWAYS_RUN_POST = false
200a211
> DAGMAN_REMOVE_NODE_JOBS = true
206a218
> DAGMAN_SUPPRESS_JOB_LOGS = false
207a220
> DAGMAN_USE_SHARED_PORT = false
224c237
< DEFAULT_DOMAIN_NAME = tier2.hep.manchester.ac.uk
---
> DEFAULT_DOMAIN_NAME = in.tier2.hep.manchester.ac.uk
226a240
> DEFAULT_MASTER_SHUTDOWN_SCRIPT = 
252,255c266,269
< DETECTED_CORES = 4
< DETECTED_CPUS = 4
< DETECTED_MEMORY = 7870
< DETECTED_PHYSICAL_CPUS = 4
---
> DETECTED_CORES = 2
> DETECTED_CPUS = 2
> DETECTED_MEMORY = 1876
> DETECTED_PHYSICAL_CPUS = 2
262d275
< DOMAIN = input_domain
264a278
> EC2_GAHP_DEBUG = D_PID
279,280c293,294
< ENABLE_IPV4 = true
< ENABLE_IPV6 = false
---
> ENABLE_IPV4 = auto
> ENABLE_IPV6 = auto
295d308
< EVENT_LIST = 
305,315d317
< EVENTD_ADMIN_MEGABITS_SEC = 
< EVENTD_CAPACITY_INFO = 
< EVENTD_INTERVAL = 900
< EVENTD_MAX_PREPARATION = 0
< EVENTD_MIN_RESCHEDULE_INTERVAL = 60
< EVENTD_ROUTING_INFO = 
< EVENTD_SHUTDOWN_CLEANUP_INTERVAL = 3600
< EVENTD_SHUTDOWN_CONSTRAINT = 
< EVENTD_SHUTDOWN_SLOW_START_INTERVAL = 0
< EVENTD_SHUTDOWN_TIME = 
< EVENTD_SIMULATE_SHUTDOWNS = 
323c325
< FILESYSTEM_DOMAIN = vm29.tier2.hep.manchester.ac.uk
---
> FILESYSTEM_DOMAIN = vm3.in.tier2.hep.manchester.ac.uk
332c334
< FULL_HOSTNAME = vm29.tier2.hep.manchester.ac.uk
---
> FULL_HOSTNAME = vm3.in.tier2.hep.manchester.ac.uk
411a414
> HAD_ARGS = -sock $(HAD_SOCKET_NAME)
415a419
> HAD_SOCKET_NAME = $(LOCALNAME:had)
419a424
> HAD_USE_SHARED_PORT = false
435c440
< HISTORY_HELPER = $(LIBEXEC)/condor_history_helper
---
> HISTORY_HELPER = $(BIN)/condor_history
438d442
< HOST = input_host
446c450
< HOSTNAME = vm29
---
> HOSTNAME = vm3
447a452,454
> IGNORE_ATTEMPTS_TO_SET_SECURE_JOB_ATTRS = true
> IGNORE_DNS_PROTOCOL_PREFERENCE = $(PREFER_IPV4)
> IGNORE_LEAF_OOM = true
448a456,458
> IGNORE_TARGET_PROTOCOL_PREFERENCE = $(PREFER_IPV4)
> IMMUTABLE_JOB_ATTRS = 
> IN_HIGHPORT = 
453,454c463,466
< IP = input
< IP_ADDRESS = 195.194.105.179
---
> IP_ADDRESS = 195.194.109.190
> IP_ADDRESS_IS_IPV6 = false
> IPV4_ADDRESS = 195.194.109.190
> IPV6_ADDRESS = 2001:630:22:1004:5054:ff:fe66:fe7f
490a503
> JOB_SPOOL_PERMISSIONS = user
531a545
> LOCALNAME = TOOL
535a550
> LOG_TO_SYSLOG = false
536a552
> LOWPORT = 
569a586
> MAX_DAGMAN_LOG = 0
602a620
> MAX_RUNNING_SCHEDULER_JOBS_PER_OWNER = 
643a662
> NEGOTIATOR_CROSS_SLOT_PRIOS = false
658a678,679
> NEGOTIATOR_PREFETCH_REQUESTS = false
> NEGOTIATOR_PREFETCH_REQUESTS_MAX_TIME = 120
681c702
< NUM_CPUS = 2
---
> NUM_CPUS = 8
684a706,707
> OPENMPI_EXCLUDE_NETWORK_INTERFACES = docker0,virbr0
> OPENMPI_INSTALL_PATH = /usr/lib64/openmpi
688c711
< OPSYSLONGNAME = Scientific Linux release 6.7 (Carbon)
---
> OPSYSLONGNAME = Scientific Linux release 6.8 (Carbon)
692c715,716
< OPSYSVER = 607
---
> OPSYSVER = 608
> OUT_HIGHPORT = 
708c732
< PID = 180982
---
> PID = 2752927
718c742
< PPID = 171843
---
> PPID = 2743495
725a750,751
> PREFER_IPV4 = true
> PREFER_OUTBOUND_IPV4 = $(PREFER_IPV4)
735a762
> PROTECTED_JOB_ATTRS = 
780a808
> REPLICATION_ARGS = -sock $(REPLICATION_SOCKET_NAME)
784a813,814
> REPLICATION_SOCKET_NAME = $(LOCALNAME:replication)
> REPLICATION_USE_SHARED_PORT = $(HAD_USE_SHARED_PORT)
860a891
> SECURE_JOB_ATTRS = 
862,866c893
< SETTABLE_ATTRS_ADMINISTRATOR = 
< SETTABLE_ATTRS_ADVERTISE_MASTER = 
< SETTABLE_ATTRS_ADVERTISE_SCHEDD = 
< SETTABLE_ATTRS_ADVERTISE_STARTD = 
< SETTABLE_ATTRS_CLIENT = 
---
> SETTABLE_ATTRS_ADMINSTRATOR = 
868,874d894
< SETTABLE_ATTRS_DAEMON = 
< SETTABLE_ATTRS_DEFAULT = 
< SETTABLE_ATTRS_NEGOTIATOR = 
< SETTABLE_ATTRS_OWNER = 
< SETTABLE_ATTRS_READ = 
< SETTABLE_ATTRS_SOAP = 
< SETTABLE_ATTRS_WRITE = 
889a910
> SHADOW_STATS_LOG = $(LOG)/XferStatsLog
902c923,925
< SINFUL = input
---
> SINGULARITY = /usr/bin/singularity
> SINGULARITY_IMAGE_EXPR = SingularityImage
> SINGULARITY_JOB = false
942c965
< STARTD_DEBUG = 
---
> STARTD_DEBUG = D_COMMAND D_FULLDEBUG
966c989
< STARTER_DEBUG = D_PID
---
> STARTER_DEBUG = D_PID 
978a1002
> STARTER_STATS_LOG = $(LOG)/XferStatsLog
994d1017
< STRING = input
996a1020
> SUBMIT_PUBLISH_WINDOWS_OSVERSIONINFO = false
1002a1027
> SYSTEM_IMMUTABLE_JOB_ATTRS = Owner ClusterId ProcId TotalSubmitProcs MyType TargetType
1005a1031,1032
> SYSTEM_PROTECTED_JOB_ATTRS = 
> SYSTEM_SECURE_JOB_ATTRS = x509userProxySubject x509UserProxyEmail x509UserProxyVOName x509UserProxyFirstFQAN x509UserProxyFQAN
1043c1070
< UID_DOMAIN = tier2.hep.manchester.ac.uk
---
> UID_DOMAIN = in.tier2.hep.manchester.ac.uk
1049a1077
> UPDATE_SPREAD_TIME = $(UPDATE_COLLECTOR_WITH_TCP:0) ? 0 : 8
1062c1090
< USE_SHARED_PORT = false
---
> USE_SHARED_PORT = true
1071,1072c1099,1100
< UTSNAME_NODENAME = vm29.tier2.hep.manchester.ac.uk
< UTSNAME_RELEASE = 2.6.32-642.13.1.el6.x86_64
---
> UTSNAME_NODENAME = vm3.in.tier2.hep.manchester.ac.uk
> UTSNAME_RELEASE = 2.6.32-642.15.1.el6.x86_64
1074c1102
< UTSNAME_VERSION = #1 SMP Tue Jan 10 11:22:50 CST 2017
---
> UTSNAME_VERSION = #1 SMP Thu Feb 23 11:19:57 CST 2017
1126c1154
< WNS = localhost, vm29.tier2.hep.manchester.ac.uk, vm32.tier2.hep.manchester.ac.uk
---
> WNS = vm3.in.tier2.hep.manchester.ac.uk, vm32.in.tier2.hep.manchester.ac.uk, vm33.in.tier2.hep.manchester.ac.uk
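Many of the differences in this diff are host-specific (hostnames, PIDs, kernel versions) and probably unrelated to the problem. A minimal sketch for narrowing the comparison, assuming each dump consists of `NAME = value` lines plus `#` comments as in the diff above; the function names, the sample text, and the ignore list are illustrative only and not part of HTCondor.

```python
def parse_dump(text: str) -> dict:
    """Parse 'NAME = value' lines, skipping blanks and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may contain '='.
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def diff_settings(old: dict, new: dict, ignore=()) -> dict:
    """Return {name: (old_value, new_value)} for settings that differ.

    A value of None means the setting is absent from that dump.
    """
    changes = {}
    for key in sorted(set(old) | set(new)):
        if key in ignore:
            continue
        if old.get(key) != new.get(key):
            changes[key] = (old.get(key), new.get(key))
    return changes


# Small excerpts taken from the two dumps, for illustration.
old_text = """# Configuration from machine: vm29.tier2.hep.manchester.ac.uk
CONDOR_VERSION = 8.4.11
PID = 180982
USE_SHARED_PORT = false
"""
new_text = """# Configuration from machine: vm3.in.tier2.hep.manchester.ac.uk
CONDOR_VERSION = 8.6.1
PID = 2752927
USE_SHARED_PORT = true
PREFER_IPV4 = true
"""

# Hypothetical list of settings expected to differ between hosts.
ignore = {"PID", "PPID", "HOSTNAME", "FULL_HOSTNAME"}
changes = diff_settings(parse_dump(old_text), parse_dump(new_text), ignore)
for name, (before, after) in changes.items():
    print(f"{name}: {before!r} -> {after!r}")
```

In practice the two texts would be read from the files being compared (here condor-8.4.11-dump and condor-8.6.1-dump), and the ignore list extended with whatever is expected to differ between the hosts.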