
Re: [HTCondor-users] Startd fails on fresh install due to missing execute directory



We’re seeing the same behavior on new NMRbox VMs (no /var/lib/condor). It hasn’t affected our older one, and I’m not sure why. Possibly the directory there is left over from prior installations?

 

Our shared condor account is in Active Directory, and our Linux (Ubuntu) systems connect via sssd, so the installer must be doing something other than parsing /etc/passwd.

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Fabrice Bouye <FabriceB@xxxxxxx>
Date: Thursday, October 5, 2023 at 6:43 PM
To: Tim Theisen <tim@xxxxxxxxxxx>, HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Startd fails on fresh install due to missing execute directory


Hello,

Thanks for your reply. This is quite likely what happened, as /etc/passwd still contained the following line:

 

condor:x:115:119:HTCondor Daemons,,,:/var/lib/condor:/usr/sbin/nologin

 

After further testing:

 

$ apt-get -y remove --purge htcondor && apt-get -y autoremove --purge && rm -fr /etc/condor

-> deletes /var/lib/condor

-> does not delete existing condor user

 

Condor installation (get_htcondor or apt-get install)

-> will not recreate /var/lib/condor if condor user already defined in /etc/passwd
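The condition described above (account still defined, home directory gone) can be checked before reinstalling. The following is a minimal sketch, not part of the HTCondor packaging: the function names passwd_home and check_account_home are made up for illustration. It uses getent rather than reading /etc/passwd directly, so it should also see accounts served through sssd.

```shell
#!/bin/sh
# Hypothetical helper: read one passwd-format line on stdin and
# print its home directory (the 6th colon-separated field).
passwd_home() {
    cut -d: -f6
}

# Hypothetical helper: report whether an account's home directory exists.
# $1 = account name (defaults to "condor")
# Returns 0 if the home exists, 1 if it is missing, 2 if the account is unknown.
check_account_home() {
    account="${1:-condor}"
    # getent queries NSS, so local, LDAP and sssd accounts are all covered.
    entry="$(getent passwd "$account")" || {
        echo "no such account: $account"
        return 2
    }
    home="$(printf '%s\n' "$entry" | passwd_home)"
    if [ -d "$home" ]; then
        echo "ok: $home exists"
    else
        echo "missing: $home"
        return 1
    fi
}

# Example call (run on the affected machine):
#   check_account_home condor
```

If the call reports a missing home directory while the account still exists, the machine is in the state described above.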

 

 

From: Tim Theisen <tim@xxxxxxxxxxx>
Sent: Thursday, October 5, 2023 10:57 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; Fabrice Bouye <FabriceB@xxxxxxx>
Subject: Re: [HTCondor-users] Startd fails on fresh install due to missing execute directory

 

We test installing HTCondor on Ubuntu and have not seen that problem.

The installation script creates the condor account with a home directory of /var/lib/condor.

Is it possible that the condor account already exists with a different home directory? That would be one possible explanation for the failure.

Let me know. I can add some defensive coding for installations where condor's default home directory has been changed.

...Tim

On 10/4/23 17:50, Fabrice Bouye wrote:

Hello,

I am in the process of trying to update an Ubuntu 20 8.8.x flock to 23.0.  

 

I am not sure if this is a new or a well-known issue, but there appears to be a problem with the HTCondor install for 23.0, and possibly 10.x and 10.0 too.

I haven’t tried 9.0, 9.x or 23.x. My 8.8.x installs are old, but I do not remember experiencing this issue when doing fresh 8.8.x installs.

As part of the testing, I tried directly upgrading from 8.8.x, but I also tried fresh installs using both the get_htcondor script and a manual install.

In both of the latter cases, the previous install was removed using the command line from the get_htcondor script: apt-get -y remove --purge htcondor && apt-get -y autoremove --purge && rm -fr /etc/condor (essentially what get_htcondor suggests for removing older installs).

 

On Ubuntu 20, this issue seems to appear every time when installing with get_htcondor (minicondor or other roles) or manually with apt-get install htcondor or apt-get install minicondor.

When installing on a condor-free system, /etc/condor is recreated, but the root of the execute directory, /var/lib/condor, does not seem to be recreated by the installation process.

Of course, this issue does not occur when updating from 8.8.x, as the /var/lib/condor directory already exists.

 

This path is defined in the default /etc/condor/condor_config that is deployed by the installation:

LOCAL_DIR = /var

 

[…]

 

EXECUTE = $(LOCAL_DIR)/lib/condor/execute
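To see which part of the expanded EXECUTE path is actually absent, one can walk up the path until an existing directory is found. This is only a sketch: first_missing_ancestor is a hypothetical helper name, and the example call assumes condor_config_val is on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: print the topmost component of path $1 that does
# not exist. Prints nothing if the whole path already exists.
first_missing_ancestor() {
    path="$1"
    missing=""
    # Climb towards the root while the current path is absent.
    while [ "$path" != "/" ] && [ "$path" != "." ] && [ ! -e "$path" ]; do
        missing="$path"
        path="$(dirname "$path")"
    done
    [ -n "$missing" ] && echo "$missing"
    return 0
}

# Example call (assumes condor_config_val is installed):
#   first_missing_ancestor "$(condor_config_val EXECUTE)"
```

With the configuration shown here, and /var/lib present but /var/lib/condor absent, the helper would print /var/lib/condor, which is the directory root has to create.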

 

This issue makes startd fail in a loop on execute machines when condor starts (see logs below), probably because the condor user cannot create a directory in /var/lib.

The fix is for root to create the missing directory /var/lib/condor. startd will then recreate the execute sub-directory, owned by the condor user, the next time condor is restarted.

 

$ cd /var/lib
$ mkdir condor

$ chmod 755 condor
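The same fix can be written so that it is safe to run repeatedly, for example from a provisioning script. This is a sketch under assumptions: ensure_dir is a hypothetical helper name, and it only reproduces the mkdir and chmod steps above without changing ownership.

```shell
#!/bin/sh
# Hypothetical helper: create directory $1 with mode 755 if it is missing.
# Defaults to /var/lib/condor, the path from this thread.
ensure_dir() {
    dir="${1:-/var/lib/condor}"
    # install -d creates the directory (and any missing parents) and applies
    # the mode; it also succeeds when the directory already exists.
    install -d -m 755 "$dir"
}

# Example call (as root on the affected machine):
#   ensure_dir /var/lib/condor
```

After running it as root, restart condor (for example with systemctl restart condor, assuming the systemd unit is named condor) so that startd can recreate the execute sub-directory.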

 

In /var/log/condor/MasterLog:

10/05/23 10:09:11 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 69386

10/05/23 10:09:11 Daemons::StartAllDaemons all daemons were started

10/05/23 10:09:14 The STARTD (pid 69386) exited with status 4

10/05/23 10:09:14 Sending obituary for "/usr/sbin/condor_startd"

10/05/23 10:09:14 restarting /usr/sbin/condor_startd in 10 seconds

[…] loops from here

 

In /var/log/condor/StartLog:

10/05/23 10:10:42 ******************************************************

10/05/23 10:10:42 ** condor_startd (CONDOR_STARTD) STARTING UP

10/05/23 10:10:42 ** /usr/sbin/condor_startd

10/05/23 10:10:42 ** SubsystemInfo: name=STARTD type=STARTD(6) class=DAEMON(1)

10/05/23 10:10:42 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON

10/05/23 10:10:42 ** $CondorVersion: 23.0.0 2023-09-29 BuildID: 678686 PackageID: 23.0.0-1 $

10/05/23 10:10:42 ** $CondorPlatform: X86_64-Ubuntu_20.04 $

10/05/23 10:10:42 ** PID = 70350

10/05/23 10:10:42 ** Log last touched 10/5 10:10:17

10/05/23 10:10:42 ******************************************************

10/05/23 10:10:42 Using config source: /etc/condor/condor_config

10/05/23 10:10:42 Using local config sources:

10/05/23 10:10:42    /etc/condor/config.d/00-htcondor-9.0.config

10/05/23 10:10:42    /etc/condor/condor_config.local

10/05/23 10:10:42 config Macros = 93, Sorted = 93, StringBytes = 2628, TablesBytes = 3404

10/05/23 10:10:42 CLASSAD_CACHING is ENABLED

10/05/23 10:10:42 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS

10/05/23 10:10:42 SharedPortEndpoint: waiting for connections to named socket startd_69340_0d0f

10/05/23 10:10:42 DaemonCore: command socket at <192.168.8.246:9618?addrs=192.168.8.246-9618&alias=suvofpcand20.corp.spc.int&noUDP&sock=startd_69340_0d0f>

10/05/23 10:10:42 DaemonCore: private command socket at <192.168.8.246:9618?addrs=192.168.8.246-9618&alias=suvofpcand20.corp.spc.int&noUDP&sock=startd_69340_0d0f>

10/05/23 10:10:45 VM universe will be tested to check if it is available

10/05/23 10:10:45 History file rotation is enabled.

10/05/23 10:10:45   Maximum history file size is: 20971520 bytes

10/05/23 10:10:45   Number of rotated history files is: 2

10/05/23 10:10:45 Startd will not enforce disk limits via logical volume management.

10/05/23 10:10:45 Failed to stat /var/lib/condor/execute: (errno 2) No such file or directory

10/05/23 10:10:45 ERROR "Error accessing execute directory /var/lib/condor/execute specified in the configuration setting SLOT1_EXECUTE: (errno=2) No such file or directory" at line 78 in file /var/lib/condor/execute/slot1/dir_3182108/userdir/build-I2xw6a/condor-23.0.0/src/condor_startd.V6/slot_builder.cpp




_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
 
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
-- 
Tim Theisen (he, him, his)
Release Manager
HTCondor & Open Science Grid
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin - Madison
4261 Computer Sciences and Statistics
1210 W Dayton St
Madison, WI 53706-1685
+1 608 265 5736