
[HTCondor-users] permission denied for .update.ad on local file system



Hi,

In our HTCondor setup we start condor as root with the usual systemd
service file; it subsequently drops privileges to become the local user
'condor'. We have also moved this user's directories to /local/condor,
mirroring the layout usually found under /var/lib/condor.

However, we may be missing something: on a host that we are currently
restarting peacefully, we see error messages like:

Failed to open '.update.ad' to read update ad: Permission denied (13).

Failed to rename(/local/condor/execute/dir_17864/core,/local/condor/execute/dir_17864/core.402177.0): errno 13 (Permission denied)

and the like

(full Starterlog attached for easier reading).
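
To judge whether these messages are isolated or widespread, one could
tally them per starter. The following is only a minimal sketch: the line
layout in the regular expression is an assumption inferred from the log
lines in this message, not a format defined by HTCondor, and the function
name is made up for illustration.

```python
import re
from collections import Counter

# Assumed layout, inferred from the StarterLog lines in this message:
#   MM/DD/YY HH:MM:SS (pid:N) message text
LINE_RE = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \(pid:(\d+)\) (.*)$")

def tally_permission_errors(lines):
    """Count 'Permission denied' / errno 13 messages per starter pid."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue  # skip lines that do not follow the assumed layout
        _, pid, msg = m.groups()
        if "Permission denied" in msg or re.search(r"errno\s*=?\s*13\b", msg):
            counts[int(pid)] += 1
    return counts

# Three lines copied from the attached log: two errors, one normal message.
sample = [
    "05/11/20 08:50:12 (pid:17864) Failed to open '.update.ad' to read update ad: Permission denied (13).",
    "05/11/20 08:53:17 (pid:17864) Process exited, pid=17868, status=0",
    "05/11/20 08:53:17 (pid:17864) ReliSock: put_file: Failed to open file /local/condor/execute/dir_17864/_condor_stdout, errno = 13.",
]
print(tally_permission_errors(sample))
```

Passing an open log file instead of `sample` would give the same kind of
tally for a whole StarterLog.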

Is this something we should worry about or "just noise"?

At first glance, permissions look ok, e.g.

ls -ld /local/condor/*
drwxr-xr-x   2 condor condor   10 Feb 13 15:48 /local/condor/ViewHist
drwxr-xr-x   2 condor condor   10 Feb 13 15:48 /local/condor/ckpt
drwxr-xr-x   2 condor condor   10 Mar 16 09:28 /local/condor/cred_dir
drwxr-xr-x 119 condor condor 4096 May 11 09:07 /local/condor/execute
drwxr-xr-x   3 condor condor   40 Mar 16 09:28 /local/condor/spool

and
ls -ld /local/condor/execute/*|head -n 5
drwx------ 4 condor condor 288 May 11 09:07 /local/condor/execute/dir_1082
drwx------ 4 condor condor 288 May 11 09:07 /local/condor/execute/dir_11017
drwx------ 4 condor condor 288 May 11 09:07 /local/condor/execute/dir_1149
drwx------ 4 condor condor 288 May 11 09:07 /local/condor/execute/dir_11880
drwx------ 4 condor condor 288 May 11 09:07 /local/condor/execute/dir_12025

Obviously, as the job is already gone, /local/condor/execute/dir_17864
does not exist anymore.
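
Since `ls -ld` only shows the directories themselves, and the log says
the job runs as user USER, it might also be worth looking at who owns
the entries inside a per-job directory while a job is still running. A
minimal sketch of such a check follows; the function names are made up
for illustration, and whether ownership is actually the cause here is
only a guess.

```python
import os
import stat

def ownership_report(top):
    """Return (mode, uid, gid, path) tuples for `top` and everything below it.

    Note: os.walk silently skips directories it cannot read, so this
    should be run as a user that can enter all of them (e.g. root).
    """
    paths = [top]
    for root, dirs, files in os.walk(top):
        paths.extend(os.path.join(root, name) for name in dirs + files)
    rows = []
    for path in paths:
        st = os.lstat(path)  # lstat: describe symlinks themselves, do not follow
        rows.append((stat.filemode(st.st_mode), st.st_uid, st.st_gid, path))
    return rows

def not_owned_by(rows, uid):
    """Filter the report down to entries whose owner is not `uid`."""
    return [row for row in rows if row[1] != uid]
```

For example, `not_owned_by(ownership_report("/local/condor/execute"), uid)`
with the numeric uid of 'condor' would list everything under the execute
directory that belongs to some other user.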

Does anyone have a hint as to what is wrong here?

Cheers

carsten
-- 
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185
05/11/20 06:09:47 (pid:17864) ******************************************************
05/11/20 06:09:47 (pid:17864) ** condor_starter (CONDOR_STARTER) STARTING UP
05/11/20 06:09:47 (pid:17864) ** /usr/sbin/condor_starter
05/11/20 06:09:47 (pid:17864) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
05/11/20 06:09:47 (pid:17864) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
05/11/20 06:09:47 (pid:17864) ** $CondorVersion: 8.8.8 Mar 20 2020 BuildID: Debian-8.8.8-1 PackageID: 8.8.8-1 Debian-8.8.8-1 $
05/11/20 06:09:47 (pid:17864) ** $CondorPlatform: X86_64-Debian_10 $
05/11/20 06:09:47 (pid:17864) ** PID = 17864
05/11/20 06:09:47 (pid:17864) ** Log last touched 5/11 06:09:18
05/11/20 06:09:47 (pid:17864) ******************************************************
05/11/20 06:09:47 (pid:17864) Using config source: /etc/condor/condor_config
05/11/20 06:09:47 (pid:17864) Using local config sources:
05/11/20 06:09:47 (pid:17864)    /etc/condor/config.d/01_generic
05/11/20 06:09:47 (pid:17864)    /etc/condor/config.d/10_EXECUTE
05/11/20 06:09:47 (pid:17864)    /etc/condor/condor_config.local
05/11/20 06:09:47 (pid:17864) config Macros = 100, Sorted = 99, StringBytes = 2506, TablesBytes = 3656
05/11/20 06:09:47 (pid:17864) CLASSAD_CACHING is OFF
05/11/20 06:09:47 (pid:17864) Daemon Log is logging: D_ALWAYS D_ERROR
05/11/20 06:09:47 (pid:17864) SharedPortEndpoint: waiting for connections to named socket 50180_1972_1711
05/11/20 06:09:47 (pid:17864) DaemonCore: command socket at <10.10.37.5:9618?addrs=10.10.37.5-9618&noUDP&sock=50180_1972_1711>
05/11/20 06:09:47 (pid:17864) DaemonCore: private command socket at <10.10.37.5:9618?addrs=10.10.37.5-9618&noUDP&sock=50180_1972_1711>
05/11/20 06:09:47 (pid:17864) Communicating with shadow <10.20.30.17:9618?addrs=10.20.30.17-9618&noUDP&sock=3248571_e61f_507703>
05/11/20 06:09:47 (pid:17864) Submitting machine is "condor2.atlas.local"
05/11/20 06:09:47 (pid:17864) setting the orig job name in starter
05/11/20 06:09:47 (pid:17864) setting the orig job iwd in starter
05/11/20 06:09:47 (pid:17864) Job has WantIOProxy=true
05/11/20 06:09:47 (pid:17864) Chirp config summary: IO true, Updates true, Delayed updates true.
05/11/20 06:09:47 (pid:17864) Initialized IO Proxy.
05/11/20 06:09:47 (pid:17864) Done setting resource limits
05/11/20 06:09:47 (pid:17864) File transfer completed successfully.
05/11/20 06:09:47 (pid:17864) Job 402177.0 set to execute immediately
05/11/20 06:09:47 (pid:17864) Starting a VANILLA universe job with ID: 402177.0
05/11/20 06:09:47 (pid:17864) Current mount, /tmp, is shared.
05/11/20 06:09:47 (pid:17864) Current mount, /var, is shared.
05/11/20 06:09:47 (pid:17864) IWD: /work/USER/[......]
05/11/20 06:09:47 (pid:17864) Output file: /local/condor/execute/dir_17864/_condor_stdout
05/11/20 06:09:47 (pid:17864) Error file: /local/condor/execute/dir_17864/_condor_stderr
05/11/20 06:09:47 (pid:17864) Renice expr "0" evaluated to 0
05/11/20 06:09:47 (pid:17864) Running job as user USER
05/11/20 06:09:47 (pid:17864) About to exec /usr/bin/../bin/pegasus-kickstart [long command line deleted]
05/11/20 06:09:47 (pid:17864) Create_Process succeeded, pid=17868
05/11/20 08:50:12 (pid:17864) Failed to open '.update.ad' to read update ad: Permission denied (13).
05/11/20 08:50:12 (pid:17864) Failed to open '.update.ad' to read update ad: Permission denied (13).
05/11/20 08:53:17 (pid:17864) Process exited, pid=17868, status=0
05/11/20 08:53:17 (pid:17864) Failed to rename(/local/condor/execute/dir_17864/core.17868,/local/condor/execute/dir_17864/core.402177.0): errno 13 (Permission denied)
05/11/20 08:53:17 (pid:17864) Failed to rename(/local/condor/execute/dir_17864/core,/local/condor/execute/dir_17864/core.402177.0): errno 13 (Permission denied)
05/11/20 08:53:17 (pid:17864) Failed to open '.update.ad' to read update ad: Permission denied (13).
05/11/20 08:53:17 (pid:17864) ReliSock: put_file: Failed to open file /local/condor/execute/dir_17864/_condor_stdout, errno = 13.
05/11/20 08:53:17 (pid:17864) ReliSock: put_file: Failed to open file /local/condor/execute/dir_17864/_condor_stderr, errno = 13.
05/11/20 08:53:17 (pid:17864) DoUpload: (Condor error code 13, subcode 13) STARTER at 10.10.37.5 failed to send file(s) to <10.20.30.17:9618>: error reading from /local/condor/execute/dir_17864/_condor_stdout: (errno 13) Permission denied; SHADOW failed to receive file(s) from <10.10.37.5:43947>
05/11/20 08:53:17 (pid:17864) JICShadow::notifyJobTermination(): Sending mock terminate event.
05/11/20 08:53:17 (pid:17864) JIC::transferOutput() failed, waiting for job lease to expire or for a reconnect attempt
05/11/20 08:53:17 (pid:17864) Returning from CStarter::JobReaper()
05/11/20 08:53:17 (pid:17864) Connection to shadow may be lost, will test by sending whoami request.
05/11/20 08:53:17 (pid:17864) condor_write(): Socket closed when trying to write 21 bytes to <10.20.30.17:29033>, fd is 9
05/11/20 08:53:17 (pid:17864) Buf::write(): condor_write() failed
05/11/20 08:53:17 (pid:17864) i/o error result is 0, errno is 0
05/11/20 08:53:17 (pid:17864) Lost connection to shadow, waiting 2400 secs for reconnect
05/11/20 08:53:17 (pid:17864) Got SIGQUIT.  Performing fast shutdown.
05/11/20 08:53:17 (pid:17864) ShutdownFast all jobs.
05/11/20 08:53:17 (pid:17864) Failed to open '.update.ad' to read update ad: Permission denied (13).
05/11/20 08:53:17 (pid:17864) Failed to open '.update.ad' to read update ad: Permission denied (13).
05/11/20 08:53:17 (pid:17864) Failed to send job exit status to shadow
05/11/20 08:53:17 (pid:17864) All jobs have exited... starter exiting
05/11/20 08:53:17 (pid:17864) **** condor_starter (condor_STARTER) pid 17864 EXITING WITH STATUS 0
