[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[HTCondor-users] Recurring problem with job starts



Good morning/afternoon/whatever,

starting two days ago, I'm getting reports of failed job starts. Jobs affected
get held with the following reason:

(12)=STARTER at 10.150.1.11 failed to write to file /var/lib/condor/execute/dir_34972/.job.ad: (errno 13) Permission denied

on multiple nodes. Removing them from the pool causes the disease to spread to other nodes as well.

Checking the starter logs on the node, I see

StarterLog.slot1_4:20-09-02_09:39:36  ******************************************************
StarterLog.slot1_4:20-09-02_09:39:36  ** condor_starter (CONDOR_STARTER) STARTING UP
StarterLog.slot1_4:20-09-02_09:39:36  ** /usr/sbin/condor_starter
StarterLog.slot1_4:20-09-02_09:39:36  ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
StarterLog.slot1_4:20-09-02_09:39:36  ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
StarterLog.slot1_4:20-09-02_09:39:36  ** $CondorVersion: 8.8.3 May 29 2019 BuildID: Debian-8.8.3-1+deb9u0 PackageID: 8.8.3-1+deb9u0 Debian-8.8.3-1+deb9u0 $
StarterLog.slot1_4:20-09-02_09:39:36  ** $CondorPlatform: X86_64-Debian_9 $
StarterLog.slot1_4:20-09-02_09:39:36  ** PID = 34972
StarterLog.slot1_4:20-09-02_09:39:36  ** Log last touched 9/1 13:46:10
StarterLog.slot1_4:20-09-02_09:39:36  ******************************************************
StarterLog.slot1_4:20-09-02_09:39:36  Using config source: /etc/condor/condor_config
StarterLog.slot1_4:20-09-02_09:39:36  Using local config sources: 
StarterLog.slot1_4:20-09-02_09:39:36     /etc/default/condor_config|
StarterLog.slot1_4:20-09-02_09:39:36  config Macros = 331, Sorted = 330, StringBytes = 8497, TablesBytes = 11964
StarterLog.slot1_4:20-09-02_09:39:36  CLASSAD_CACHING is OFF
StarterLog.slot1_4:20-09-02_09:39:36  Daemon Log is logging: D_ALWAYS D_ERROR
StarterLog.slot1_4:20-09-02_09:39:36  Daemoncore: Listening at <10.150.1.11:44421> on TCP (ReliSock) and UDP (SafeSock).
StarterLog.slot1_4:20-09-02_09:39:36  DaemonCore: command socket at <10.150.1.11:44421?addrs=10.150.1.11-44421>
StarterLog.slot1_4:20-09-02_09:39:36  DaemonCore: private command socket at <10.150.1.11:44421?addrs=10.150.1.11-44421>
StarterLog.slot1_4:20-09-02_09:39:36  Communicating with shadow <10.150.100.102:16481?addrs=10.150.100.102-16481&noUDP>
StarterLog.slot1_4:20-09-02_09:39:36  Submitting machine is "hypatia2.hypatia.local"
StarterLog.slot1_4:20-09-02_09:39:36  setting the orig job name in starter
StarterLog.slot1_4:20-09-02_09:39:36  setting the orig job iwd in starter
StarterLog.slot1_4:20-09-02_09:39:36  Chirp config summary: IO false, Updates false, Delayed updates true.
StarterLog.slot1_4:20-09-02_09:39:36  Initialized IO Proxy.
StarterLog.slot1_4:20-09-02_09:39:36  Done setting resource limits
StarterLog.slot1_4:20-09-02_09:39:37  get_file(): Failed to open file /var/lib/condor/execute/dir_34972/.machine.ad, errno = 13: Permission denied.
StarterLog.slot1_4:20-09-02_09:39:37  get_file(): consumed 7358 bytes of file transmission
StarterLog.slot1_4:20-09-02_09:39:37  DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 10.150.1.11 failed to write to file /var/lib/condor/execute/dir_34972/.machine.ad: (errno 13) Permission denied
StarterLog.slot1_4:20-09-02_09:39:37  get_file(): Failed to open file /var/lib/condor/execute/dir_34972/.job.ad, errno = 13: Permission denied.
StarterLog.slot1_4:20-09-02_09:39:37  get_file(): consumed 7788 bytes of file transmission
StarterLog.slot1_4:20-09-02_09:39:37  DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 10.150.1.11 failed to write to file /var/lib/condor/execute/dir_34972/.job.ad: (errno 13) Permission denied
StarterLog.slot1_4:20-09-02_09:39:37  File transfer failed (status=0).
StarterLog.slot1_4:20-09-02_09:39:37  ERROR "Failed to transfer files" at line 2468 in file /build/condor-8.8.3/src/condor_starter.V6.1/jic_shadow.cpp
StarterLog.slot1_4:20-09-02_09:39:37  ShutdownFast all jobs.
StarterLog.slot1_4:20-09-02_09:39:37  condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.150.100.102:24377>.
StarterLog.slot1_4:20-09-02_09:39:37  IO: Failed to read packet header
StarterLog.slot1_4:20-09-02_09:39:37  Lost connection to shadow, waiting 2400 secs for reconnect
StarterLog.slot1_4:20-09-02_09:39:37  All jobs have exited... starter exiting
StarterLog.slot1_4:20-09-02_09:39:37  **** condor_starter (condor_STARTER) pid 34972 EXITING WITH STATUS 0

/var/lib/condor/execute is 0755 condor:condor on the execute node and bears the above timestamp;
it contains other active dir_* entries.
/var is mounted read-only and almost empty.
On the submit node, /var/lib/condor/execute is empty, and apparently always has been.

Any suggestion how to debug this further?

Thanks,
 Steffen

-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~