Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Recurring problem with job starts

Date: Wed, 2 Sep 2020 17:05:52 -0500
From: Mark Coatsworth <coatsworth@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Recurring problem with job starts

Hi Steffen, a few things to think about here.

Since your condor_starter is able to create the
/var/lib/condor/execute/dir_34972 directory, this implies it's not a
higher level permission or write access problem.

Are these jobs failing consistently in your pool? Or does the problem
seem isolated to a subset of misbehaving nodes?

You mentioned the execute nodes have other
/var/lib/condor/execute/dir_NNNNN folders. Can you let us know what
the ownership and permissions look like on these folders, and the
files inside them?

I'm wondering if it's possible your target execute directory and files
already exist, and they have ownership or permissions that do not
allow us to overwrite files. Can you verify the folders mentioned in
the error messages do not exist?

Is /var mounting from the local disk, or from a shared file system?
(I'm assuming not a shared file system! But it's worth making sure)

Lastly, could you show us the job submit file you're using? This might
have some clues.

Mark


On Wed, Sep 2, 2020 at 2:56 AM Steffen Grunewald
<steffen.grunewald@xxxxxxxxxx> wrote:
>
> Good morning/afternoon/whatever,
>
> starting two days ago, I'm getting reports of failed job starts. Jobs affected
> get held with the following reason:
>
> (12)=STARTER at 10.150.1.11 failed to write to file /var/lib/condor/execute/dir_34972/.job.ad: (errno 13) Permission denied
>
> on multiple nodes. Removing them from the pool causes the disease to spread to other nodes as well.
>
> Checking the starter logs on the node, I see
>
> StarterLog.slot1_4:20-09-02_09:39:36  ******************************************************
> StarterLog.slot1_4:20-09-02_09:39:36  ** condor_starter (CONDOR_STARTER) STARTING UP
> StarterLog.slot1_4:20-09-02_09:39:36  ** /usr/sbin/condor_starter
> StarterLog.slot1_4:20-09-02_09:39:36  ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
> StarterLog.slot1_4:20-09-02_09:39:36  ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
> StarterLog.slot1_4:20-09-02_09:39:36  ** $CondorVersion: 8.8.3 May 29 2019 BuildID: Debian-8.8.3-1+deb9u0 PackageID: 8.8.3-1+deb9u0 Debian-8.8.3-1+deb9u0 $
> StarterLog.slot1_4:20-09-02_09:39:36  ** $CondorPlatform: X86_64-Debian_9 $
> StarterLog.slot1_4:20-09-02_09:39:36  ** PID = 34972
> StarterLog.slot1_4:20-09-02_09:39:36  ** Log last touched 9/1 13:46:10
> StarterLog.slot1_4:20-09-02_09:39:36  ******************************************************
> StarterLog.slot1_4:20-09-02_09:39:36  Using config source: /etc/condor/condor_config
> StarterLog.slot1_4:20-09-02_09:39:36  Using local config sources:
> StarterLog.slot1_4:20-09-02_09:39:36     /etc/default/condor_config|
> StarterLog.slot1_4:20-09-02_09:39:36  config Macros = 331, Sorted = 330, StringBytes = 8497, TablesBytes = 11964
> StarterLog.slot1_4:20-09-02_09:39:36  CLASSAD_CACHING is OFF
> StarterLog.slot1_4:20-09-02_09:39:36  Daemon Log is logging: D_ALWAYS D_ERROR
> StarterLog.slot1_4:20-09-02_09:39:36  Daemoncore: Listening at <10.150.1.11:44421> on TCP (ReliSock) and UDP (SafeSock).
> StarterLog.slot1_4:20-09-02_09:39:36  DaemonCore: command socket at <10.150.1.11:44421?addrs=10.150.1.11-44421>
> StarterLog.slot1_4:20-09-02_09:39:36  DaemonCore: private command socket at <10.150.1.11:44421?addrs=10.150.1.11-44421>
> StarterLog.slot1_4:20-09-02_09:39:36  Communicating with shadow <10.150.100.102:16481?addrs=10.150.100.102-16481&noUDP>
> StarterLog.slot1_4:20-09-02_09:39:36  Submitting machine is "hypatia2.hypatia.local"
> StarterLog.slot1_4:20-09-02_09:39:36  setting the orig job name in starter
> StarterLog.slot1_4:20-09-02_09:39:36  setting the orig job iwd in starter
> StarterLog.slot1_4:20-09-02_09:39:36  Chirp config summary: IO false, Updates false, Delayed updates true.
> StarterLog.slot1_4:20-09-02_09:39:36  Initialized IO Proxy.
> StarterLog.slot1_4:20-09-02_09:39:36  Done setting resource limits
> StarterLog.slot1_4:20-09-02_09:39:37  get_file(): Failed to open file /var/lib/condor/execute/dir_34972/.machine.ad, errno = 13: Permission denied.
> StarterLog.slot1_4:20-09-02_09:39:37  get_file(): consumed 7358 bytes of file transmission
> StarterLog.slot1_4:20-09-02_09:39:37  DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 10.150.1.11 failed to write to file /var/lib/condor/execute/dir_34972/.machine.ad: (errno 13) Permission denied
> StarterLog.slot1_4:20-09-02_09:39:37  get_file(): Failed to open file /var/lib/condor/execute/dir_34972/.job.ad, errno = 13: Permission denied.
> StarterLog.slot1_4:20-09-02_09:39:37  get_file(): consumed 7788 bytes of file transmission
> StarterLog.slot1_4:20-09-02_09:39:37  DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 10.150.1.11 failed to write to file /var/lib/condor/execute/dir_34972/.job.ad: (errno 13) Permission denied
> StarterLog.slot1_4:20-09-02_09:39:37  File transfer failed (status=0).
> StarterLog.slot1_4:20-09-02_09:39:37  ERROR "Failed to transfer files" at line 2468 in file /build/condor-8.8.3/src/condor_starter.V6.1/jic_shadow.cpp
> StarterLog.slot1_4:20-09-02_09:39:37  ShutdownFast all jobs.
> StarterLog.slot1_4:20-09-02_09:39:37  condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.150.100.102:24377>.
> StarterLog.slot1_4:20-09-02_09:39:37  IO: Failed to read packet header
> StarterLog.slot1_4:20-09-02_09:39:37  Lost connection to shadow, waiting 2400 secs for reconnect
> StarterLog.slot1_4:20-09-02_09:39:37  All jobs have exited... starter exiting
> StarterLog.slot1_4:20-09-02_09:39:37  **** condor_starter (condor_STARTER) pid 34972 EXITING WITH STATUS 0
>
> /var/lib/condor/execute is 0755 condor:condor on the execute node and bears the above timestamp;
> it contains other active dir_* entries.
> /var is mounted read-only and almost empty.
> On the submit node, /var/lib/condor/execute is empty, and apparently always has been.
>
> Any suggestion how to debug this further?
>
> Thanks,
>  Steffen
>
> --
> Steffen Grunewald, Cluster Administrator
> Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
> Am MÃhlenberg 1 * D-14476 Potsdam-Golm * Germany
> ~~~
> Fon: +49-331-567 7274
> Mail: steffen.grunewald(at)aei.mpg.de
> ~~~
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/



--
Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison

Follow-Ups:
- Re: [HTCondor-users] Recurring problem with job starts
  - From: Steffen Grunewald

References:
- [HTCondor-users] Recurring problem with job starts
  - From: Steffen Grunewald

Prev by Date: [HTCondor-users] Recurring problem with job starts
Next by Date: Re: [HTCondor-users] Recurring problem with job starts
Previous by thread: [HTCondor-users] Recurring problem with job starts
Next by thread: Re: [HTCondor-users] Recurring problem with job starts
Index(es):
- Date
- Thread

Mailing List Archives

Public Access

Re: [HTCondor-users] Recurring problem with job starts