
Re: [HTCondor-users] Recurring problem with job starts



Hi Mark, all,

please find my comments below...

On Wed, 2020-09-02 at 17:05:52 -0500, Mark Coatsworth wrote:
> Hi Steffen, a few things to think about here.
> 
> Since your condor_starter is able to create the
> /var/lib/condor/execute/dir_34972 directory, this implies it's not a
> higher level permission or write access problem.

Indeed - and since "all users are equal" with respect to ownership/permission
settings, I ran a test to verify that there would be no "black hole" nodes -
and found none.
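(For the record, such a black-hole test can be as simple as steering a trivial sleep job across the pool one machine at a time and checking that it completes everywhere. A minimal sketch of the submit description - the hostname is a placeholder to be looped over the pool's machines:)

```
# probe.sub - trivial probe job pinned to a single machine
executable   = /bin/sleep
arguments    = 60
requirements = (Machine == "nodeNNN.hypatia.local")
log          = probe.log
output       = probe.out
error        = probe.err
queue
```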

> Are these jobs failing consistently in your pool? Or does the problem
> seem isolated to a subset of misbehaving nodes?

There is/was only a small subset of nodes providing enough resources,
so the effect looked isolated, but it isn't - at least not in terms of
the Condor or OS setup.

What I found is that all affected jobs are DAG nodes that had been running
before, and it rather looks like their shadows have a problem.
In the worst case I will have to tell the (single) affected user to condor_rm
the held jobs and go for rescue DAGs. Since this involves extra work for him,
I'd like to find out more and see whether I can still do something.
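(If it comes to that, the recovery would roughly be the following - owner and
DAG file name are placeholders; JobStatus == 5 selects held jobs, and
condor_submit_dag picks up the newest rescue DAG automatically if one exists:)

```shell
# Sketch only - adjust constraint and DAG file for the affected user's setup.
condor_rm -constraint 'JobStatus == 5 && Owner == "someuser"'
# On resubmission, DAGMan finds <dag>.rescue001 (if present) and runs it:
condor_submit_dag workflow.dag
```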

> You mentioned the execute nodes have other
> /var/lib/condor/execute/dir_NNNNN folders. Can you let us know what
> the ownership and permissions look like on these folders, and the
> files inside them?

NNNNN doesn't seem to be the PID of the starter, nor related to the job id.
Is there a translation table somewhere? NNNNN doesn't seem to stay
constant over multiple restart attempts, so I would not expect any
collisions to persist while changing compute nodes.
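(Though in the log excerpt quoted below, the suffix does match the starter's
own PID - dir_34972 vs. "** PID = 34972" - so a sketch along these lines
should let one correlate them; the paths are the Debian defaults and an
assumption:)

```shell
#!/bin/sh
# Sketch (assumed Debian paths): extract the starter PID recorded in a
# StarterLog's startup banner ("** PID = NNNNN") to correlate it with the
# matching /var/lib/condor/execute/dir_NNNNN entry.
starter_pid() {
  # Print the most recent "** PID = " value found in the given log file.
  sed -n 's/.*\*\* PID = \([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1
}

# Example sweep on an execute node:
#   for log in /var/log/condor/StarterLog.slot*; do
#     pid=$(starter_pid "$log") && [ -n "$pid" ] &&
#       ls -ldn "/var/lib/condor/execute/dir_$pid" 2>/dev/null
#   done
```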

> I'm wondering if it's possible your target execute directory and files
> already exist, and they have ownership or permissions that do not
> allow us to overwrite files. Can you verify the folders mentioned in
> the error messages do not exist?

That would not explain why the effect travels across the cluster.
No other jobs have shown such behaviour since; it's just the set of 17 jobs
that were harmed by a network failure (a switch disconnected a whole rack
from the pool).
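(To rule that out quickly on a node all the same, something like this can
list leftover dir_* entries with unexpected ownership - the expected owner
is site-specific, "condor" in the usage line is an assumption:)

```shell
#!/bin/sh
# Sketch: list dir_* entries under the execute directory that are NOT owned
# by the expected account - any hit would be a pre-existing directory the
# starter cannot overwrite.
stale_dirs() {
  # $1 = execute directory, $2 = expected owner
  find "$1" -mindepth 1 -maxdepth 1 -name 'dir_*' ! -user "$2"
}

# e.g. on an execute node:  stale_dirs /var/lib/condor/execute condor
```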

> Is /var mounting from the local disk, or from a shared file system?
> (I'm assuming not a shared file system! But it's worth making sure)

/var is local everywhere. I've been thinking about mounting /var/lib/condor(/spool)
from a file server for the headnodes only, to more easily preserve their
histories, but haven't got that far yet.

> Lastly, could you show us the job submit file you're using? This might
> have some clues.

Since there are multiple DAGs involved, I'll have to ask for them...

- S

> 
> Mark
> 
> 
> On Wed, Sep 2, 2020 at 2:56 AM Steffen Grunewald
> <steffen.grunewald@xxxxxxxxxx> wrote:
> >
> > Good morning/afternoon/whatever,
> >
> > starting two days ago, I'm getting reports of failed job starts. Jobs affected
> > get held with the following reason:
> >
> > (12)=STARTER at 10.150.1.11 failed to write to file /var/lib/condor/execute/dir_34972/.job.ad: (errno 13) Permission denied
> >
> > on multiple nodes. Removing them from the pool causes the disease to spread to other nodes as well.
> >
> > Checking the starter logs on the node, I see
> >
> > StarterLog.slot1_4:20-09-02_09:39:36  ******************************************************
> > StarterLog.slot1_4:20-09-02_09:39:36  ** condor_starter (CONDOR_STARTER) STARTING UP
> > StarterLog.slot1_4:20-09-02_09:39:36  ** /usr/sbin/condor_starter
> > StarterLog.slot1_4:20-09-02_09:39:36  ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
> > StarterLog.slot1_4:20-09-02_09:39:36  ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
> > StarterLog.slot1_4:20-09-02_09:39:36  ** $CondorVersion: 8.8.3 May 29 2019 BuildID: Debian-8.8.3-1+deb9u0 PackageID: 8.8.3-1+deb9u0 Debian-8.8.3-1+deb9u0 $
> > StarterLog.slot1_4:20-09-02_09:39:36  ** $CondorPlatform: X86_64-Debian_9 $
> > StarterLog.slot1_4:20-09-02_09:39:36  ** PID = 34972
> > StarterLog.slot1_4:20-09-02_09:39:36  ** Log last touched 9/1 13:46:10
> > StarterLog.slot1_4:20-09-02_09:39:36  ******************************************************
> > StarterLog.slot1_4:20-09-02_09:39:36  Using config source: /etc/condor/condor_config
> > StarterLog.slot1_4:20-09-02_09:39:36  Using local config sources:
> > StarterLog.slot1_4:20-09-02_09:39:36     /etc/default/condor_config|
> > StarterLog.slot1_4:20-09-02_09:39:36  config Macros = 331, Sorted = 330, StringBytes = 8497, TablesBytes = 11964
> > StarterLog.slot1_4:20-09-02_09:39:36  CLASSAD_CACHING is OFF
> > StarterLog.slot1_4:20-09-02_09:39:36  Daemon Log is logging: D_ALWAYS D_ERROR
> > StarterLog.slot1_4:20-09-02_09:39:36  Daemoncore: Listening at <10.150.1.11:44421> on TCP (ReliSock) and UDP (SafeSock).
> > StarterLog.slot1_4:20-09-02_09:39:36  DaemonCore: command socket at <10.150.1.11:44421?addrs=10.150.1.11-44421>
> > StarterLog.slot1_4:20-09-02_09:39:36  DaemonCore: private command socket at <10.150.1.11:44421?addrs=10.150.1.11-44421>
> > StarterLog.slot1_4:20-09-02_09:39:36  Communicating with shadow <10.150.100.102:16481?addrs=10.150.100.102-16481&noUDP>
> > StarterLog.slot1_4:20-09-02_09:39:36  Submitting machine is "hypatia2.hypatia.local"
> > StarterLog.slot1_4:20-09-02_09:39:36  setting the orig job name in starter
> > StarterLog.slot1_4:20-09-02_09:39:36  setting the orig job iwd in starter
> > StarterLog.slot1_4:20-09-02_09:39:36  Chirp config summary: IO false, Updates false, Delayed updates true.
> > StarterLog.slot1_4:20-09-02_09:39:36  Initialized IO Proxy.
> > StarterLog.slot1_4:20-09-02_09:39:36  Done setting resource limits
> > StarterLog.slot1_4:20-09-02_09:39:37  get_file(): Failed to open file /var/lib/condor/execute/dir_34972/.machine.ad, errno = 13: Permission denied.
> > StarterLog.slot1_4:20-09-02_09:39:37  get_file(): consumed 7358 bytes of file transmission
> > StarterLog.slot1_4:20-09-02_09:39:37  DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 10.150.1.11 failed to write to file /var/lib/condor/execute/dir_34972/.machine.ad: (errno 13) Permission denied
> > StarterLog.slot1_4:20-09-02_09:39:37  get_file(): Failed to open file /var/lib/condor/execute/dir_34972/.job.ad, errno = 13: Permission denied.
> > StarterLog.slot1_4:20-09-02_09:39:37  get_file(): consumed 7788 bytes of file transmission
> > StarterLog.slot1_4:20-09-02_09:39:37  DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 10.150.1.11 failed to write to file /var/lib/condor/execute/dir_34972/.job.ad: (errno 13) Permission denied
> > StarterLog.slot1_4:20-09-02_09:39:37  File transfer failed (status=0).
> > StarterLog.slot1_4:20-09-02_09:39:37  ERROR "Failed to transfer files" at line 2468 in file /build/condor-8.8.3/src/condor_starter.V6.1/jic_shadow.cpp
> > StarterLog.slot1_4:20-09-02_09:39:37  ShutdownFast all jobs.
> > StarterLog.slot1_4:20-09-02_09:39:37  condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.150.100.102:24377>.
> > StarterLog.slot1_4:20-09-02_09:39:37  IO: Failed to read packet header
> > StarterLog.slot1_4:20-09-02_09:39:37  Lost connection to shadow, waiting 2400 secs for reconnect
> > StarterLog.slot1_4:20-09-02_09:39:37  All jobs have exited... starter exiting
> > StarterLog.slot1_4:20-09-02_09:39:37  **** condor_starter (condor_STARTER) pid 34972 EXITING WITH STATUS 0
> >
> > /var/lib/condor/execute is 0755 condor:condor on the execute node and bears the above timestamp;
> > it contains other active dir_* entries.
> > /var is mounted read-only and almost empty.
> > On the submit node, /var/lib/condor/execute is empty, and apparently always has been.
> >
> > Any suggestion how to debug this further?
> >
> > Thanks,
> >  Steffen
> >
> > --
> > Steffen Grunewald, Cluster Administrator
> > Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
> > Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
> > ~~~
> > Fon: +49-331-567 7274
> > Mail: steffen.grunewald(at)aei.mpg.de
> > ~~~
> > _______________________________________________
> > HTCondor-users mailing list
> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> >
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/htcondor-users/
> 
> 
> 
> --
> Mark Coatsworth
> Systems Programmer
> Center for High Throughput Computing
> Department of Computer Sciences
> University of Wisconsin-Madison
> 

-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~