
Re: [HTCondor-users] Recurring problem with job starts



On Thu, 2020-09-03 at 10:17:30 +0200, Thomas Hartmann wrote:
> Hi Steffen,
> 
> just guessing, but have you checked the number of open file handles?

Good point, but since this happened after a network failure, with a much-reduced
number of jobs, it's rather improbable, isn't it?
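
For completeness, a quick check would be something along these lines on an
execute node, comparing the startd's open descriptors against its limit:

  pid=$(pgrep -o condor_startd)
  ls /proc/$pid/fd | wc -l                  # descriptors currently open
  grep 'Max open files' /proc/$pid/limits   # the corresponding limit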

Everything was running before, then the network broke, causing the jobs to
be held - and now they cannot be released again.
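
In case it helps, the state I'm looking at boils down to something like this
(job ids elided):

  condor_q -hold                                          # held jobs with their HoldReason
  condor_q -constraint 'JobStatus == 5' -af:j HoldReason HoldReasonCode
  condor_release <cluster>.<proc>                         # the step that doesn't get them running again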

- S

> 
> Cheers,
>   Thomas
> 
> 
> On 03/09/2020 09.38, Steffen Grunewald wrote:
> > Hi Mark, all,
> > 
> > please find my comments below...
> > 
> > On Wed, 2020-09-02 at 17:05:52 -0500, Mark Coatsworth wrote:
> >> Hi Steffen, a few things to think about here.
> >>
> >> Since your condor_starter is able to create the
> >> /var/lib/condor/execute/dir_34972 directory, this implies it's not a
> >> higher level permission or write access problem.
> > 
> > Indeed - and since "all users are equal" with respect to ownership/permission
> > settings, I ran a test to verify that there would be no "black hole" nodes -
> > and found none.
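> >
> > Roughly, the probe amounts to something like this (a sketch, the file
> > names are made up): a no-op job steered to each machine in turn, then
> > checking which ones never manage to run it.
> >
> >   # probe.sub
> >   executable = /bin/true
> >   log        = probe.log
> >   queue
> >
> >   for m in $(condor_status -af Machine | sort -u); do
> >     condor_submit -append "requirements = (Machine == \"$m\")" probe.sub
> >   done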
> > 
> >> Are these jobs failing consistently in your pool? Or does the problem
> >> seem isolated to a subset of misbehaving nodes?
> > 
> > There is/was only a small subset of nodes providing enough resources,
> > so the effect looked isolated, but it isn't - at least not in terms of
> > Condor or OS setup.
> > 
> > What I found is that all affected jobs are DAG nodes that had been running
> > before, and it rather looks like their shadows have a problem.
> > In the worst case I will have to tell the (single) affected user to condor_rm
> > the held jobs and go for rescue DAGs. Since this involves extra work for him,
> > I'd like to find out more first and see whether I can still do something.
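> >
> > For reference, that fallback would look roughly like this (DAG file name made up):
> >
> >   condor_rm <dagman-cluster-id>     # removes the DAGMan job and its held node jobs
> >   condor_submit_dag workflow.dag    # picks up the newest .rescueNNN file automatically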
> > 
> >> You mentioned the execute nodes have other
> >> /var/lib/condor/execute/dir_NNNNN folders. Can you let us know what
> >> the ownership and permissions look like on these folders, and the
> >> files inside them?
> > 
> > NNNNN doesn't seem to be the PID of the starter, nor related to the job id.
> > Is there a translation table somewhere? NNNNN doesn't seem to stay
> > constant over multiple restart attempts, so I would not expect any
> > collisions to persist when a job changes compute nodes.
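> >
> > For the ownership question and the dir_NNNNN mapping, the obvious
> > cross-checks on an affected execute node would be something like
> > (log paths as in the Debian packaging):
> >
> >   condor_who                                                # slot, job id and PID side by side
> >   ls -ld /var/lib/condor/execute/dir_*                      # ownership and modes
> >   grep -l 'PID = 34972' /var/log/condor/StarterLog.slot*    # which slot log goes with dir_34972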
> > 
> >> I'm wondering if it's possible your target execute directory and files
> >> already exist, and they have ownership or permissions that do not
> >> allow us to overwrite files. Can you verify the folders mentioned in
> >> the error messages do not exist?
> > 
> > That would not explain why the effect travels across the cluster.
> > No other jobs have shown such a behaviour since; it's just a set of 17 jobs
> > that got harmed by a network failure (a switch disconnected a whole rack
> > from the pool).
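> >
> > (To rule out leftovers anyway - assuming the dir_NNNNN name encodes the
> > starter PID, as it does in the log further down - something like
> >
> >   for d in /var/lib/condor/execute/dir_*; do
> >     pid=${d##*_}
> >     [ -d /proc/$pid ] || echo "stale: $d $(stat -c '%U:%G %a' "$d")"
> >   done
> >
> > on the execute nodes would flag directories whose starter is gone.)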
> > 
> >> Is /var mounting from the local disk, or from a shared file system?
> >> (I'm assuming not a shared file system! But it's worth making sure)
> > 
> > /var is local everywhere. I've been thinking about mounting /var/lib/condor(/spool)
> > from a file server for the headnodes only, to more easily preserve their
> > histories, but haven't got that far yet.
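> >
> > Quick way to double-check, for the record:
> >
> >   findmnt -T /var/lib/condor/execute    # local device vs. network filesystem
> >   condor_config_val EXECUTE             # where the starter actually creates dir_NNNNN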
> > 
> >> Lastly, could you show us the job submit file you're using? This might
> >> have some clues.
> > 
> > Since there are multiple DAGs involved I've got to ask for them...
> > 
> > - S
> > 
> >>
> >> Mark
> >>
> >>
> >> On Wed, Sep 2, 2020 at 2:56 AM Steffen Grunewald
> >> <steffen.grunewald@xxxxxxxxxx> wrote:
> >>>
> >>> Good morning/afternoon/whatever,
> >>>
> >>> starting two days ago, I've been getting reports of failed job starts. Affected
> >>> jobs get held with the following reason:
> >>>
> >>> (12)=STARTER at 10.150.1.11 failed to write to file /var/lib/condor/execute/dir_34972/.job.ad: (errno 13) Permission denied
> >>>
> >>> on multiple nodes. Removing those nodes from the pool causes the problem to spread to other nodes as well.
> >>>
> >>> Checking the starter logs on the node, I see
> >>>
> >>> StarterLog.slot1_4:20-09-02_09:39:36  ******************************************************
> >>> StarterLog.slot1_4:20-09-02_09:39:36  ** condor_starter (CONDOR_STARTER) STARTING UP
> >>> StarterLog.slot1_4:20-09-02_09:39:36  ** /usr/sbin/condor_starter
> >>> StarterLog.slot1_4:20-09-02_09:39:36  ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
> >>> StarterLog.slot1_4:20-09-02_09:39:36  ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
> >>> StarterLog.slot1_4:20-09-02_09:39:36  ** $CondorVersion: 8.8.3 May 29 2019 BuildID: Debian-8.8.3-1+deb9u0 PackageID: 8.8.3-1+deb9u0 Debian-8.8.3-1+deb9u0 $
> >>> StarterLog.slot1_4:20-09-02_09:39:36  ** $CondorPlatform: X86_64-Debian_9 $
> >>> StarterLog.slot1_4:20-09-02_09:39:36  ** PID = 34972
> >>> StarterLog.slot1_4:20-09-02_09:39:36  ** Log last touched 9/1 13:46:10
> >>> StarterLog.slot1_4:20-09-02_09:39:36  ******************************************************
> >>> StarterLog.slot1_4:20-09-02_09:39:36  Using config source: /etc/condor/condor_config
> >>> StarterLog.slot1_4:20-09-02_09:39:36  Using local config sources:
> >>> StarterLog.slot1_4:20-09-02_09:39:36     /etc/default/condor_config|
> >>> StarterLog.slot1_4:20-09-02_09:39:36  config Macros = 331, Sorted = 330, StringBytes = 8497, TablesBytes = 11964
> >>> StarterLog.slot1_4:20-09-02_09:39:36  CLASSAD_CACHING is OFF
> >>> StarterLog.slot1_4:20-09-02_09:39:36  Daemon Log is logging: D_ALWAYS D_ERROR
> >>> StarterLog.slot1_4:20-09-02_09:39:36  Daemoncore: Listening at <10.150.1.11:44421> on TCP (ReliSock) and UDP (SafeSock).
> >>> StarterLog.slot1_4:20-09-02_09:39:36  DaemonCore: command socket at <10.150.1.11:44421?addrs=10.150.1.11-44421>
> >>> StarterLog.slot1_4:20-09-02_09:39:36  DaemonCore: private command socket at <10.150.1.11:44421?addrs=10.150.1.11-44421>
> >>> StarterLog.slot1_4:20-09-02_09:39:36  Communicating with shadow <10.150.100.102:16481?addrs=10.150.100.102-16481&noUDP>
> >>> StarterLog.slot1_4:20-09-02_09:39:36  Submitting machine is "hypatia2.hypatia.local"
> >>> StarterLog.slot1_4:20-09-02_09:39:36  setting the orig job name in starter
> >>> StarterLog.slot1_4:20-09-02_09:39:36  setting the orig job iwd in starter
> >>> StarterLog.slot1_4:20-09-02_09:39:36  Chirp config summary: IO false, Updates false, Delayed updates true.
> >>> StarterLog.slot1_4:20-09-02_09:39:36  Initialized IO Proxy.
> >>> StarterLog.slot1_4:20-09-02_09:39:36  Done setting resource limits
> >>> StarterLog.slot1_4:20-09-02_09:39:37  get_file(): Failed to open file /var/lib/condor/execute/dir_34972/.machine.ad, errno = 13: Permission denied.
> >>> StarterLog.slot1_4:20-09-02_09:39:37  get_file(): consumed 7358 bytes of file transmission
> >>> StarterLog.slot1_4:20-09-02_09:39:37  DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 10.150.1.11 failed to write to file /var/lib/condor/execute/dir_34972/.machine.ad: (errno 13) Permission denied
> >>> StarterLog.slot1_4:20-09-02_09:39:37  get_file(): Failed to open file /var/lib/condor/execute/dir_34972/.job.ad, errno = 13: Permission denied.
> >>> StarterLog.slot1_4:20-09-02_09:39:37  get_file(): consumed 7788 bytes of file transmission
> >>> StarterLog.slot1_4:20-09-02_09:39:37  DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 10.150.1.11 failed to write to file /var/lib/condor/execute/dir_34972/.job.ad: (errno 13) Permission denied
> >>> StarterLog.slot1_4:20-09-02_09:39:37  File transfer failed (status=0).
> >>> StarterLog.slot1_4:20-09-02_09:39:37  ERROR "Failed to transfer files" at line 2468 in file /build/condor-8.8.3/src/condor_starter.V6.1/jic_shadow.cpp
> >>> StarterLog.slot1_4:20-09-02_09:39:37  ShutdownFast all jobs.
> >>> StarterLog.slot1_4:20-09-02_09:39:37  condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.150.100.102:24377>.
> >>> StarterLog.slot1_4:20-09-02_09:39:37  IO: Failed to read packet header
> >>> StarterLog.slot1_4:20-09-02_09:39:37  Lost connection to shadow, waiting 2400 secs for reconnect
> >>> StarterLog.slot1_4:20-09-02_09:39:37  All jobs have exited... starter exiting
> >>> StarterLog.slot1_4:20-09-02_09:39:37  **** condor_starter (condor_STARTER) pid 34972 EXITING WITH STATUS 0
> >>>
> >>> /var/lib/condor/execute is 0755 condor:condor on the execute node and bears the above timestamp;
> >>> it contains other active dir_* entries.
> >>> /var is mounted read-only and almost empty.
> >>> On the submit node, /var/lib/condor/execute is empty, and apparently always has been.
> >>>
> >>> Any suggestion how to debug this further?
> >>>
> >>> Thanks,
> >>>  Steffen
> >>>
> >>> --
> >>> Steffen Grunewald, Cluster Administrator
> >>> Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
> >>> Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
> >>> ~~~
> >>> Fon: +49-331-567 7274
> >>> Mail: steffen.grunewald(at)aei.mpg.de
> >>> ~~~
> >>> _______________________________________________
> >>> HTCondor-users mailing list
> >>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> >>> subject: Unsubscribe
> >>> You can also unsubscribe by visiting
> >>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> >>>
> >>> The archives can be found at:
> >>> https://lists.cs.wisc.edu/archive/htcondor-users/
> >>
> >>
> >>
> >> --
> >> Mark Coatsworth
> >> Systems Programmer
> >> Center for High Throughput Computing
> >> Department of Computer Sciences
> >> University of Wisconsin-Madison
> >>
> >> _______________________________________________
> >> HTCondor-users mailing list
> >> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> >> subject: Unsubscribe
> >> You can also unsubscribe by visiting
> >> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> >>
> >> The archives can be found at:
> >> https://lists.cs.wisc.edu/archive/htcondor-users/
> > 
> 



-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~