Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] proposed change in DAGMan

Date: Wed, 15 Jun 2016 17:12:47 -0400
From: Michael V Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx>
Subject: Re: [HTCondor-users] proposed change in DAGMan

From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
Date: 06/15/2016 02:12 PM

> We are proposing a change in DAGMan behavior relative to node jobs that
> are on hold, and before implementing it, we wanted to get feedback from
> the HTCondor user community.
>
> Right now, DAGMan will wait indefinitely for jobs that are on hold, even
> if *all* of the node jobs for the DAG are on hold and, therefore, no
> progress is being made.
>
> The proposed change is that, if DAGMan is "stuck" because all queued node
> jobs are on hold (and there are no ready jobs, running PRE/POST scripts,
> etc.), DAGMan will consider this a failure and abort the DAG (which
> results in all queued node jobs being removed, and a rescue DAG being
> generated).
>
> Users would be able to opt out of the new behavior via a configuration
> setting.
>
> Please let us know what you think of this proposal...

My recently implemented update_job_info hook enables users to run a
periodic hold and periodic release to restart a hung-but-running job.
Perhaps DAGMan could wait for an update interval to elapse before taking
action, to ensure that a held job isn't going to be released on the next
pass?
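
A minimal sketch of the grace-period idea, in Python. This is not DAGMan's
actual code: DagState, is_stuck, should_abort and grace_seconds are all
hypothetical names made up for illustration, and the grace period is assumed
to be at least one update interval. The check only reports an abort once the
"all queued node jobs are on hold" condition has persisted for the whole
grace period, so a periodic release on the next pass gets a chance to clear
it first.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DagState:
    """Hypothetical snapshot of a DAG; these are not real DAGMan fields."""
    queued_jobs: int      # node jobs currently in the queue
    held_jobs: int        # queued node jobs that are on hold
    ready_nodes: int      # nodes ready to be submitted
    running_scripts: int  # PRE/POST scripts currently running


def is_stuck(state: DagState) -> bool:
    """True when every queued node job is held and nothing else can progress."""
    return (
        state.queued_jobs > 0
        and state.held_jobs == state.queued_jobs
        and state.ready_nodes == 0
        and state.running_scripts == 0
    )


def should_abort(
    state: DagState,
    stuck_since: Optional[float],
    now: float,
    grace_seconds: float,
) -> Tuple[bool, Optional[float]]:
    """Return (abort, stuck_since) for one polling pass.

    stuck_since is the time the DAG was first seen stuck, or None.
    The caller passes the returned value back in on the next pass.
    """
    if not is_stuck(state):
        # Some job was released or other work appeared: reset the timer.
        return False, None
    if stuck_since is None:
        # First pass on which the DAG looks stuck: start the timer.
        return False, now
    # Still stuck: abort only once the grace period has fully elapsed.
    return (now - stuck_since) >= grace_seconds, stuck_since
```

With a grace period no shorter than the update interval, a job that a
periodic release frees on the next pass resets the timer instead of
triggering the abort.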

-Michael Pelletier.

