Re: [HTCondor-users] proposed change in DAGMan

On Wed, Jun 15, 2016 at 2:08 PM, R. Kent Wenger <wenger@xxxxxxxxxxx> wrote:
> The proposed change is that, if DAGMan is "stuck" because all queued node
> jobs are on hold (and there are no ready jobs, running PRE/POST scripts,
> etc.), DAGMan will consider this a failure and abort the DAG (which results
> in all queued node jobs being removed, and a rescue DAG being generated).

I'm curious about the motivation for this. If I understand the
proposal correctly, it leaves workflows that have a single node at
some level (e.g. diamond DAGs) vulnerable to instant-kaboom: if that
one node's job goes on hold, the whole DAG is aborted. Sure, the user
can just submit the rescue DAG, but that doesn't help if the
submission happens through some intermediary (which is a common use
case for some of our customers).
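
A minimal sketch of the "stuck" condition, as read from the quoted
proposal, may make the diamond-DAG concern concrete. The DagState
class, its field names, and is_stuck() are hypothetical names used
only for illustration; they are not DAGMan's data model or API, and
the "etc." in the proposal is not modeled.

```python
from dataclasses import dataclass


@dataclass
class DagState:
    """Hypothetical snapshot of a DAG's progress (not DAGMan's real data model)."""
    queued_jobs: int      # node jobs currently in the queue
    held_jobs: int        # queued node jobs that are on hold
    ready_nodes: int      # nodes that could be submitted but have not been yet
    running_scripts: int  # PRE/POST scripts currently running


def is_stuck(state: DagState) -> bool:
    """True when every queued node job is held and nothing else can progress."""
    return (
        state.queued_jobs > 0
        and state.held_jobs == state.queued_jobs
        and state.ready_nodes == 0
        and state.running_scripts == 0
    )


# Diamond DAG (A -> B, C -> D) while only the top node A is queued.
# If A's job goes on hold, nothing else can run, so the rule would fire.
top_of_diamond_held = DagState(queued_jobs=1, held_jobs=1, ready_nodes=0, running_scripts=0)

# Middle level: B is held but C is still running, so the rule would not fire.
middle_one_held = DagState(queued_jobs=2, held_jobs=1, ready_nodes=0, running_scripts=0)

print(is_stuck(top_of_diamond_held))  # True
print(is_stuck(middle_one_held))      # False
```

Under these assumptions, a single held job at a one-node level is
enough to satisfy the condition, while a held job at a wider level is
not, as long as a sibling job is still running.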

I think this functionality would be a good addition, but why opt-out
instead of opt-in?


Ben Cotton

Cycle Computing
Better Answers. Faster.

twitter: @cyclecomputing