Re: [HTCondor-users] proposed change in DAGMan

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Ben Cotton <ben.cotton@xxxxxxxxxxxxxxxxxx>
Sent: Wednesday, June 15, 2016 1:31 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] proposed change in DAGMan
On Wed, Jun 15, 2016 at 2:08 PM, R. Kent Wenger <wenger@xxxxxxxxxxx> wrote:
> The proposed change is that, if DAGMan is "stuck" because all queued node
> jobs are on hold (and there are no ready jobs, running PRE/POST scripts,
> etc.), DAGMan will consider this a failure and abort the DAG (which results
> in all queued node jobs being removed, and a rescue DAG being generated).

> I'm curious as to the motivation for this. If I understand the
> proposal correctly, this leaves workflows with a single node at some
> level (e.g. diamond DAGs) vulnerable to instant-kaboom if there's a
> problem. Sure, the user can just submit the rescue DAG, but that
> doesn't help if the submission happens through some intermediary
> (which is a common use case for some of our customers).

What if it were a timeout? In other words, the config setting would be "abort if the DAG has been stuck for at least N seconds".

The motivation is that right now, if a DAG gets into the "stuck" state, it will stay in that state forever unless the user does something (or the node jobs get released somehow), and it's not very obvious to the user what's going on.
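
To make the "stuck" condition and the timeout idea concrete, here is a minimal sketch in Python. It is not DAGMan's actual code: the names DagState, is_stuck, and StuckTimer, and the timeout_seconds parameter, are all hypothetical and chosen only for illustration. It assumes "stuck" means every queued node job is on hold while no nodes are ready and no PRE/POST scripts are running.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DagState:
    """Hypothetical snapshot of a DAG's progress (not a real DAGMan type)."""
    queued_jobs: int      # node jobs currently in the queue
    held_jobs: int        # queued node jobs that are on hold
    ready_nodes: int      # nodes ready to be submitted
    running_scripts: int  # PRE/POST scripts currently running


def is_stuck(state: DagState) -> bool:
    """True if all queued node jobs are held and nothing else can progress."""
    return (
        state.queued_jobs > 0
        and state.held_jobs == state.queued_jobs
        and state.ready_nodes == 0
        and state.running_scripts == 0
    )


class StuckTimer:
    """Tracks how long the DAG has been continuously stuck."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self.stuck_since: Optional[float] = None

    def should_abort(self, state: DagState, now: float) -> bool:
        """Return True once the DAG has been stuck for at least the timeout."""
        if not is_stuck(state):
            # Any progress (e.g. a job was released) resets the clock.
            self.stuck_since = None
            return False
        if self.stuck_since is None:
            self.stuck_since = now
        return now - self.stuck_since >= self.timeout_seconds
```

In this sketch a timeout of 0 reproduces the originally proposed behavior (abort as soon as the DAG is seen to be stuck), while a larger value gives held jobs a chance to be released before the DAG is aborted.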

> I think this functionality would be a good addition, but why opt-out
> instead of opt-in?

Well, if it's opt-in probably very few users will take advantage of it...