
Re: [HTCondor-users] proposed change in DAGMan



From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
Date: 06/15/2016 02:12 PM

> We are proposing a change in DAGMan behavior relative to node jobs that
> are on hold, and before implementing it, we wanted to get feedback from
> the HTCondor user community.
>
> Right now, DAGMan will wait indefinitely for jobs that are on hold, even
> if *all* of the node jobs for the DAG are on hold and, therefore, no
> progress is being made.
>
> The proposed change is that, if DAGMan is "stuck" because all queued node
> jobs are on hold (and there are no ready jobs, running PRE/POST scripts,
> etc.), DAGMan will consider this a failure and abort the DAG (which
> results in all queued node jobs being removed, and a rescue DAG being
> generated).
>
> Users would be able to opt out of the new behavior via a configuration
> setting.
>
> Please let us know what you think of this proposal...

My recently implemented update_job_info hook lets users run a periodic
hold and periodic release to restart a hung-but-running job. Perhaps
DAGMan could wait for an update interval to elapse before taking action,
to ensure that a held job isn't going to be released on the next pass?
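
To make the suggestion concrete, here is a rough sketch in Python of what I
have in mind. It is not DAGMan code: DagState, is_stuck, should_abort and
grace_seconds are all names I made up for illustration, and the counts are
assumed to come from whatever bookkeeping DAGMan already does. The idea is
only that the "stuck" condition from the proposal has to hold continuously
for a grace interval before the DAG is aborted.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DagState:
    """Hypothetical snapshot of a DAG's status (not a real DAGMan type)."""
    queued_jobs: int      # node jobs currently in the queue
    held_jobs: int        # queued node jobs that are on hold
    ready_nodes: int      # nodes ready to be submitted
    running_scripts: int  # PRE/POST scripts currently running


def is_stuck(state: DagState) -> bool:
    """True if every queued node job is held and nothing else can progress."""
    return (
        state.queued_jobs > 0
        and state.held_jobs == state.queued_jobs
        and state.ready_nodes == 0
        and state.running_scripts == 0
    )


def should_abort(
    state: DagState,
    stuck_since: Optional[float],
    now: float,
    grace_seconds: float,
) -> Tuple[bool, Optional[float]]:
    """Return (abort, new_stuck_since).

    The caller keeps new_stuck_since between checks. The timer resets as
    soon as the DAG is no longer stuck, e.g. because a periodic release
    put a held job back into the idle state.
    """
    if not is_stuck(state):
        return False, None
    if stuck_since is None:
        # First time we see the stuck condition: start the timer.
        return False, now
    return (now - stuck_since) >= grace_seconds, stuck_since
```

With grace_seconds set to at least one update interval, a job that is held
and then released on the next pass would reset the timer instead of
triggering the abort.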

        -Michael Pelletier.