
[HTCondor-users] proposed change in DAGMan



We are proposing a change in how DAGMan handles node jobs that are on hold. Before implementing it, we would like to get feedback from the HTCondor user community.

Right now, DAGMan will wait indefinitely for jobs that are on hold, even if *all* of the node jobs for the DAG are on hold and, therefore, no progress is being made.

The proposed change is that, if DAGMan is "stuck" because all queued node jobs are on hold (and there are no ready jobs, running PRE/POST scripts, etc.), DAGMan will consider this a failure and abort the DAG. Aborting the DAG results in all queued node jobs being removed and a rescue DAG being generated.
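
To make the "stuck" condition concrete, here is a minimal Python sketch of the check as described above. It is only an illustration, not DAGMan code: the DagState fields, the function names, and the abort_when_stuck flag (standing in for the opt-out) are all hypothetical, and treating a DAG with no queued jobs as "not stuck" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class DagState:
    """Hypothetical snapshot of a DAG; not DAGMan's real data structure."""
    queued_jobs: int      # node jobs currently in the queue (including held ones)
    held_jobs: int        # queued node jobs that are on hold
    ready_nodes: int      # nodes that are ready to be submitted
    running_scripts: int  # PRE/POST scripts currently running


def is_stuck_on_hold(state: DagState) -> bool:
    """Return True if every queued node job is held and nothing else can progress."""
    if state.queued_jobs == 0:
        # Assumption: with nothing queued, the DAG is not "stuck on hold".
        return False
    all_held = state.held_jobs == state.queued_jobs
    nothing_else_pending = state.ready_nodes == 0 and state.running_scripts == 0
    return all_held and nothing_else_pending


def should_abort(state: DagState, abort_when_stuck: bool = True) -> bool:
    """abort_when_stuck is a hypothetical stand-in for the opt-out setting."""
    return abort_when_stuck and is_stuck_on_hold(state)
```

Under this sketch, a DAG with three queued jobs that are all held, no ready nodes, and no running scripts would be aborted. The same DAG with one job still idle or running, or with a POST script in progress, would be left alone.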

Users would be able to opt out of the new behavior via a configuration setting.

Please let us know what you think of this proposal...

Kent
--
R. Kent Wenger (wenger@xxxxxxxxxxx, 608-262-6627,
http://www.cs.wisc.edu/~wenger/)
Computer Sciences Department
University of Wisconsin-Madison