
Re: [Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0

Better yet, the easiest thing to do now is just to specify a POST script which, when it sees that the job failed, sleeps until it sees some special file appear, and then uses that file's (integer) contents as its own return code. Combined with a RETRY, this would allow a human to decide whether the node should succeed or retry (by writing a 0 or 1 to the special file, respectively), and then let DAGMan do the rest.

Obviously this too is a short-term hack until we can give you something better -- but it's MUCH simpler and more robust than your current approach.
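
For illustration, here is a minimal sketch of such a POST script. It is not anything shipped with Condor: the script name, the decision-file path, the 30-second polling interval, and the choice to delete the decision file after reading it are all assumptions made for the example. It expects the job's exit status as its first argument (which DAGMan can supply to a POST script via $RETURN) and the path of the special file as its second.

```python
#!/usr/bin/env python3
"""Hypothetical POST script: wait for a human verdict when the node job fails."""
import os
import sys
import time


def read_decision(path):
    """Return the integer stored in `path`, or None if it is missing or not yet valid."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        # File not there yet, or only partially written: keep waiting.
        return None


def wait_for_decision(path, poll_seconds=30.0):
    """Block until `path` holds an integer, then consume the file and return the value."""
    while True:
        decision = read_decision(path)
        if decision is not None:
            # Assumption: remove the file so a later retry waits for a fresh decision.
            os.remove(path)
            return decision
        time.sleep(poll_seconds)


def main(argv):
    job_status = int(argv[1])   # exit status of the node's job
    decision_file = argv[2]     # hypothetical path a human will write 0 or 1 to
    if job_status == 0:
        return 0                # job succeeded, nothing to decide
    # Job failed: 0 in the file marks the node successful, 1 makes it fail and RETRY.
    return wait_for_decision(decision_file)


# Only run as a script when both arguments are supplied.
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv))
```

In the DAG file this might be wired up with lines such as "SCRIPT POST NodeA wait_for_decision.py $RETURN NodeA.decision" and "RETRY NodeA 5", where the node name, file names, and retry count are placeholders. A human then runs something like "echo 0 > NodeA.decision" to accept the node, or "echo 1 > NodeA.decision" to send it around again.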


On Aug 17, 2006, at 12:24 PM, Matt Hope wrote:

On 8/17/06, Peter F. Couvares <pfc@xxxxxxxxxxx> wrote:

I understand now, thanks.  We've actually considered supporting this
precise situation as a first-class feature (registering a "pause" in
the DAG for human intervention).

Let me talk with Kent and get back to you soon about what we might be
able to do to help replace your Rube Goldberg machine... :)

As a hack, could you have a job which is submitted to the scheduler
universe and waits for user input in some manner (say writing to a log
or sending an email, whatever floats your boat)? Then the user can edit
some file/database etc. and the 'pause' job polls this until it is happy
you have made your decision (beautifully coming back to life if your
machine goes down etc.) and indicates as its return code the choice of
the user.

I'm no DAGMan user so this might have a hole you can drive a truck
through, but it sounds roughly feasible. If the scheduler universe is
not allowed, a startd dedicated to the pause processes with a very
large number of VMs would also do the trick (albeit without playing
as nice if it died).

