
Re: [Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0



Horvátth,

I understand now, thanks. We've actually considered supporting this precise situation as a first-class feature (registering a "pause" in the DAG for human intervention).

Let me talk with Kent and get back to you soon about what we might be able to do to help replace your Rube Goldberg machine... :)

Thanks,

-Peter



On Aug 17, 2006, at 10:11 AM, Horvátth Szabolcs wrote:

I'm not sure I understand what you're doing -- but I'm not surprised it stopped working, as it's akin to brain surgery on a live, moving patient. :)
Yes, I did have some casualties while developing the process... ;)
Actually it does work. Just after I sent the mail, I re-read the config part of the 6.8.0 docs and found that DAGMAN_ALLOW_EVENTS=5 basically disables all DAG job accounting trickery.
With this setting the surgery works nicely and smoothly.

But since it is a real hack of a solution, let me describe my problem in detail:
If the issue is jobs which fail sometimes due to factors outside your control, but which succeed if re-submitted, then why not use DAGMan's RETRY feature?
The problem with the retry feature is that sometimes the user has to check the output of a job to decide whether the calculation was successful. Sometimes a small problem gets past the software's error-checking mechanisms, and it is the user who has to be able to re-submit the job. For example, a network problem may temporarily prevent a machine from accessing a file or database. Or a DAGMan job that depends on pre-calculated data is accidentally run just before another job has finished producing that data. (A real-world example from this very day. :))
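
For readers unfamiliar with the RETRY feature discussed here, the following is a minimal Python sketch that renders a DAGMan input file with a RETRY line. The node names `A` and `B`, the submit file names `a.sub` and `b.sub`, and the helper `render_dag` are hypothetical, chosen only for illustration. RETRY acts only when a node reports failure, which is why it does not cover the case above where a human has to judge the output.

```python
def render_dag(jobs, edges, retries):
    """Return the text of a DAGMan input file.

    jobs:    dict mapping node name -> submit file name
    edges:   list of (parent, child) node-name pairs
    retries: dict mapping node name -> maximum number of retries
    """
    lines = [f"JOB {name} {submit}" for name, submit in jobs.items()]
    lines += [f"PARENT {parent} CHILD {child}" for parent, child in edges]
    lines += [f"RETRY {name} {count}" for name, count in retries.items()]
    return "\n".join(lines) + "\n"


# Hypothetical two-node DAG: B depends on A, and B is retried up to 3 times.
dag_text = render_dag(
    jobs={"A": "a.sub", "B": "b.sub"},
    edges=[("A", "B")],
    retries={"B": 3},
)
```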

What I'd like to do is restart a completed job (one that was submitted by DAGMan) with all its parent (and optionally child) dependencies restarted as well. Let's say Job A generates some data that Job B uses and deletes after it completes. If I want to restart Job B, I need to run Job A too (to generate the data), and only after Job A has completed can Job B execute and run successfully.
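
The restart rule described above can be sketched in Python as follows. This is only an illustration of the dependency logic, not anything DAGMan provides: the function `rerun_order`, the `parents` mapping, and the extra node `C` are assumptions made for the example. It collects the chosen node, all of its ancestors, and optionally its descendants, then orders them so every parent runs before its children.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def rerun_order(parents, target, include_children=False):
    """Return the nodes to re-run for `target`, parents first.

    parents: dict mapping each node to the set of its direct parents.
    """
    def reachable(start, neighbours):
        # Depth-first walk collecting `start` and everything reachable from it.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(neighbours.get(node, ()))
        return seen

    selected = reachable(target, parents)  # target plus all ancestors
    if include_children:
        children = {}
        for node, node_parents in parents.items():
            for parent in node_parents:
                children.setdefault(parent, set()).add(node)
        selected |= reachable(target, children)  # plus all descendants

    # Keep only dependencies between selected nodes, then sort topologically.
    subgraph = {n: {p for p in parents.get(n, ()) if p in selected}
                for n in selected}
    return list(TopologicalSorter(subgraph).static_order())


# Hypothetical chain A -> B -> C, restarting B.
parents = {"A": set(), "B": {"A"}, "C": {"B"}}
only_parents = rerun_order(parents, "B")
with_children = rerun_order(parents, "B", include_children=True)
```

Note that with `include_children=True` a descendant may have other parents that are not ancestors of the target; this sketch does not re-run those.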

Now the tricky part is this: if Job A calculates the data locally and modifies the submit file of Job B to tell it where to look for that data, then simply restarting Job B does not work, because the job in the queue is no longer in sync with the submit file. So when Job A is rerun, it should not only modify the submit file of Job B (just for the record, since Job B is not resubmitted) but also modify the attributes of Job B in the queue.
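
To make the "modify the attributes of Job B in the queue" step concrete, here is a hedged Python sketch that only builds a `condor_qedit` command line and does not run it. The job id `42.0`, the attribute name `DataDir`, the path, and the helper `qedit_command` are all hypothetical. It assumes the attribute is a string, which `condor_qedit` expects to be wrapped in double quotes. As noted elsewhere in this thread, editing queued DAG jobs this way works around DAGMan rather than with it, so treat this as a picture of the workaround, not a recommendation.

```python
def qedit_command(job_id, attribute, value):
    """Build the argv for a condor_qedit call that sets a string attribute.

    job_id:    "cluster.proc" of the queued job (hypothetical here)
    attribute: name of the job attribute to change (hypothetical here)
    value:     new string value; embedded double quotes are rejected
               rather than escaped, to keep the sketch simple
    """
    if '"' in value:
        raise ValueError("value must not contain double quotes")
    # String values are passed to condor_qedit wrapped in double quotes.
    return ["condor_qedit", job_id, attribute, f'"{value}"']


# Hypothetical: point queued Job B at the data Job A just regenerated.
command = qedit_command("42.0", "DataDir", "/scratch/jobA")
```

The resulting list could be handed to `subprocess.run(command, check=True)` on a machine where Condor is installed.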

This is what I'd like to achieve without hacking DAGMan's settings with DAGMAN_ALLOW_EVENTS.

Cheers,
Szabolcs

P.S. I still don't understand why a condor_restart command does not exist. To restart an already completed job I have to use condor_hold and condor_restart every time, and sometimes this has a side effect on Windows where the job goes into the removed state.



Peter F. Couvares wrote:
Horvátth,

I'm not sure I understand what you're doing -- but I'm not surprised it stopped working, as it's akin to brain surgery on a live, moving patient. :)

If the issue is jobs which fail sometimes due to factors outside your control, but which succeed if re-submitted, then why not use DAGMan's RETRY feature?

If that's not sufficient, please describe the problem in a little more detail. I'm optimistic there's a better solution than using condor_qedit. DAGMan's underlying implementation is obviously subject to change, so relying on a script which circumvents the supported API & semantics is going to be fragile.

-Peter