
Re: [Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0



> I'm not sure I understand what you're doing -- but I'm not surprised it stopped working, as it's akin to brain surgery on a live, moving patient. :)
Yes, I did have some casualties while developing the process... ;)
Actually, it does work. Just after I sent the mail I re-read the configuration section of the 6.8.0 docs and found that DAGMAN_ALLOW_EVENTS = 5 basically disables all of the DAG job accounting trickery.
With this setting the surgery works nicely and smoothly.
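For reference, the setting mentioned above is a sketch of what I put in the condor_config; the value 5 is the one from the 6.8.0 manual's bitmask description:

```
# Relax DAGMan's job-log sanity checks so externally re-run jobs
# do not abort the DAG. See the 6.8.0 manual for what each bit
# of the DAGMAN_ALLOW_EVENTS bitmask permits.
DAGMAN_ALLOW_EVENTS = 5
```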

But since this is a real hack of a solution, let me describe my problem in detail:
> If the issue is jobs which fail sometimes due to factors outside your control, but which succeed if re-submitted, then why not use DAGMan's RETRY feature?
The problem with the retry feature is that sometimes the output of a job has to be checked by the user to decide whether the calculation was successful. A small problem can get past the software's error-checking mechanisms, and it is the user who has to be able to re-submit the job. For example, a machine may temporarily be unable to access a file or database because of a network problem, or a DAGMan job that depends on pre-calculated data may accidentally run just before that data is completed by another job. (A real-world example from this very day. :))
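For comparison, the RETRY feature Peter suggests is a one-liner per node in the DAG file; a minimal sketch with made-up node and submit-file names:

```
# Hypothetical DAG file: automatically retry node B up to 3 times
# if it exits with a failure code. This only helps when no human
# needs to inspect the output first.
JOB A a.submit
JOB B b.submit
PARENT A CHILD B
RETRY B 3
```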

What I'd like to do is restart a completed job (one that was submitted by DAGMan) with all of its parent (and optionally child) dependencies restarted as well. Let's say Job A generates some data that Job B uses and then deletes once it completes. If I want to restart Job B, I need to run Job A too (to regenerate the data), and only after Job A has completed can Job B execute and run successfully.
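The A/B dependency above would look like this in a DAG file (node and submit-file names are invented for illustration):

```
# Job A produces data that Job B consumes and deletes on completion,
# so re-running B alone can never succeed: A must run first.
JOB A generate_data.submit
JOB B process_data.submit
PARENT A CHILD B
```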

Now the tricky part is this: if Job A calculates the data locally and modifies the submit file of Job B to tell it where to look for that data, then simply restarting Job B does not work, because the job in the queue is no longer in sync with the submit file. So when Job A is rerun, it should not only modify the submit file of Job B (just for the record, since it is not resubmitted again) but also modify the attributes of Job B in the queue.
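A sketch of the queue-side half of that synchronization, using condor_qedit; the job id, the DataPath attribute name, and the path are all made-up examples, and in practice the script run after Job A would know Job B's real cluster id and attribute:

```
# Patch the already-queued Job B (cluster 123, proc 0) so its
# ClassAd matches the rewritten submit file. DataPath and the
# path value are hypothetical; string values need inner quotes.
condor_qedit 123.0 DataPath '"/scratch/nodeA/output"'
```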

This is what I'd like to achieve without hacking DAGMan's settings with DAGMAN_ALLOW_EVENTS.

Cheers,
Szabolcs

P.S. I still don't understand why a condor_restart command for jobs does not exist. To restart an already completed job I have to use condor_hold and condor_release every time, and on Windows this sometimes has the side effect of sending the job into the removed state.
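For the record, the two-step workaround described above looks like this (the job id 123.0 is an example):

```
# There is no single command to re-run a job, so hold the
# completed-but-still-queued job and release it again, which
# puts it back in the idle state to be matched and run.
condor_hold 123.0
condor_release 123.0
```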



Peter F. Couvares wrote:
Horvátth,

I'm not sure I understand what you're doing -- but I'm not surprised it stopped working, as it's akin to brain surgery on a live, moving patient. :)

If the issue is jobs which fail sometimes due to factors outside your control, but which succeed if re-submitted, then why not use DAGMan's RETRY feature?

If that's not sufficient, please describe the problem in a little more detail. I'm optimistic there's a better solution than using condor_qedit. DAGMan's underlying implementation is obviously subject to change, so relying on a script which circumvents the supported API & semantics is going to be fragile.

-Peter