
Re: [Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0



On 8/17/06, Horvátth Szabolcs <szabolcs@xxxxxxxxxxxxx> wrote:
Peter F. Couvares wrote:
> Better yet, the easiest thing to do now is just to specify a POST
> script which, when it sees that the job failed, sleeps until it sees
> some special file appear, and then uses that file's (integer) contents
> as its own return code.  Combined with a RETRY, this would allow a
> human to decide whether the node should succeed or retry (by writing a
> 0 or 1 to the special file, respectively), and then let DAGMan do the
> rest.
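The POST-script idea Peter describes could be sketched roughly as follows. This is a minimal sketch only: the verdict-file name and the helper function are invented for illustration, not DAGMan conventions. What DAGMan does guarantee is that a POST script exiting 0 marks the node successful, while a non-zero exit fails the node and triggers a RETRY if one is configured (e.g. `SCRIPT POST NodeA post_verdict.py $RETURN` and `RETRY NodeA 10` in the .dag file).

```python
# Sketch of a "wait for a human verdict" POST script helper.
# Assumption: a human writes "0" (declare success) or "1" (fail and
# let RETRY re-run the node) into the verdict file.
import os
import time

def wait_for_verdict(path, poll_seconds=1.0):
    """Block until `path` appears, then return its integer contents,
    intended to be used as the POST script's own exit code."""
    while not os.path.exists(path):
        time.sleep(poll_seconds)
    with open(path) as f:
        return int(f.read().strip())
```

The POST script itself would check its `$RETURN` argument first and exit 0 immediately if the job succeeded, calling `wait_for_verdict()` (and exiting with its result) only on failure, so the human is consulted only for the nodes that actually fail.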
But it would require adding a sleeping POST script to all of the jobs in
the DAG, because at submission time I don't know which one will fail. So
for a DAGMan job of 3000 jobs I'd get 3000 additional POST scripts
waiting for user input (which is only actually needed for, let's say, 1
out of 3000). As I said, in this case the failure can't really be
analyzed by scripts.

> Obviously this too is a short-term hack until we can give you
> something better -- but it's MUCH simpler and more robust than your
> current approach.
Ok, I think I didn't make myself perfectly clear with my example. If
I had some automatic way to tell that a job failed (or probably failed)
without actually seeing the result with my own eyes, I would have used
a simpler approach.
And because the child job's execution is based on the data produced by
the parent, by the time the POST script finally runs I no longer have
control over a previously completed DAG task. At least none that I
know of. So I can't say "ok, this Job B failed, so please re-run
Job A, then re-read your submit file, and then start again", because I
don't have that kind of control.
Actually, nothing can be controlled in DAGMan: you can't add a new job,
remove an existing one, create new dependencies, or mark a job completed.
All user input is handled through tools that modify the queue (hold,
release, remove) and only indirectly affect DAGMan.

Ah - I see. What you wish to do is retrospectively "rewrite history"
and "rewind", based on spotting that a job in your chain has gone wrong
and should be run again.

With the increasing complexity of Condor I see little likelihood of
any direct queue hacks having much stability.

In theory you could hack the DAGMan processor so that it could be
stalled (literally have it poll some special file that lets you say
"stop until this no longer exists").
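Such a stall hook might look like the following sketch. To be clear, DAGMan has no such file-polling hook built in; the flag-file name and function are invented purely to illustrate the idea.

```python
# Hypothetical stall hook: spin while a flag file exists, resume when
# it is removed. Nothing like this exists in DAGMan itself.
import os
import time

def stall_while_present(flag_path, poll_seconds=5.0):
    """Block for as long as `flag_path` exists; return once it is gone."""
    while os.path.exists(flag_path):
        time.sleep(poll_seconds)
```

A patched DAGMan main loop would call this before dispatching each ready node, giving a human a window in which to intervene.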

At this point you could either:

Terminate the DAG and use the recovery-DAG process to create the
remainder of the DAG, manually putting back the nodes that you wish to
re-run.

Get very clever and have the DAG rebuild itself internally. God knows
how complex this would be - is DAGMan entirely in Perl, or C as well? I
have never bothered to look.

Stop trying to use DAGMan and write your own meta-scheduler which
supports what you want it to do.

None of these are nice...

I cannot help but think that your effort might be better spent trying
to programmatically spot the invalid jobs, though this may well be
hard. I take it it's something like an anneal/optimize/Monte Carlo type
effort where the output is not guaranteed to converge?

Matt