
Re: [Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0



Peter F. Couvares wrote:
Better yet, the easiest thing to do now is just to specify a POST script which, when it sees that the job failed, sleeps until it sees some special file appear, and then uses that file's (integer) contents as its own return code. Combined with a RETRY, this would allow a human to decide whether the node should succeed or retry (by writing a 0 or 1 to the special file, respectively), and then let DAGMan do the rest.
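A minimal sketch of the suggested gate, written here as a shell function for illustration — the function name, decision-file convention, and sleep interval are my own assumptions, not anything DAGMan prescribes:

```shell
# Hypothetical POST-script logic: pass a successful job straight
# through; otherwise block until a human writes 0 (succeed) or a
# non-zero value (fail, which triggers RETRY) into a "decision"
# file, and use that integer as our own exit status.
post_gate() {
    job_rc="$1"          # the node job's exit code ($RETURN in DAGMan)
    decision_file="$2"   # file a human creates to decide the outcome

    # Job succeeded: nothing for a human to decide.
    if [ "$job_rc" -eq 0 ]; then
        return 0
    fi

    # Job failed: wait until the decision file appears.
    while [ ! -f "$decision_file" ]; do
        sleep 10
    done

    # The file's integer contents become the POST script's return code.
    return "$(cat "$decision_file")"
}
```

In a real DAG this logic would live in a standalone script attached to each node, e.g. `SCRIPT POST NodeA post_gate.sh $RETURN NodeA.decision` together with `RETRY NodeA 3` (node name, script name, and retry count assumed) — which is exactly the per-node overhead objected to below.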
But that would require adding a sleeping POST script to every job in the DAG, because at submission time I don't know which one will fail. So for a DAG of 3000 jobs I'd get 3000 additional POST scripts waiting for user input that is only actually needed for, let's say, 1 job out of the 3000. As I said, in this case the failure cannot really be analyzed by scripts.

Obviously this too is a short-term hack until we can give you something better -- but it's MUCH simpler and more robust than your current approach.
Ok, I think I could not make myself perfectly clear with my example. If I had some automatic way to tell that a job failed (or probably failed) without actually seeing the result with my own eyes, I would have used a simpler approach. And because the execution of the child job is based on the data produced by the parent, by the time the POST script finally runs I no longer have any control over a previously completed DAG node -- at least none that I know of. So I can't say "ok, if this Job B failed, please re-run Job A, then re-read its submit file, and then start again", because I don't have that kind of control. Actually, nothing in DAGMan can be controlled from the outside: you can't add a new job, remove an existing one, create new dependencies, or mark a job completed. All user input is handled through tools that modify the queue (hold, release, remove) and only indirectly affect DAGMan.


Cheers,
Szabolcs