
Re: [Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0



Peter F. Couvares wrote:
Better yet, the easiest thing to do now is just to specify a POST script which, when it sees that the job failed, sleeps until it sees some special file appear, and then uses that file's (integer) contents as its own return code. Combined with a RETRY, this would allow a human to decide whether the node should succeed or retry (by writing a 0 or 1 to the special file, respectively), and then let DAGMan do the rest.
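A minimal sketch of the suggested gate, written here as a shell function for illustration — the function name, decision-file convention, and sleep interval are my own assumptions, not anything DAGMan prescribes:

```shell
# Hypothetical POST-script logic: pass a successful job straight
# through; otherwise block until a human writes 0 (succeed) or a
# non-zero value (fail, which triggers RETRY) into a "decision"
# file, and use that integer as our own exit status.
post_gate() {
    job_rc="$1"          # the node job's exit code ($RETURN in DAGMan)
    decision_file="$2"   # file a human creates to decide the outcome

    # Job succeeded: nothing for a human to decide.
    if [ "$job_rc" -eq 0 ]; then
        return 0
    fi

    # Job failed: wait until the decision file appears.
    while [ ! -f "$decision_file" ]; do
        sleep 10
    done

    # The file's integer contents become the POST script's return code.
    return "$(cat "$decision_file")"
}
```

In a real DAG this logic would live in a standalone script attached to each node, e.g. `SCRIPT POST NodeA post_gate.sh $RETURN NodeA.decision` together with `RETRY NodeA 3` (node name, script name, and retry count assumed) — which is exactly the per-node overhead objected to below.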
But that would require adding a sleeping POST script to every job in the DAG, because at submission time I don't know which one will fail. So for a DAG of 3000 jobs I'd get 3000 additional POST scripts waiting for user input that is only actually needed for, let's say, 1 job out of the 3000. As I said, in this case the failure cannot really be analyzed by scripts.

Obviously this too is a short-term hack until we can give you something better -- but it's MUCH simpler and more robust than your current approach.
Ok, I think I could not make myself perfectly clear with my example. If I had some automatic way to tell that a job failed (or probably failed) without actually seeing the result with my own eyes, I would have used a simpler approach. And because the execution of the child job is based on the data produced by the parent, by the time the POST script finally runs I no longer have any control over a previously completed DAG node -- at least none that I know of. So I can't say "ok, if this Job B failed, please re-run Job A, then re-read its submit file, and then start again", because I don't have that kind of control. Actually, nothing in DAGMan can be controlled from the outside: you can't add a new job, remove an existing one, create new dependencies, or mark a job completed. All user input is handled through tools that modify the queue (hold, release, remove) and only indirectly affect DAGMan.


Cheers,
Szabolcs