
Re: [Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0



On Aug 17, 2006, at 2:28 PM, Horvátth Szabolcs wrote:
Peter F. Couvares wrote:
Better yet, the easiest thing to do now is just to specify a POST script which, when it sees that the job failed, sleeps until it sees some special file appear, and then uses that file's (integer) contents as its own return code. Combined with a RETRY, this would allow a human to decide whether the node should succeed or retry (by writing a 0 or 1 to the special file, respectively), and then let DAGMan do the rest.

But that would require adding a sleeping POST script to every job in the DAG, because at submission time I don't know which one will fail. So for a DAGMan job of 3000 jobs I'd get 3000 additional POST scripts waiting for the user's input (which is only required for, let's say, 1 out of 3000).

No, the POST script should "pause" only if the job fails, and otherwise propagate the job's successful return code. So successful jobs in fact require no human intervention -- but failed jobs get the benefit of human "confirmation", as you wish. For example, the POST script could be as follows:

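#!/bin/sh
# magic_pausing_POST_script.sh
# Usage: magic_pausing_POST_script.sh <job return value> <node name>
job_retval=$1
node_name=$2
if [ "$job_retval" -ne 0 ]; then
  echo "$node_name failed; waiting for human intervention"
  special_filename="please_continue.$node_name"
  # Poll once a minute until a human creates the special file.
  while [ ! -f "$special_filename" ]; do
    sleep 60
  done
  # The file's integer contents become this script's exit code.
  new_retval=$(cat "$special_filename")
  echo "$special_filename found! Horvátth decided that $node_name should have returned $new_retval"
  rm "$special_filename"
  exit "$new_retval"
fi
# The job succeeded, so pass the success straight through.
exit 0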

As long as you specify your DAG like so:

JOB foo foo.sub
SCRIPT POST foo magic_pausing_POST_script.sh $RETURN $JOB
RETRY foo 10

...then the POST script will "know" the job's return code, so it can continue if the job succeeds and pause only if it fails -- and if it pauses, a human gets to decide if it *really* failed and needs to be retried or whether it succeeded, and DAGMan will take care of it.
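For the human side of this, here is a rough sketch of a helper for writing the decision file. The name release_node is made up for illustration (it is not part of Condor or DAGMan), and it assumes you run it in the same directory where the POST script is polling for please_continue.<node name>. It writes to a temporary name and then renames, so the waiting script should never read a half-written file.

```shell
#!/bin/sh
# release_node: hypothetical helper, not part of Condor or DAGMan.
# Tells a paused POST script which exit code to return for a node.
#   release_node <node name> <0|1>
# 0 means "treat the node as succeeded"; 1 means "fail it, so RETRY applies".
release_node() {
  node_name=$1
  decision=$2
  case "$decision" in
    0|1) ;;
    *) echo "usage: release_node <node name> <0|1>" >&2; return 2 ;;
  esac
  # Write under a temporary name, then rename into place, so the polling
  # POST script only ever sees a complete file.
  tmp_file=".please_continue.$node_name.tmp"
  echo "$decision" > "$tmp_file" && mv "$tmp_file" "please_continue.$node_name"
}
```

With that defined, "release_node foo 1" would ask for node foo to be retried, and "release_node foo 0" would let it count as a success. The paused POST script picks up the file on its next poll, so expect up to a minute of delay.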

Again, this is a hack -- but it's a pretty simple and robust one as far as hacks go, and should solve your immediate problem until DAGMan has a runtime API for "brain surgery".

-Peter

--
Peter Couvares                        University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
pfc@xxxxxxxxxxx                       1210 W. Dayton St. Rm #4241
(608) 265-8936                        Madison, WI 53706-1685