Hello,

Situation: I have a large DAG of jobs that is currently running. A few jobs have failed, but most of the jobs in the DAG are still running. From the log files, I have identified the problem and fixed it.

Is there a way to tell HTCondor to retry the failed nodes (and all of their CHILD nodes, of course) without killing any of the currently running jobs in the same DAG, and without waiting for the whole DAG to fail (and generate a rescue file)?

From the documentation on condor_submit_dag, the following command looks like a good candidate (I have sub-DAGs):

    condor_submit_dag -DoRecovery -do_recurse submit_file.dag

Please let me know if that is what I should do.

Thank you very much for your help,
Siarhei.