Thank you for your reply!
The failed jobs are not queued anymore; they have crashed (in this case, due to insufficient disk space for their output). If the jobs were still running, I could have held and then released them to solve the problem. The question is whether I can tell HTCondor to run just those failed jobs again, given that they have crashed and are no longer running.
When you say the jobs "crashed", the important issue to DAGMan is
whether the job exited with a zero or non-zero exit code. If a DAG
node job exits with a non-zero exit code, DAGMan considers the
node to have failed. It will not run any nodes that depend on a
failed node, but it will continue to run independent nodes until
it cannot make more progress. After fixing what failed, DAGMan
can be re-run, and it will run just the failed nodes and their
descendants.
If, however, the job exits with a zero exit code (in the absence of a postscript), DAGMan assumes the job has succeeded and continues running dependent jobs.
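As a concrete sketch, suppose you have a two-node DAG (the file and node names here are hypothetical examples, not taken from your setup):

```
# mydag.dag -- hypothetical two-node DAG
JOB A a.sub
JOB B b.sub
PARENT A CHILD B
```

If node A exits non-zero, DAGMan writes a rescue DAG (e.g. mydag.dag.rescue001) recording which nodes already succeeded. After you fix the disk-space problem, resubmitting the same DAG with `condor_submit_dag mydag.dag` should pick up the most recent rescue file automatically and rerun only the failed nodes and their descendants, skipping nodes that already completed successfully.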