I wasn't sure how long this bug fix was going to take, because the issue made it look like the fix might need some finessing of the code. However, after some digging around, I was able to implement something seemingly simple. Testing and code review still need to occur, but the actual fix shouldn't take too long.
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Curtis Spencer via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Monday, August 15, 2022 6:29 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Subject: Re: [HTCondor-users] DAGMAN Workflow Assertion ERROR
Thanks for taking a look at this and confirming the bug. And thanks for the detailed explanation of a workaround! I tested the rescue dag and that worked for me. However, I ended up using NOOP after all as I was able to change the script that generates
the .dag file to only output the PRE and POST scripts, VARS, etc. when the JOB isn't a NOOP.
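For anyone wanting to do the same, here is a minimal sketch of that kind of generator, not the actual script used here. The function name `emit_node`, the `run`/`noop` mode values, and the output file name `generated.dag` are all hypothetical; the node names, submit file, and script names are taken from the test.dag in the original question.

```shell
#!/bin/sh
# Hypothetical generator sketch: emit a DAG node, skipping the SCRIPT
# lines when the node is a NOOP.

# emit_node NAME SUBMIT_FILE MODE [PRE_SCRIPT] [POST_SCRIPT]
# MODE is "run" or "noop" (values made up for this sketch).
emit_node() {
    name=$1; submit=$2; mode=$3; pre=${4:-}; post=${5:-}
    if [ "$mode" = "noop" ]; then
        # NOOP node: write only the JOB line, no PRE/POST scripts.
        printf 'JOB %s %s NOOP\n' "$name" "$submit"
        return 0
    fi
    printf 'JOB %s %s\n' "$name" "$submit"
    if [ -n "$pre" ]; then printf 'SCRIPT PRE %s %s\n' "$name" "$pre"; fi
    if [ -n "$post" ]; then printf 'SCRIPT POST %s %s\n' "$name" "$post"; fi
}

# Write to generated.dag (hypothetical name) so an existing test.dag
# is not overwritten.
{
    emit_node A test.sub noop pre.sh post.sh
    emit_node B test.sub run
    emit_node C test.sub run
    emit_node D test.sub run
    printf 'PARENT A CHILD B C\n'
    printf 'PARENT B C CHILD D\n'
} > generated.dag
```

Because node A is passed as `noop`, its pre.sh and post.sh arguments are ignored and no SCRIPT lines end up in the file.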
How long does it typically take for bugs like this to be fixed?
After looking at the code, this issue appears to be caused by internal changes made to DAGMan within the last few years, not by changes to how DONE itself works. I will chat with the dev team about fixing this behavior, but in the meantime
there is a workaround: make DAGMan think there is a rescue file.
To do this you can create a file named <dag file name>.rescue001 which in your case is test.dag.rescue001. In this file just add the line(s) DONE <Node Name> which in your case is DONE A. Then just run condor_submit_dag test.dag. Do not use -f / -force because
that will delete your created rescue file and then run all DAG jobs normally. This is a bit hacky but was the only solution I could find until adding DONE to the JOB line is fixed to work as intended.
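As a rough sketch, the steps above could look like this in a shell, assuming you are in the directory that holds test.dag. The variable names are only for illustration, and the script stops short of submitting so you can check the file first.

```shell
#!/bin/sh
# Sketch of the rescue-file workaround described above.
dag_file=test.dag
rescue_file="${dag_file}.rescue001"

# Mark node A as already done; add one "DONE <Node Name>" line per node.
printf 'DONE A\n' > "$rescue_file"

# Submit afterwards WITHOUT -f / -force, which would delete the rescue file.
echo "Wrote $rescue_file; next run: condor_submit_dag $dag_file"
```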
Best of luck,
I recently upgraded my HTCondor cluster from 8.6.12 to 9.10.0. I have a DAG file, test.dag, that looks like this:
JOB A test.sub DONE
JOB B test.sub
JOB C test.sub
JOB D test.sub
PARENT A CHILD B C
PARENT B C CHILD D
SCRIPT PRE A pre.sh
SCRIPT POST A post.sh
With version 8.6.12, running `condor_submit_dag test.dag` would execute just nodes B, C, and D.
But with version 9.10.0, the entire DAG is stuck in idle. Looking at `test.dag.dagman.out`, I see `ERROR "Assertion ERROR on (GetStatus() != STATUS_DONE)" at line 749 in file ./src/condor_dagman/`.
If I remove `DONE` from the first JOB in `test.dag`, everything runs fine.
I don't want to use NOOP because I don't want the PRE and POST scripts to be run, and I don't want to have to manually comment out all of the PRE and POST scripts.
Is there a way to get `test.dag` to run when JOB A is marked as DONE?