
Re: [HTCondor-users] DAGMAN Workflow Assertion ERROR

Hey Curtis,

I wasn't sure how long this bug fix would take, because the issue looked like it might need some careful reworking of the code. However, after some digging around, I was able to implement a seemingly simple fix. It still needs testing and code review, but the actual fix shouldn't take too long.


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Curtis Spencer via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Monday, August 15, 2022 6:29 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Subject: Re: [HTCondor-users] DAGMAN Workflow Assertion ERROR
Hi Cole,

Thanks for taking a look at this and confirming the bug. And thanks for the detailed explanation of a workaround! I tested the rescue dag and that worked for me. However, I ended up using NOOP after all as I was able to change the script that generates the .dag file to only output the PRE and POST scripts, VARS, etc. when the JOB isn't a NOOP.
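For illustration, here is a minimal sketch of that kind of generator change. It is not the actual script from this thread: the `render_dag` function and the node dictionary keys (`name`, `submit`, `noop`, `pre`, `post`, `vars`) are hypothetical, and only the `JOB`, `SCRIPT PRE`, `SCRIPT POST`, and `VARS` line formats come from the DAG file syntax.

```python
def render_dag(nodes):
    """Return DAG file text for a list of node dicts (hypothetical schema).

    NOOP nodes get only their JOB line; PRE/POST scripts and VARS are
    emitted only for nodes that will actually run.
    """
    lines = []
    for node in nodes:
        name = node["name"]
        job_line = f"JOB {name} {node['submit']}"
        if node.get("noop"):
            # Skip SCRIPT and VARS lines so nothing runs for this node.
            lines.append(job_line + " NOOP")
            continue
        lines.append(job_line)
        if "pre" in node:
            lines.append(f"SCRIPT PRE {name} {node['pre']}")
        if "post" in node:
            lines.append(f"SCRIPT POST {name} {node['post']}")
        for key, value in node.get("vars", {}).items():
            lines.append(f'VARS {name} {key}="{value}"')
    return "\n".join(lines) + "\n"
```

With node A marked `noop` and nodes B, C, and D left as normal jobs, the output contains `JOB A test.sub NOOP` and no `SCRIPT` lines for A.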

How long does it typically take for bugs like this to be fixed?



On Mon, Aug 15, 2022 at 7:34 AM Cole Bollig via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
Hey Curtis,

After looking at the code, this issue appears to be caused by internal changes made to DAGMan within the last few years, not by changes to how DONE itself works. I will chat with the dev team about fixing this behavior, but in the meantime, a workaround is to make DAGMan think there is a rescue file.

To do this you can create a file named <dag file name>.rescue001 which in your case is test.dag.rescue001. In this file just add the line(s) DONE <Node Name> which in your case is DONE A. Then just run condor_submit_dag test.dag. Do not use -f / -force because that will delete your created rescue file and then run all DAG jobs normally. This is a bit hacky but was the only solution I could find until adding DONE to the JOB line is fixed to work as intended.
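If you would rather script this step, a minimal sketch could look like the following. The helper name `write_rescue_file` is hypothetical; it only writes the `DONE <Node Name>` lines described above and does not reproduce anything else DAGMan may put in a rescue file it generates itself.

```python
from pathlib import Path


def write_rescue_file(dag_file, done_nodes):
    """Write <dag_file>.rescue001 with one 'DONE <node>' line per node."""
    rescue_path = Path(f"{dag_file}.rescue001")
    lines = [f"DONE {node}" for node in done_nodes]
    rescue_path.write_text("\n".join(lines) + "\n")
    return rescue_path
```

Calling `write_rescue_file("test.dag", ["A"])` creates test.dag.rescue001 containing the single line DONE A. After that, run condor_submit_dag test.dag as usual, without -f / -force.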

Best of luck,
Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Curtis Spencer <curtis.spencer@xxxxxxxxxx>
Sent: Thursday, August 11, 2022 6:37 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] DAGMAN Workflow Assertion ERROR
I recently upgraded my HTCondor cluster from 8.6.12 to 9.10.0. I have a DAG file, test.dag, that looks like this:

JOB  A  test.sub DONE
JOB  B  test.sub
JOB  C  test.sub
JOB  D  test.sub

SCRIPT PRE  A  pre.sh
SCRIPT POST  A  post.sh

Under version 8.6.12, running `condor_submit_dag test.dag` would execute just nodes B, C, and D.

But under version 9.10.0, the entire DAG is stuck in idle. Looking at `test.dag.dagman.out`, I see `ERROR "Assertion ERROR on (GetStatus() != STATUS_DONE)" at line 749 in file ./src/condor_dagman/`.

If I remove `DONE` from the first JOB in `test.dag`, everything runs fine.

The documentation says "Users should generally not use the DONE keyword." and to use NOOP instead (https://htcondor.readthedocs.io/en/latest/users-manual/dagman-workflows.html#job). But I don't see anything about the behavior of `DONE` changing between these two versions. And since DAGMan still uses it, I wouldn't think that using it would result in the Assertion ERROR being thrown.

I don't want to use NOOP because I don't want the PRE and POST scripts to run, and I don't want to have to manually comment out all of the PRE and POST scripts.

Is there a way to get `test.dag` to run when JOB A is marked as DONE?


