
Re: [HTCondor-users] DAGMAN Workflow Assertion ERROR



Hey Curtis,

After looking at the code, this issue appears to be caused by internal changes made to DAGMan within the last few years, not by changes to how DONE itself works. I will chat with the dev team about fixing this behavior, but in the meantime a workaround is to make DAGMan think there is a rescue file.

To do this, create a file named <dag file name>.rescue001, which in your case is test.dag.rescue001. In this file, add the line(s) DONE <Node Name>, which in your case is DONE A. Then run condor_submit_dag test.dag. Do not use -f / -force, because that will delete your created rescue file and then run all DAG jobs normally. This is a bit hacky, but it was the only solution I could find until adding DONE to the JOB line is fixed to work as intended.
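
If you have several DAGs or nodes to handle, the rescue-file step above could be scripted. The following is only a minimal sketch, not part of HTCondor: the helper name `write_rescue_file` and its parameters are made up for illustration, and it assumes the rescue file needs nothing more than one `DONE <Node Name>` line per node, as described above.

```python
from pathlib import Path


def write_rescue_file(dag_path, done_nodes, number=1):
    """Create <dag_path>.rescueNNN with one 'DONE <node>' line per node.

    Hypothetical helper: the name and signature are illustrative only.
    """
    rescue = Path(f"{dag_path}.rescue{number:03d}")
    # Refuse to clobber a rescue file that DAGMan (or you) already wrote.
    if rescue.exists():
        raise FileExistsError(f"{rescue} already exists; not overwriting")
    rescue.write_text("".join(f"DONE {node}\n" for node in done_nodes))
    return rescue


# Example for the DAG in this thread:
#   write_rescue_file("test.dag", ["A"])   # creates test.dag.rescue001
# Then submit as usual, without -f / -force:
#   condor_submit_dag test.dag
```

The helper refuses to overwrite an existing rescue file so that a real rescue DAG from an earlier failed run is not lost by accident.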

Best of luck,
Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Curtis Spencer <curtis.spencer@xxxxxxxxxx>
Sent: Thursday, August 11, 2022 6:37 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] DAGMAN Workflow Assertion ERROR
 
I recently upgraded my HTCondor cluster from 8.6.12 to 9.10.0. I have a DAG file, test.dag, that looks like this:

```
JOB  A  test.sub DONE
JOB  B  test.sub
JOB  C  test.sub
JOB  D  test.sub
PARENT A CHILD B C
PARENT B C CHILD D

SCRIPT PRE  A  pre.sh
SCRIPT POST  A  post.sh
```

With version 8.6.12, running `condor_submit_dag test.dag` would execute just nodes B, C, and D.

But with version 9.10.0, the entire DAG is stuck in idle. Looking at `test.dag.dagman.out`, I see `ERROR "Assertion ERROR on (GetStatus() != STATUS_DONE)" at line 749 in file ./src/condor_dagman/`.

If I remove `DONE` from the first JOB in `test.dag`, everything runs fine.

The documentation says "Users should generally not use the DONE keyword." and to use NOOP instead (https://htcondor.readthedocs.io/en/latest/users-manual/dagman-workflows.html#job). But I don't see anything about the behavior of `DONE` changing between these two versions. And since DAGMan still uses it, I wouldn't think that using it would result in the Assertion ERROR being thrown.

I don't want to use NOOP because I don't want the PRE and POST scripts to run, and I don't want to have to manually comment out all of the PRE and POST scripts.

Is there a way to get `test.dag` to run when JOB A is marked as DONE?

Thanks,

Curtis